Unicode
Unicode is a standard that assigns a numeric code to every character of writing systems from all over the world.
This page compiles some facts about Unicode that Toaqists may find useful to know.
Unicode vs. UTF-8
Unicode, the standard, dictates (for example) that the character ꝡ is represented by the number 42849, or A761 in hexadecimal.
These "codepoint numbers" are usually written as U+ followed by the hexadecimal number: U+A761.
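As a small illustration of the notation, the following Python sketch converts a character to its codepoint number and to the U+ form. The helper name `codepoint_label` is made up for this example.

```python
def codepoint_label(ch: str) -> str:
    """Format one character as U+XXXX, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"

w = "\ua761"  # the letter ꝡ, written as an escape to keep the source ASCII

print(ord(w))              # 42849
print(codepoint_label(w))  # U+A761
```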
UTF-8, an encoding, dictates how to encode that number across bytes in a file: it says U+A761 is ea 9d a1.
There are other encodings of Unicode, but they are not as commonly used. For example, in big-endian UTF-16, the encoding of U+A761 is simply a7 61 (little-endian UTF-16 stores the same two bytes in the opposite order).
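The byte sequences can be checked with Python's built-in codecs. This is only a sketch; note that UTF-16 comes in two byte orders, and that Python's plain `"utf-16"` codec additionally prepends a byte order mark, so the explicit `-be` and `-le` variants are used here.

```python
w = "\ua761"  # the letter ꝡ

utf8 = w.encode("utf-8")
utf16_be = w.encode("utf-16-be")  # big-endian: most significant byte first
utf16_le = w.encode("utf-16-le")  # little-endian: least significant byte first

print(utf8.hex(" "))      # ea 9d a1
print(utf16_be.hex(" "))  # a7 61
print(utf16_le.hex(" "))  # 61 a7
```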
Combining characters and normalization
A letter with a diacritic like é can be represented as a precomposed form (é U+00E9 latin small letter e with acute) or as a sequence of a base letter and combining characters (e U+0065 latin small letter e and ◌́ U+0301 combining acute accent).
Unicode text may be normalized to smooth over these differences: either by precomposing everything as much as possible (normalization form C or NFC) or by decomposing everything into combining characters (normalization form D or NFD).
Normalization also pins down the order of combining characters. Underdots come before hats. The string é + underdot has NFC ẹ + acute and NFD e + underdot + acute.
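The é + underdot example can be reproduced with Python's standard `unicodedata` module. The helper `codepoints` is a made-up name used only to make the results readable.

```python
import unicodedata

def codepoints(text: str) -> str:
    """List the codepoints of a string in U+XXXX notation."""
    return " ".join(f"U+{ord(c):04X}" for c in text)

s = "\u00e9\u0323"  # precomposed é followed by a combining dot below

nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)

print(codepoints(nfc))  # U+1EB9 U+0301       (ẹ + acute)
print(codepoints(nfd))  # U+0065 U+0323 U+0301 (e + underdot + acute)

# Typing the marks in the other order gives the same normalized result.
assert unicodedata.normalize("NFD", "e\u0301\u0323") == nfd
```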
Dotless ı and normalization
The letter í decomposes into i + acute, not ı + acute. Placing diacritics on a dotless ı may produce wrong-looking results. Compare:
| Letter | NFC | NFD |
|---|---|---|
| ı | ı | ı |
| î | î | i + circumflex |
| ı̣ | ı + underdot | ı + underdot |
| ị̂ | ị + circumflex | i + underdot + circumflex |
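The behaviour in the table can be verified with `unicodedata`. In this sketch the short wrappers `nfc` and `nfd` are just conveniences for the example.

```python
import unicodedata

def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)

def nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)

DOTLESS_I = "\u0131"  # ı
ACUTE = "\u0301"
CIRCUMFLEX = "\u0302"
UNDERDOT = "\u0323"

# í decomposes to a dotted i plus acute, never to dotless ı plus acute.
print(nfd("\u00ed") == "i" + ACUTE)             # True

# Dotless ı never composes with a diacritic: both forms leave it untouched.
print(nfc(DOTLESS_I + ACUTE) == DOTLESS_I + ACUTE)  # True
print(nfd(DOTLESS_I + UNDERDOT) == DOTLESS_I + UNDERDOT)  # True

# Dotted i + underdot + circumflex composes only as far as ị (U+1ECB).
print(nfc("i" + UNDERDOT + CIRCUMFLEX) == "\u1ecb" + CIRCUMFLEX)  # True
```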
Precomposed tone–underdot combos
Not all Toaq tone–underdot combos have precomposed characters. This table shows precomposed characters in green and NFC forms in red:
| a | ạ | ạ́ | ạ̈ | ậ |
|---|---|---|---|---|
| u | ụ | ụ́ | ụ̈ | ụ̂ |
| ı | ı̣ | ị́ | ị̈ | ị̂ |
| o | ọ | ọ́ | ọ̈ | ộ |
| e | ẹ | ẹ́ | ẹ̈ | ệ |
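Which combos are precomposed can be worked out programmatically: a combo has a precomposed character exactly when its NFC form is a single codepoint. The sketch below assumes the three tone marks shown in the table (acute, diaeresis, circumflex) and uses a dotted i as the base for the ı row, since dotless ı never composes. The name `is_precomposed` is made up for this example.

```python
import unicodedata

UNDERDOT = "\u0323"
TONES = {"acute": "\u0301", "diaeresis": "\u0308", "circumflex": "\u0302"}

def is_precomposed(base: str, *marks: str) -> bool:
    """True if base + marks composes into a single codepoint under NFC."""
    return len(unicodedata.normalize("NFC", base + "".join(marks))) == 1

for vowel in "auioe":
    with_tone = [name for name, mark in TONES.items()
                 if is_precomposed(vowel, UNDERDOT, mark)]
    print(vowel, is_precomposed(vowel, UNDERDOT), with_tone)

# a True ['circumflex']
# u True []
# i True []
# o True ['circumflex']
# e True ['circumflex']
```

So every dotted vowel has a precomposed underdot form, but only a, o and e have a precomposed underdot + circumflex form; every other tone–underdot combo needs at least one combining character even in NFC.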
Paradoxically, depending on the font and operating system, the "abnormal" forms (like é + underdot) may show up more correctly. They are demonstrated in the table below:
| a | ạ | ạ́ | ạ̈ | ậ |
|---|---|---|---|---|
| u | ụ | ụ́ | ụ̈ | ụ̂ |
| ı | ı̣ | ị́ | ị̈ | ị̂ |
| o | ọ | ọ́ | ọ̈ | ộ |
| e | ẹ | ẹ́ | ẹ̈ | ệ |
(MediaWiki normalizes page contents, meaning that the above table has had to use HTML entities to get the desired effect (e.g. ị́ for ị́). Template:T will do this for you.)
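For illustration, an "abnormal" form and its numeric HTML entities could be generated as follows. This is a rough sketch, not a description of how Template:T is implemented; the function names `abnormal_form` and `to_entities` are hypothetical.

```python
import unicodedata

UNDERDOT = "\u0323"

def abnormal_form(base: str, tone: str) -> str:
    """Precomposed base + tone, followed by a combining underdot (not NFC)."""
    return unicodedata.normalize("NFC", base + tone) + UNDERDOT

def to_entities(text: str) -> str:
    """Spell out every codepoint as a hexadecimal HTML entity."""
    return "".join(f"&#x{ord(c):X};" for c in text)

s = abnormal_form("e", "\u0301")  # é + underdot

print(to_entities(s))  # &#xE9;&#x323;

# Normalizing would turn it back into the NFC form ẹ + acute.
print(to_entities(unicodedata.normalize("NFC", s)))  # &#x1EB9;&#x301;
```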