Unicode
Unicode is a standard that assigns a numeric code to each character of writing systems from all over the world.
This page compiles some facts about Unicode that Toaqists may find useful to know.
Unicode vs. UTF-8
Unicode, the standard, dictates (for example) that the character ꝡ is represented by the number 42849, or A761 in hexadecimal.
These "codepoint numbers" are usually written as U+ followed by the hexadecimal number: U+A761.
UTF-8, an encoding, dictates how to encode that number across bytes in a file: it says U+A761 is ea 9d a1.
There are other encodings of Unicode, but they are not as commonly used. For example, in UTF-16 (big-endian byte order), the encoding of U+A761 is simply a7 61.
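The distinction between codepoint and encoding can be checked with a short Python sketch. It uses only built-in functions; the variable names are illustrative, and "utf-16-be" is chosen here to get the big-endian byte order without a byte order mark.

```python
# The character ꝡ, written as an escape so the source file's own encoding does not matter.
ch = "\ua761"

# The codepoint is a plain number assigned by the Unicode standard.
codepoint = ord(ch)
label = f"U+{codepoint:04X}"

# An encoding turns that number into bytes.
utf8_bytes = ch.encode("utf-8").hex(" ")
utf16_bytes = ch.encode("utf-16-be").hex(" ")

print(codepoint, label)   # 42849 U+A761
print(utf8_bytes)         # ea 9d a1
print(utf16_bytes)        # a7 61
```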
Combining characters and normalization
A letter with a diacritic like é can be represented as a precomposed form (é U+00E9 latin small letter e with acute) or as a sequence of a base letter and combining characters (e U+0065 latin small letter e and ◌́ U+0301 combining acute accent).
Unicode text may be normalized to smooth over these differences: either by precomposing everything as much as possible (normalization form C or NFC) or by decomposing everything into combining characters (normalization form D or NFD).
Normalization also pins down the order of combining characters. Underdots come before marks placed above the letter (acute accents, diaereses, hats). The string é + underdot has NFC ẹ + acute and NFD e + underdot + acute.
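As a minimal sketch, the same example can be reproduced with Python's standard unicodedata module. The helper name codepoints is made up for this illustration.

```python
import unicodedata

def codepoints(text):
    """List the codepoints of a string in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]

# Precomposed é (U+00E9) followed by a combining dot below (U+0323).
s = "\u00e9\u0323"

nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)

print(codepoints(s))    # ['U+00E9', 'U+0323']
print(codepoints(nfc))  # ['U+1EB9', 'U+0301']  (ẹ + acute)
print(codepoints(nfd))  # ['U+0065', 'U+0323', 'U+0301']  (e + underdot + acute)

# The ordering comes from the canonical combining class:
# marks below (220) sort before marks above (230).
print(unicodedata.combining("\u0323"), unicodedata.combining("\u0301"))  # 220 230
```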
Dotless ı and normalization
The letter í decomposes into i + acute, not ı + acute. Placing diacritics on a dotless ı may produce wrong-looking results. Compare:
Letter | NFC | NFD |
---|---|---|
ı | ı | ı |
î | î | i + circumflex |
ı̣ | ı + underdot | ı + underdot |
ị̂ | ị + circumflex | i + underdot + circumflex |
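The rows of this comparison can be recomputed with the following sketch, which assumes only Python's standard unicodedata module. The helper describe and the list samples are names invented for the example.

```python
import unicodedata

def describe(text):
    """Render a string as its codepoints, e.g. 'U+0069 + U+0302'."""
    return " + ".join(f"U+{ord(c):04X}" for c in text)

# The four letters compared above, written as escapes to avoid ambiguity.
samples = [
    "\u0131",        # dotless ı
    "\u00ee",        # î (precomposed i with circumflex)
    "\u0131\u0323",  # dotless ı + underdot
    "\u1ecb\u0302",  # ị + circumflex
]

table = {
    s: (unicodedata.normalize("NFC", s), unicodedata.normalize("NFD", s))
    for s in samples
}

for s, (nfc, nfd) in table.items():
    print(describe(s), "|", describe(nfc), "|", describe(nfd))
```

Dotless ı (U+0131) has no decomposition and never composes with a following diacritic, which is why its rows come out unchanged in both forms.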
Precomposed tone–underdot combos
Not all Toaq tone–underdot combos have precomposed characters. This table shows precomposed characters in green and NFC forms in red:
a | ạ | ạ́ | ạ̈ | ậ |
---|---|---|---|---|
u | ụ | ụ́ | ụ̈ | ụ̂ |
ı | ı̣ | ị́ | ị̈ | ị̂ |
o | ọ | ọ́ | ọ̈ | ộ |
e | ẹ | ẹ́ | ẹ̈ | ệ |
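Which combos are precomposed can be worked out mechanically: a combo has a precomposed character exactly when its NFC form is a single codepoint. The sketch below assumes the four tone marks shown in the table (none, acute, diaeresis, circumflex) and, mirroring the table, uses dotless ı only for the cell without a tone mark. The names nfc_form and precomposed are illustrative.

```python
import unicodedata

UNDERDOT = "\u0323"
TONES = {"none": "", "acute": "\u0301", "diaeresis": "\u0308", "circumflex": "\u0302"}
VOWELS = "auioe"  # same row order as the table

def nfc_form(vowel, tone):
    """NFC form of a vowel with an underdot and an optional tone mark."""
    # Assumption taken from the table: the toneless cell uses dotless ı (U+0131).
    base = "\u0131" if vowel == "i" and tone == "" else vowel
    return unicodedata.normalize("NFC", base + UNDERDOT + tone)

# A combo is precomposed when normalization collapses it to one codepoint.
precomposed = [
    nfc_form(v, t)
    for v in VOWELS
    for t in TONES.values()
    if len(nfc_form(v, t)) == 1
]

print([f"U+{ord(c):04X}" for c in precomposed])
# ['U+1EA1', 'U+1EAD', 'U+1EE5', 'U+1ECD', 'U+1ED9', 'U+1EB9', 'U+1EC7']
```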
Paradoxically, depending on the font and operating system, the "abnormal" forms (like é + underdot) may render more correctly. They are demonstrated in the table below:
a | ạ | ạ́ | ạ̈ | ậ |
---|---|---|---|---|
u | ụ | ụ́ | ụ̈ | ụ̂ |
ı | ı̣ | ị́ | ị̈ | ị̂ |
o | ọ | ọ́ | ọ̈ | ộ |
e | ẹ | ẹ́ | ẹ̈ | ệ |
(MediaWiki normalizes page contents, so the table above has to use HTML entities to get the desired effect (e.g. ị́ for ị́). Template:T will do this for you.)
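For text produced outside the wiki, the "abnormal" forms can be built by composing the vowel with its tone mark first and appending the underdot afterwards, without normalizing again. The sketch below is only an illustration: the functions abnormal_form and as_entities are hypothetical helpers, and the hexadecimal character references they emit are one possible way of writing HTML entities, not necessarily what Template:T produces.

```python
import unicodedata

UNDERDOT = "\u0323"

def abnormal_form(vowel, tone):
    """Precomposed vowel+tone followed by a combining underdot (deliberately not NFC)."""
    return unicodedata.normalize("NFC", vowel + tone) + UNDERDOT

def as_entities(text):
    """Spell every character as a hexadecimal character reference (hypothetical helper)."""
    return "".join(f"&#x{ord(c):X};" for c in text)

form = abnormal_form("e", "\u0301")   # é + underdot
print(as_entities(form))              # &#xE9;&#x323;

# Normalizing would reorder the marks, which is why the form has to be protected.
print(unicodedata.normalize("NFC", form) == form)  # False
```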