Unicode is a text encoding standard that assigns numeric codes to the characters of writing systems from all over the world.
This page compiles some facts about Unicode that Toaqists may find useful to know.
Unicode vs. UTF-8
Unicode, the standard, dictates (for example) that the character ꝡ is represented by the number 42849, or A761 in hexadecimal.
These "codepoint numbers" are usually written as U+ followed by the hexadecimal number: U+A761.
UTF-8, an encoding, dictates how to encode that number across bytes in a file: it says U+A761 is ea 9d a1.
There are other encodings of Unicode, but they are less commonly used. For example, in big-endian UTF-16, the encoding of U+A761 is simply a7 61.
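The numbers above can be checked with a short Python sketch. It uses only built-in string methods; the variable names are illustrative, and `bytes.hex` with a separator assumes Python 3.8 or later.

```python
# ꝡ is written as an escape so the example does not depend on this file's own encoding.
vy = "\ua761"

codepoint = ord(vy)                            # the Unicode codepoint as an integer
label = f"U+{codepoint:04X}"                   # conventional U+ notation
utf8_bytes = vy.encode("utf-8").hex(" ")       # bytes as stored in a UTF-8 file
utf16_bytes = vy.encode("utf-16-be").hex(" ")  # big-endian UTF-16, no byte order mark

print(codepoint, label)  # 42849 U+A761
print(utf8_bytes)        # ea 9d a1
print(utf16_bytes)       # a7 61
```

Little-endian UTF-16 (`"utf-16-le"`) would give the same two bytes in the opposite order.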
Combining characters and normalization
A letter with a diacritic like é can be represented as a precomposed form (é, U+00E9 latin small letter e with acute) or as a sequence of a base letter and combining characters (e, U+0065 latin small letter e, followed by ◌́, U+0301 combining acute accent).
Unicode text may be normalized to smooth over these differences: either by precomposing everything as much as possible (normalization form C or NFC) or by decomposing everything into combining characters (normalization form D or NFD).
Normalization also pins down the order of combining characters: marks below the letter, such as underdots, come before marks above it, such as acute accents and hats. The string é + underdot has NFC ẹ + acute and NFD e + underdot + acute.
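As a minimal sketch, the example above can be reproduced with `unicodedata.normalize` from Python's standard library. The helper `codepoints` is a hypothetical name introduced here only to make the invisible differences printable.

```python
import unicodedata

UNDERDOT = "\u0323"  # combining dot below

def codepoints(text):
    """Render a string as space-separated U+XXXX labels."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

source = "\u00e9" + UNDERDOT  # precomposed é followed by an underdot
nfc = unicodedata.normalize("NFC", source)
nfd = unicodedata.normalize("NFD", source)

print(codepoints(source))  # U+00E9 U+0323
print(codepoints(nfc))     # U+1EB9 U+0301  (ẹ + acute)
print(codepoints(nfd))     # U+0065 U+0323 U+0301  (e + underdot + acute)
```

Comparing two strings after normalizing both to the same form is a common way to treat such variants as equal.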
Dotless ı and normalization
The letter í decomposes into i + acute, not ı + acute. Placing diacritics on a dotless ı may produce wrong-looking results. Compare:
Letter | NFC | NFD |
---|---|---|
ı | ı | ı |
î | î | i + circumflex |
ı̣ | ı + underdot | ı + underdot |
ị̂ | ị + circumflex | i + underdot + circumflex |
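
The rows of this table can be verified with a short Python sketch. The constants and the `nfc`/`nfd` helpers are names chosen for this example; the assertions pass silently when the behavior matches the table.

```python
import unicodedata

DOTLESS_I = "\u0131"
ACUTE = "\u0301"
CIRCUMFLEX = "\u0302"
UNDERDOT = "\u0323"

def nfc(text):
    return unicodedata.normalize("NFC", text)

def nfd(text):
    return unicodedata.normalize("NFD", text)

# Precomposed í (U+00ED) decomposes to the dotted i, never to ı.
assert nfd("\u00ed") == "i" + ACUTE

# Dotless ı neither decomposes nor composes, so both forms leave it untouched.
stacked = DOTLESS_I + UNDERDOT
assert nfc(stacked) == stacked and nfd(stacked) == stacked

# Dotted i composes with the underdot first; the circumflex stays a combining mark.
assert nfc("i" + CIRCUMFLEX + UNDERDOT) == "\u1ecb" + CIRCUMFLEX
assert nfd("\u1ecb" + CIRCUMFLEX) == "i" + UNDERDOT + CIRCUMFLEX

print("all rows match")
```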
Precomposed tone–underdot combos
Not all Toaq tone–underdot combos have precomposed characters. This table shows precomposed characters in green and NFC forms in red:
Vowel | Underdot | Underdot + acute | Underdot + diaeresis | Underdot + circumflex |
---|---|---|---|---|
a | ạ | ạ́ | ạ̈ | ậ |
u | ụ | ụ́ | ụ̈ | ụ̂ |
ı | ı̣ | ị́ | ị̈ | ị̂ |
o | ọ | ọ́ | ọ̈ | ộ |
e | ẹ | ẹ́ | ẹ̈ | ệ |
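
One way to find out which combinations have a precomposed character is to normalize each sequence to NFC and check whether a single codepoint comes out. The sketch below does that; `is_precomposed` and the other names are illustrative, and the dotted i stands in for Toaq's ı because the dotless letter never composes.

```python
import unicodedata

UNDERDOT = "\u0323"
TONE_MARKS = {"acute": "\u0301", "diaeresis": "\u0308", "circumflex": "\u0302"}

def is_precomposed(sequence):
    """True if NFC folds the whole sequence into a single codepoint."""
    return len(unicodedata.normalize("NFC", sequence)) == 1

precomposed = []
for vowel in "auioe":
    for name, mark in TONE_MARKS.items():
        if is_precomposed(vowel + UNDERDOT + mark):
            precomposed.append(f"{vowel} + underdot + {name}")

print(precomposed)
# ['a + underdot + circumflex', 'o + underdot + circumflex', 'e + underdot + circumflex']
```

The plain underdotted vowels ạ ụ ị ọ ẹ are also single codepoints, while dotless ı + underdot is not.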
Paradoxically, depending on the font and operating system, the "abnormal" forms (like é + underdot) may show up more correctly. They are demonstrated in the table below:
Vowel | Underdot | Underdot + acute | Underdot + diaeresis | Underdot + circumflex |
---|---|---|---|---|
a | ạ | ạ́ | ạ̈ | ậ |
u | ụ | ụ́ | ụ̈ | ụ̂ |
ı | ı̣ | ị́ | ị̈ | ị̂ |
o | ọ | ọ́ | ọ̈ | ộ |
e | ẹ | ẹ́ | ẹ̈ | ệ |
(MediaWiki normalizes page contents, so the above table has had to use HTML entities to get the desired effect: for example, ị́ is spelled as í followed by an entity for the combining underdot. Template:T will do this for you.)
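For illustration, here is a rough Python sketch of how such an "abnormal" form could be built and spelled as numeric character references. It is not the implementation of Template:T, and the exact entity spelling used on the wiki is an assumption; `abnormal_form` and `char_refs` are hypothetical helper names.

```python
import unicodedata

UNDERDOT = "\u0323"

def abnormal_form(vowel, tone_mark):
    """Precomposed vowel + tone letter, followed by a combining underdot."""
    return unicodedata.normalize("NFC", vowel + tone_mark) + UNDERDOT

def char_refs(text):
    """Spell every character as a hexadecimal numeric character reference."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)

form = abnormal_form("i", "\u0301")  # í followed by an underdot
print(char_refs(form))               # &#xED;&#x323;
```

Normalizing `form` to NFC gives back ị + acute, which is why the unnormalized spelling has to be protected from the wiki's normalization in the first place.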