Unicode

Revision as of 23:47, 27 May 2024 by Laqme (talk | contribs)

Unicode is a text encoding standard that assigns numeric codes to the characters of writing systems from all over the world.

This page compiles some facts about Unicode that Toaqists may find useful to know.

Unicode vs. UTF-8

Unicode, the standard, dictates (for example) that the character ꝡ is represented by the number 42849, or A761 in hexadecimal.

These "codepoint numbers" are usually written as U+ followed by the hexadecimal number: U+A761.

UTF-8, an encoding, dictates how to encode that number across bytes in a file: it says U+A761 is ea 9d a1.

There are other encodings of Unicode, but they are not as commonly used. For example, in UTF-16 (with big-endian byte order), the encoding of U+A761 is simply a7 61.
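The distinction between a code point and its encoded bytes can be checked with a short Python sketch. It uses only built-in string and bytes methods; the variable names are illustrative.

```python
ch = "\ua761"  # ꝡ, LATIN SMALL LETTER VY

# The code point is a plain number, independent of any encoding.
codepoint = ord(ch)
print(codepoint)             # 42849
print(f"U+{codepoint:04X}")  # U+A761

# An encoding turns that number into bytes.
utf8 = ch.encode("utf-8")
utf16_be = ch.encode("utf-16-be")
utf16_le = ch.encode("utf-16-le")
print(utf8.hex(" "))      # ea 9d a1
print(utf16_be.hex(" "))  # a7 61
print(utf16_le.hex(" "))  # 61 a7
```

Note that the UTF-16 byte order depends on endianness, which is why the little-endian form has the two bytes swapped.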

Combining characters and normalization

A letter with a diacritic like é can be represented as a precomposed form (é U+00E9 latin small letter e with acute) or as a sequence of a base letter and combining characters (e U+0065 latin small letter e and ◌́ U+0301 combining acute accent).

Unicode text may be normalized to smooth over these differences: either by precomposing everything as much as possible (normalization form C or NFC) or by decomposing everything into combining characters (normalization form D or NFD).
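As a minimal sketch, Python's standard `unicodedata` module can convert between the two forms. The variable names here are only for illustration.

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# The two strings look the same but compare as different.
print(precomposed == decomposed)  # False

# NFC composes as much as possible; NFD decomposes fully.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)  # True
print(nfd == decomposed)   # True
```

Normalizing both sides to the same form before comparing strings avoids mismatches caused by this difference.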

Normalization also pins down the order of combining characters. Underdots come before hats. For example, the string é + underdot normalizes to ẹ + acute in NFC and to e + underdot + acute in NFD.
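The reordering can be observed with the following Python sketch. The `codepoints` helper is a hypothetical convenience function written for this example, not part of any library.

```python
import unicodedata

def codepoints(s):
    """Hypothetical helper: list the code points of s in U+XXXX notation."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

s = "\u00e9\u0323"  # é followed by a combining dot below

nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)
print(codepoints(nfc))  # U+1EB9 U+0301         (ẹ + acute)
print(codepoints(nfd))  # U+0065 U+0323 U+0301  (e + underdot + acute)

# The order comes from each mark's canonical combining class:
# marks with a lower class sort first.
print(unicodedata.combining("\u0323"))  # 220 (dot below)
print(unicodedata.combining("\u0301"))  # 230 (acute)
```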

Dotless ı and normalization

The letter í decomposes into i + acute, not ı + acute. Placing diacritics on a dotless ı may produce wrong-looking results. Compare:

Letter  NFC             NFD
ı       ı               ı
î       î               i + circumflex
ı̣       ı + underdot    ı + underdot
ị̂       ị + circumflex  i + underdot + circumflex
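A short Python sketch, using the standard `unicodedata` module, shows why the dotless ı behaves differently: it has no canonical decomposition, and no precomposed letter decomposes into it.

```python
import unicodedata

# í decomposes into dotted i (U+0069) plus acute, not dotless ı (U+0131).
i_acute_parts = unicodedata.decomposition("\u00ed")
print(i_acute_parts)  # 0069 0301

# Dotless ı has no decomposition at all.
dotless_parts = unicodedata.decomposition("\u0131")
print(repr(dotless_parts))  # ''

# So ı + circumflex is left untouched by NFC,
# while i + circumflex composes into î (U+00EE).
dotless_nfc = unicodedata.normalize("NFC", "\u0131\u0302")
dotted_nfc = unicodedata.normalize("NFC", "i\u0302")
print(len(dotless_nfc), len(dotted_nfc))  # 2 1
```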

Precomposed tone–underdot combos

Not all Toaq tone–underdot combos have precomposed characters. This table shows precomposed characters in green and NFC forms in red:

       
a ạ́ ạ̈
u ụ́ ụ̈ ụ̂
ı ı̣ ị́ ị̈ ị̂
o ọ́ ọ̈
e ẹ́ ẹ̈
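One way to find out which combinations have a precomposed character is to normalize each one to NFC and check whether the result is a single code point. The sketch below assumes the three tone marks that appear in the tables (acute, diaeresis, circumflex) and uses the dotted i, since dotless ı never composes. The names `TONES` and `has_precomposed` are made up for this example.

```python
import unicodedata

UNDERDOT = "\u0323"
TONES = {"acute": "\u0301", "diaeresis": "\u0308", "circumflex": "\u0302"}

def has_precomposed(vowel, tone):
    """True if vowel + underdot + tone collapses to one code point in NFC."""
    nfc = unicodedata.normalize("NFC", vowel + UNDERDOT + tone)
    return len(nfc) == 1

results = {}
for vowel in "auioe":
    results[vowel] = [name for name, mark in TONES.items()
                      if has_precomposed(vowel, mark)]
    print(vowel, results[vowel])
```

With current Unicode data, only a, o and e with underdot and circumflex (ậ, ộ, ệ) come out as single code points. Every other combination needs at least one combining character.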

Paradoxically, depending on the font and operating system, the "abnormal" forms (like é + underdot) may show up more correctly. They are demonstrated in the table below:

       
a ạ́ ạ̈ ậ
u ụ́ ụ̈ ụ̂
ı ı̣ ị́ ị̈ ị̂
o ọ́ ọ̈ ộ
e ẹ́ ẹ̈ ệ

MediaWiki normalizes page contents, so the table above has had to use HTML entities to get the desired effect (e.g. ị́ for ị́). Template:T will do this for you.
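As an illustration of the idea only, the following Python sketch turns a string into numeric character references, one per code point, so that the original sequence is spelled out explicitly in the page source. The function `to_entities` is hypothetical and is not how Template:T is implemented.

```python
def to_entities(s):
    """Hypothetical helper: encode each code point as an HTML numeric character reference."""
    return "".join(f"&#x{ord(c):X};" for c in s)

# é followed by a combining dot below, the "abnormal" order.
abnormal = "\u00e9\u0323"
print(to_entities(abnormal))  # &#xE9;&#x323;
```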

See also