Unicode

Revision as of 23:42, 27 May 2024 by Laqme (Initial article)

Unicode is a standard that assigns a numeric code to each character of writing systems from all over the world.

This page compiles some facts about Unicode that Toaqists may find useful to know.

Unicode vs. UTF-8

Unicode, the standard, dictates (for example) that the character ꝡ is represented by the number 42849, or A761 in hexadecimal.

These "codepoint numbers" are usually written as U+ followed by the hexadecimal number: U+A761.

UTF-8, an encoding, dictates how to encode that number across bytes in a file: it says U+A761 is ea 9d a1.

There are other encodings of Unicode, but they are less commonly used. For example, in big-endian UTF-16, the encoding of U+A761 is simply a7 61.
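
The relationship between a codepoint and its encodings can be checked in a few lines of Python. This is a minimal sketch using only built-in string methods; the variable names are illustrative and do not come from any Toaq tooling.

```python
# U+A761, the character discussed above, written as an escape so the
# source file does not depend on the editor's encoding.
ch = "\ua761"

codepoint = ord(ch)                          # the number Unicode assigns
label = f"U+{codepoint:04X}"                 # conventional U+ notation
utf8_hex = ch.encode("utf-8").hex(" ")       # bytes as stored in a UTF-8 file
utf16_hex = ch.encode("utf-16-be").hex(" ")  # big-endian UTF-16, no byte order mark

print(codepoint, label)
print("UTF-8: ", utf8_hex)
print("UTF-16:", utf16_hex)
```

The codepoint stays the same in both cases; only the byte sequence changes with the encoding.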

Combining characters and normalization

A letter with a diacritic like é can be represented as a precomposed form (é U+00E9 latin small letter e with acute) or as a sequence of a base letter and combining characters (e U+0065 latin small letter e and ◌́ U+0301 combining acute accent).

Unicode text may be normalized to smooth over these differences: either by precomposing everything as much as possible (normalization form C or NFC) or by decomposing everything into combining characters (normalization form D or NFD).

Normalization also pins down the order of combining characters. Marks below the letter, such as the underdot, come before marks above it, such as the acute or the circumflex ("hat"). The string é + underdot has NFC ẹ + acute and NFD e + underdot + acute.
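
The example above can be reproduced with Python's standard `unicodedata` module. This is a small sketch, not a recommendation of any particular workflow; the helper `codepoints` is a hypothetical name introduced here only to make the result readable.

```python
import unicodedata

UNDERDOT = "\u0323"  # combining dot below
ACUTE = "\u0301"     # combining acute accent

# Precomposed é followed by an underdot: the marks are in "wrong" order.
text = "\u00e9" + UNDERDOT

nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)

def codepoints(s: str) -> str:
    """Hypothetical helper: list the codepoints of s in U+ notation."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

print("input:", codepoints(text))
print("NFC:  ", codepoints(nfc))
print("NFD:  ", codepoints(nfd))

# The ordering follows from the canonical combining class of each mark:
# the lower class sorts first.
print(unicodedata.combining(UNDERDOT), unicodedata.combining(ACUTE))
```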

Dotless ı and normalization

The letter í decomposes into i + acute, not ı + acute. Placing diacritics on a dotless ı may produce wrong-looking results. Compare:

Letter  NFC             NFD
ı       ı               ı
î       î               i + circumflex
ı̣       ı + underdot    ı + underdot
ị̂       ị + circumflex  i + underdot + circumflex
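
The rows of this table can be checked with `unicodedata`. The sketch below assumes nothing beyond the Python standard library; the constant names are chosen here for readability.

```python
import unicodedata

DOTLESS_I = "\u0131"
CIRCUMFLEX = "\u0302"
UNDERDOT = "\u0323"

# Precomposed î decomposes to a dotted i, never to a dotless ı.
nfd_i_circumflex = unicodedata.normalize("NFD", "\u00ee")

# A dotless ı has no precomposed forms, so NFC leaves the sequence alone.
nfc_dotless = unicodedata.normalize("NFC", DOTLESS_I + UNDERDOT)

# Dotted i + circumflex + underdot: NFC reorders the marks, then composes
# i + underdot and leaves the circumflex as a combining character.
nfc_dotted = unicodedata.normalize("NFC", "i" + CIRCUMFLEX + UNDERDOT)

for name, value in [
    ("NFD of î", nfd_i_circumflex),
    ("NFC of ı + underdot", nfc_dotless),
    ("NFC of i + circumflex + underdot", nfc_dotted),
]:
    print(name, [f"U+{ord(c):04X}" for c in value])
```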

Precomposed tone–underdot combos

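Not all Toaq tone–underdot combos have precomposed characters. In this table, an asterisk marks a single precomposed character; every other cell is an NFC form that still contains a combining character:

Letter  Underdot only  Acute  Diaeresis  Circumflex
a       ạ*             ạ́      ạ̈          ậ*
u       ụ*             ụ́      ụ̈          ụ̂
ı       ı̣              ị́      ị̈          ị̂
o       ọ*             ọ́      ọ̈          ộ*
e       ẹ*             ẹ́      ẹ̈          ệ*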
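
Which combos have a precomposed character can also be worked out programmatically: a combo is precomposed exactly when its NFC form is a single codepoint. The following is a sketch under that assumption; `is_precomposed` and the `TONES` mapping are hypothetical names introduced for this illustration.

```python
import unicodedata

UNDERDOT = "\u0323"
# Tone marks to combine with the underdot; "" means the underdot alone.
TONES = {
    "underdot only": "",
    "acute": "\u0301",
    "diaeresis": "\u0308",
    "circumflex": "\u0302",
}

def is_precomposed(base: str, tone: str) -> bool:
    """True if base + underdot + tone composes to a single codepoint in NFC."""
    nfc = unicodedata.normalize("NFC", base + UNDERDOT + tone)
    return len(nfc) == 1

# "\u0131" is the dotless ı used in Toaq orthography.
precomposed = [
    (base, name)
    for base in "au\u0131oe"
    for name, tone in TONES.items()
    if is_precomposed(base, tone)
]

for base, name in precomposed:
    print(base, name)
```

Note that the dotless ı never appears in the output, since none of its combinations compose.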

Paradoxically, depending on the font and operating system, the "abnormal" forms (like é + underdot) may render more correctly. They are demonstrated in the table below:

Letter  Underdot only  Acute  Diaeresis  Circumflex
a       -              ạ́      ạ̈          ậ
u       -              ụ́      ụ̈          ụ̂
ı       ı̣              ị́      ị̈          ị̂
o       -              ọ́      ọ̈          ộ
e       -              ẹ́      ẹ̈          ệ

(MediaWiki normalizes page contents, so the table above had to use HTML entities to get the desired effect, e.g. writing í + underdot as character references instead of the literal text ị́. Template:T will do this for you.)
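
For text prepared outside the wiki, the same idea can be sketched in Python: spell out each non-ASCII character as a numeric character reference, so that the "abnormal" order is not changed by normalization of the page source. The function `to_char_refs` is a hypothetical helper written for this illustration; it is not what Template:T does internally, which this page does not describe.

```python
import html

def to_char_refs(text: str) -> str:
    """Hypothetical helper: replace every non-ASCII character with a
    hexadecimal numeric character reference, leaving ASCII untouched."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)

# "Abnormal" form: precomposed í followed by a combining underdot.
abnormal = "\u00ed\u0323"
refs = to_char_refs(abnormal)
print(refs)

# Decoding the references gives back the original, un-normalized sequence.
print(html.unescape(refs) == abnormal)
```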

See also