Unicode

Revision as of 23:42, 27 May 2024

Unicode is a standard that assigns a numeric code to each character of writing systems from all over the world.

This page compiles some facts about Unicode that Toaqists may find useful to know.

Unicode vs. UTF-8

Unicode, the standard, dictates (for example) that the character ꝡ is represented by the number 42849, or A761 in hexadecimal.

These "codepoint numbers" are usually written as U+ followed by the hexadecimal number: U+A761.

UTF-8, an encoding, dictates how to encode that number across bytes in a file: it says U+A761 is ea 9d a1.

There are other encodings of Unicode, but they are not as commonly used. For example, in big-endian UTF-16, the encoding of U+A761 is simply a7 61 (little-endian UTF-16 swaps the two bytes: 61 a7).
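As a quick check of the numbers above, here is a minimal Python sketch that looks up the codepoint of ꝡ and encodes it in a few encodings. The variable names are only illustrative.

```python
# The character ꝡ, written as an escape so the source file's own encoding doesn't matter.
ch = "\uA761"

codepoint = ord(ch)                 # the Unicode codepoint number (decimal)
label = f"U+{codepoint:04X}"        # the usual U+ notation in hexadecimal

# The same codepoint encoded as bytes in different encodings.
utf8 = ch.encode("utf-8").hex(" ")
utf16_be = ch.encode("utf-16-be").hex(" ")
utf16_le = ch.encode("utf-16-le").hex(" ")

print(codepoint, label)   # 42849 U+A761
print(utf8)               # ea 9d a1
print(utf16_be)           # a7 61
print(utf16_le)           # 61 a7
```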

Combining characters and normalization

A letter with a diacritic like é can be represented as a precomposed form (é U+00E9 latin small letter e with acute) or as a sequence of a base letter and combining characters (e U+0065 latin small letter e and ◌́ U+0301 combining acute accent).

Unicode text may be normalized to smooth over these differences: either by precomposing everything as much as possible (normalization form C or NFC) or by decomposing everything into combining characters (normalization form D or NFD).
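The two normalization forms can be tried out with Python's standard unicodedata module. This is a small sketch; the helper name same_text is made up for illustration.

```python
import unicodedata

precomposed = "\u00E9"    # é as a single codepoint
decomposed = "e\u0301"    # e followed by a combining acute accent

# The two strings look the same but are different sequences of codepoints.
print(precomposed == decomposed)   # False

nfc = unicodedata.normalize("NFC", decomposed)    # composes to U+00E9
nfd = unicodedata.normalize("NFD", precomposed)   # decomposes to e + U+0301


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(precomposed, decomposed))   # True
```

Normalizing both sides before comparing is the usual way to make searching and matching ignore the precomposed/decomposed difference.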

Normalization also pins down the order of combining characters: the underdot, a mark below the letter, comes before marks above the letter such as the acute or circumflex. For example, the string é + underdot has NFC ẹ + acute and NFD e + underdot + acute.
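The reordering can be observed directly. The sketch below uses unicodedata.combining, which returns a character's canonical combining class; marks with a lower class are sorted first during normalization.

```python
import unicodedata

UNDERDOT = "\u0323"   # combining dot below
ACUTE = "\u0301"      # combining acute accent

# Canonical combining classes: the underdot's class is lower, so it sorts first.
print(unicodedata.combining(UNDERDOT))   # 220
print(unicodedata.combining(ACUTE))      # 230

s = "\u00E9" + UNDERDOT                  # é + underdot

nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)

print([unicodedata.name(c) for c in nfc])
# ['LATIN SMALL LETTER E WITH DOT BELOW', 'COMBINING ACUTE ACCENT']
print([unicodedata.name(c) for c in nfd])
# ['LATIN SMALL LETTER E', 'COMBINING DOT BELOW', 'COMBINING ACUTE ACCENT']
```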

Dotless ı and normalization

The letter í decomposes into i + acute, not ı + acute. Placing diacritics on a dotless ı may produce wrong-looking results. Compare:

Letter	NFC	NFD
ı	`ı`	`ı`
î	`î`	`i + circumflex`
ı̣	`ı + underdot`	`ı + underdot`
ị̂	`ị + circumflex`	`i + underdot + circumflex`
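The rows of this table can be reproduced with a short Python sketch. It shows that dotless ı never takes part in composition or decomposition, while the dotted forms do.

```python
import unicodedata

DOTLESS_I = "\u0131"
UNDERDOT = "\u0323"
ACUTE = "\u0301"
CIRCUMFLEX = "\u0302"


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


# í decomposes into a dotted i plus acute, never a dotless ı.
i_acute_nfd = nfd("\u00ED")

# Dotless ı plus a diacritic stays a two-codepoint sequence in both forms.
dotless_acute_nfc = nfc(DOTLESS_I + ACUTE)
dotless_underdot_nfc = nfc(DOTLESS_I + UNDERDOT)

# Dotted i + underdot + circumflex composes only as far as ị.
dotted_nfc = nfc("i" + UNDERDOT + CIRCUMFLEX)

print(i_acute_nfd == "i" + ACUTE)                    # True
print(dotless_acute_nfc == DOTLESS_I + ACUTE)        # True
print(dotless_underdot_nfc == DOTLESS_I + UNDERDOT)  # True
print(dotted_nfc == "\u1ECB" + CIRCUMFLEX)           # True
```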

Precomposed tone–underdot combos

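Not all Toaq tone–underdot combos have precomposed characters. In the table below, every cell is in NFC; only ạ, ụ, ọ, ẹ, ậ, ộ and ệ are single precomposed characters, and the remaining cells are sequences that contain at least one combining character: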

Base	Underdot	Underdot + acute	Underdot + diaeresis	Underdot + circumflex
a	ạ	ạ́	ạ̈	ậ
u	ụ	ụ́	ụ̈	ụ̂
ı	ı̣	ị́	ị̈	ị̂
o	ọ	ọ́	ọ̈	ộ
e	ẹ	ẹ́	ẹ̈	ệ
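Which combos have a precomposed character can be checked mechanically: a combo is precomposed exactly when its NFC form is a single codepoint. The sketch below assumes the three tone marks are the acute, diaeresis and circumflex, and uses dotted i as the base; the names TONES, nfc_form and is_precomposed are made up for illustration.

```python
import unicodedata

UNDERDOT = "\u0323"
TONES = {
    "none": "",
    "acute": "\u0301",
    "diaeresis": "\u0308",
    "circumflex": "\u0302",
}


def nfc_form(base: str, tone: str) -> str:
    """NFC form of base + underdot + the named tone mark."""
    return unicodedata.normalize("NFC", base + UNDERDOT + TONES[tone])


def is_precomposed(base: str, tone: str) -> bool:
    """True if the whole combo composes to one codepoint."""
    return len(nfc_form(base, tone)) == 1


for base in "auioe":
    row = [f"{tone}: {'yes' if is_precomposed(base, tone) else 'no'}" for tone in TONES]
    print(base, ", ".join(row))
```

With dotless ı (U+0131) as the base, none of the combos compose at all, since ı has no precomposed forms with these marks.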

Paradoxically, depending on the font and operating system, the "abnormal", non-normalized forms (a precomposed toned letter followed by a combining underdot, like é + underdot) may render more correctly. They are demonstrated in the table below:

Base	Underdot	Underdot + acute	Underdot + diaeresis	Underdot + circumflex
a	ạ	ạ́	ạ̈	ậ
u	ụ	ụ́	ụ̈	ụ̂
ı	ı̣	ị́	ị̈	ị̂
o	ọ	ọ́	ọ̈	ộ
e	ẹ	ẹ́	ẹ̈	ệ

(MediaWiki normalizes page contents, so the table above has to use HTML entities to get the desired effect: for example, ị́ is written as the character references for í (U+00ED) and the combining underdot (U+0323). Template:T will do this for you.)
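Outside the wiki, the "abnormal" forms and their character references can be produced with a few lines of Python. This is only a sketch of the idea, not a description of how Template:T works; the function names abnormal_form and to_char_refs are hypothetical, and the base letter is assumed to be a plain a, u, i, o or e (dotted i, since dotless ı does not compose with tone marks).

```python
import unicodedata

UNDERDOT = "\u0323"


def abnormal_form(base: str, tone: str) -> str:
    """Precompose base + tone mark, then append the underdot un-normalized."""
    return unicodedata.normalize("NFC", base + tone) + UNDERDOT


def to_char_refs(text: str) -> str:
    """Spell every codepoint as a hexadecimal HTML character reference."""
    return "".join(f"&#x{ord(c):X};" for c in text)


form = abnormal_form("i", "\u0301")    # í followed by a combining underdot
print(to_char_refs(form))              # &#xED;&#x323;

# Normalizing the abnormal form yields the usual NFC form, ị + acute.
print(unicodedata.normalize("NFC", form) == "\u1ECB\u0301")   # True
```

Writing the string as character references keeps it from being re-normalized when the page source is saved.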