Programming with Unicode
Documentation
Release 2011
Victor Stinner
August 22, 2015
Contents
1 About this book
1.1 License
1.2 Thanks to
1.3 Notations
2 Unicode nightmare
3 Definitions
3.1 Character
3.2 Glyph
3.3 Code point
3.4 Character set (charset)
3.5 Character string
3.6 Byte string
3.7 UTF-8 encoded strings and UTF-16 character strings
3.8 Encoding
3.9 Encode a character string
3.10 Decode a byte string
3.11 Mojibake
3.12 Unicode: an Universal Character Set (UCS)
4 Unicode
4.1 Unicode Character Set
4.2 Categories
4.3 Statistics
4.4 Normalization
5 Charsets and encodings
5.1 Encodings
5.2 Popularity
5.3 Encodings performances
5.4 Examples
5.5 Handle undecodable bytes and unencodable characters
5.6 Other charsets and encodings
6 Historical charsets and encodings
6.1 ASCII
6.2 ISO 8859 family
6.3 CJK: Asian encodings
6.4 Cyrillic
7 Unicode encodings
7.1 UTF-8
7.2 UCS-2, UCS-4, UTF-16 and UTF-32
7.3 UTF-7
7.4 Byte order marks (BOM)
7.5 UTF-16 surrogate pairs
8 How to guess the encoding of a document?
8.1 Is ASCII?
8.2 Check for BOM markers
8.3 Is UTF-8?
8.4 Libraries
9 Good practices
9.1 Rules
9.2 Unicode support levels
9.3 Test the Unicode support of a program
9.4 Get the encoding of your inputs
9.5 Switch from byte strings to character strings
10 Operating systems
10.1 Windows
10.2 Mac OS X
10.3 Locales
10.4 Filesystems (filenames)
11 Programming languages
11.1 C language
11.2 C++
11.3 Python
11.4 PHP
11.5 Perl
11.6 Java
11.7 Go and D
12 Libraries
12.1 Qt library
12.2 The glib library
12.3 iconv library
12.4 ICU libraries
12.5 libunistring
13 Unicode issues
13.1 Security vulnerabilities
14 See also
CHAPTER 1
About this book
The book is written in reStructuredText (reST) syntax and compiled by Sphinx.
I started to write it on 25 September 2010.
1.1 License
This book is distributed under the CC BY-SA 3.0 license.
1.2 Thanks to
Reviewers: Alexander Belopolsky, Antoine Pitrou, Feth Arezki and Nelle Varoquaux, Natal Ngétal.
1.3 Notations
• 0bBBBBBBBB: 8-bit unsigned number written in binary, first digit is the most significant. For example,
0b10000000 is 128.
• 0xHHHH: number written in hexadecimal, e.g. 0xFFFF is 65535.
• 0xHH 0xHH ...: byte sequence with bytes written in hexadecimal, e.g. 0xC3 0xA9 (2 bytes) is the
character é (U+00E9) encoded to UTF-8.
• U+HHHH: Unicode character with its code point written in hexadecimal. For example, U+20AC is the “euro
sign” character, code point 8,364. Big code points are written with more than 4 hexadecimal digits, e.g.
U+10FFFF is the biggest (unallocated) code point of Unicode Character Set 6.0: 1,114,111.
• A—B: range including start and end. Examples:
– 0x00—0x7F is the range 0 through 127 (128 bytes)
– U+0000—U+00FF is the range 0 through 255 (256 characters)
• {U+HHHH, U+HHHH, ...}: a character string. For example, {U+0041, U+0042, U+0043} is the string “ABC”
(3 characters).
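The notations above map closely onto Python literals. The following is only an illustrative sketch, assuming Python 3; the variable names are made up for this example and are not part of the book's notation.

```python
# Each notation from the list above, written as a Python 3 expression.
binary_value = 0b10000000            # 0bBBBBBBBB: binary, most significant bit first
hex_value = 0xFFFF                   # 0xHHHH: hexadecimal number
utf8_bytes = bytes([0xC3, 0xA9])     # 0xC3 0xA9: a sequence of 2 bytes
euro = chr(0x20AC)                   # U+20AC: the euro sign
biggest_code_point = 0x10FFFF        # U+10FFFF
ascii_range = range(0x00, 0x7F + 1)  # the range 0x00 through 0x7F, end included
abc = "".join(chr(cp) for cp in (0x0041, 0x0042, 0x0043))  # {U+0041, U+0042, U+0043}

print(binary_value, hex_value, biggest_code_point)
print(utf8_bytes.decode("utf-8"), euro, abc)
print(len(ascii_range))
```

Note that range() excludes its end value, so the inclusive range notation needs a "+ 1" when translated to Python.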
CHAPTER 2
Unicode nightmare
Unicode is the nightmare of many developers (and users) for different, and sometimes good, reasons.
In the 1980s, only a few people read documents in languages other than their mother tongue and English. A computer
supported only a small number of languages; the user configured his region to support the languages of nearby countries.
Memory and disks were expensive, so all applications were written to use byte strings with 8-bit encodings: one byte
per character was a good compromise.
Today, with the Internet and globalization, we all read and exchange documents from everywhere around the world
(even if we don’t understand everything). The problem is that documents rarely indicate their language (encoding),
and displaying a document with the wrong encoding leads to a well-known problem: mojibake.
It is difficult to get, or worse, guess the encoding of a document. Except for encodings of the UTF family (coming
from the Unicode standard), there is no reliable algorithm for that. We have to rely on statistics to guess the most
probable encoding, which is what most Internet browsers do.
Unicode support by operating systems, programming languages and libraries varies a lot. In general, the support is
basic or non-existent. Each operating system manages Unicode differently. For example, Windows stores filenames as
Unicode, whereas UNIX and BSD operating systems use bytes.
Mixing documents stored as bytes is possible, even if they use different encodings, but leads to mojibake. Because
libraries and programs also ignore encode and decode warnings or errors, writing a single character with a diacritic
(any non-ASCII character) is sometimes enough to get an error.
Full Unicode support is complex because the Unicode charset is bigger than any other charset. For example, ISO
8859-1 contains 256 code points including 191 characters, whereas Unicode version 6.0 contains 248,966 assigned
code points. The Unicode standard is more than just a charset: it also explains how to display characters (e.g.
left-to-right for English and right-to-left for Persian), how to normalize a character string (e.g. precomposed
characters versus the decomposed form), etc.
This book explains how to sympathize with Unicode, and how you should modify your program to avoid most, or all,
issues related to encodings and Unicode.
CHAPTER 3
Definitions
3.1 Character
3.2 Glyph
3.3 Code point
A code point is an unsigned integer. The smallest code point is zero. Code points are usually written as hexadecimal,
e.g. “0x20AC” (8,364 in decimal).
3.4 Character set (charset)
A character set, abbreviated charset, is a mapping between code points and characters. The mapping has a fixed
size. For example, most 7-bit encodings have 128 entries, and most 8-bit encodings have 256 entries. The biggest
charset is the Unicode Character Set 6.0 with 1,114,112 entries.
In some charsets, code points are not all contiguous. For example, the cp1252 charset maps code points from 0 through
255, but it has only 251 entries: the code points 0x81, 0x8D, 0x8F, 0x90 and 0x9D are not assigned.
Examples from the ASCII charset: the digit five (“5”, U+0035) is assigned to the code point 0x35 (53 in decimal), and
the uppercase letter “A” (U+0041) to the code point 0x41 (65).
The biggest code point depends on the size of the charset. For example, the biggest code point of the ASCII charset is
127 (2^7 − 1).
Charset examples:
Charset                 Code point   Character
ASCII                   0x35         5 (U+0035)
ASCII                   0x41         A (U+0041)
ISO-8859-15             0xA4         € (U+20AC)
Unicode Character Set   0x20AC       € (U+20AC)
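The mapping between code points and characters can be explored with Python's built-in codecs. This is a minimal sketch, assuming Python 3 and its standard "ascii", "iso-8859-15" and "cp1252" codecs; the helper assigned_bytes() is a hypothetical name introduced only for this illustration.

```python
def assigned_bytes(encoding):
    """Return the single-byte values that the given codec is able to decode."""
    assigned = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            continue  # this code point has no entry in the charset
        assigned.append(value)
    return assigned


# One code point, one character: the rows of the table above.
five = bytes([0x35]).decode("ascii")
letter_a = bytes([0x41]).decode("ascii")
euro = bytes([0xA4]).decode("iso-8859-15")

# Count the entries of two charsets and list the holes in cp1252.
ascii_entries = assigned_bytes("ascii")
cp1252_entries = assigned_bytes("cp1252")
cp1252_holes = sorted(set(range(256)) - set(cp1252_entries))

print(five, letter_a, "U+%04X" % ord(euro))
print(len(ascii_entries), max(ascii_entries))
print(len(cp1252_entries), [hex(value) for value in cp1252_holes])
```

Python's cp1252 codec refuses to decode the five unassigned code points; other implementations may be more lenient with them.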
3.5 Character string
A character string, or “Unicode string”, is a string where each unit is a character. Depending on the implementation,
each character can be any Unicode character, or only characters in the range U+0000—U+FFFF, a range called the Basic
Multilingual Plane (BMP). There are 3 different implementations of character strings:
• array of 32-bit unsigned integers (the UCS-4 encoding): full Unicode range
• array of 16-bit unsigned integers (UCS-2): BMP only
• array of 16-bit unsigned integers with surrogate pairs (UTF-16): full Unicode range
UCS-4 uses twice as much memory as UCS-2, but it supports all Unicode characters. UTF-16 is a compromise
between UCS-2 and UCS-4: characters in the BMP range use one UTF-16 unit (16 bits), characters outside this range
use two UTF-16 units (a surrogate pair, 32 bits). This variable width is the advantage of this kind of character
string, and also its main disadvantage.
The length of a character string implemented using UTF-16 is the number of UTF-16 units, and not the number of
characters, which is confusing. For example, the U+10FFFF character is encoded as two UTF-16 units: {U+DBFF,
U+DFFF}. If the character string only contains characters of the BMP range, the length is the number of
characters. Getting the nth character or the length in characters using UTF-16 has a complexity of O(n), whereas it has a
complexity of O(1) for UCS-2 and UCS-4 strings.
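The difference between UTF-16 units and characters can be sketched in Python. This is only an illustration, assuming Python 3.3 or later (where a string is indexed by code point); the functions utf16_units(), surrogate_pair() and nth_code_point() are hypothetical helpers written for this example, not part of any library.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def surrogate_pair(code_point):
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def nth_code_point(units, n):
    """Return the n-th code point (0-based) of a list of UTF-16 units.

    The scan starts at the first unit, because the position of a character
    depends on how many surrogate pairs come before it: O(n).
    """
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
        else:
            code_point = unit
        if n == 0:
            return code_point
        n -= 1
        i += 2 if is_pair else 1
    raise IndexError("character index out of range")


text = "a\U0010FFFFb"            # 3 characters, one of them outside the BMP
units = utf16_units(text)        # 4 UTF-16 units
print(len(text), len(units))
print([hex(unit) for unit in surrogate_pair(0x10FFFF)])
print(hex(nth_code_point(units, 1)))
```

With an array of 32-bit integers, the same lookup would be a plain index operation, which is the O(1) case mentioned above.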
The Java language, the Qt library and Windows 2000 implement character strings with UTF-16. The C and Python
languages use UTF-16 or UCS-4 depending on: the size of the wchar_t type (16 or 32 bits) for C, and the compilation
mode (narrow or wide) for Python. Windows 95 uses UCS-2 strings.
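Both properties can be queried at run time from Python. The sketch below assumes CPython with the standard sys and ctypes modules; note that since Python 3.3 the narrow and wide compilation modes no longer exist, so sys.maxunicode is always 0x10FFFF there, and the "narrow" branch only matters for older interpreters.

```python
import ctypes
import sys

# sys.maxunicode is the biggest code point a Python string unit can hold:
# 0xFFFF on an old narrow build, 0x10FFFF on a wide build and on every
# build of Python 3.3 or later.
max_code_point = sys.maxunicode
python_build = "wide" if max_code_point > 0xFFFF else "narrow"

# Size in bits of the C wchar_t type on this platform, as seen by ctypes.
wchar_bits = ctypes.sizeof(ctypes.c_wchar) * 8

print("Python build:", python_build, hex(max_code_point))
print("wchar_t:", wchar_bits, "bits")
```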
See also:
UCS-2, UCS-4 and UTF-16 encodings, and surrogate pairs.
3.6 Byte string
A byte string is a character string encoded to an encoding. It is implemented as an array of 8-bit unsigned integers.
It can be called by its encoding. For example, a byte string encoded to ASCII is called an “ASCII encoded string”, or
simply an “ASCII string”.
The character range supported by a byte string depends on its encoding, because an encoding is associated with a charset.
For example, an ASCII string can only store characters in the range U+0000—U+007F.
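As a small illustration of this limit, the following sketch (assuming Python 3) tries to encode a few strings to ASCII. The helper fits_in_ascii() is a hypothetical name used only here.

```python
def fits_in_ascii(text):
    """Return True if every character of text is in U+0000 through U+007F."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False  # at least one character is outside the ASCII charset
    return True


print(fits_in_ascii("abc"))       # only ASCII letters
print(fits_in_ascii("\u007f"))    # last character of the ASCII range
print(fits_in_ascii("\u0080"))    # first character outside the range
print(fits_in_ascii("caf\u00e9")) # contains é (U+00E9)
```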
The encoding is not stored explicitly in a byte string. If the encoding is not documented or attached to the byte string,
the encoding has to be guessed, which is a difficult task. If a byte string is decoded from the wrong encoding, it will
not be displayed correctly, leading to a well-known issue: mojibake.
The same problem occurs if two byte strings encoded to different encodings are concatenated. Never concatenate
byte strings encoded to different encodings! Use character strings, instead of byte strings, to avoid mojibake issues.
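Both problems are easy to reproduce. The sketch below assumes Python 3 and uses the character é (U+00E9) with the UTF-8 and ISO-8859-1 encodings as an arbitrary example; the variable names are illustrative.

```python
text = "\u00e9"                          # the character é
utf8 = text.encode("utf-8")              # 2 bytes: 0xC3 0xA9
latin1 = text.encode("iso-8859-1")       # 1 byte: 0xE9

# Decoding from the wrong encoding gives mojibake instead of é.
mojibake = utf8.decode("iso-8859-1")

# Concatenating byte strings of different encodings: the result is not
# valid UTF-8, and decoding it from ISO-8859-1 gives mojibake again.
mixed = latin1 + utf8
try:
    mixed.decode("utf-8")
    mixed_is_valid_utf8 = True
except UnicodeDecodeError:
    mixed_is_valid_utf8 = False
mixed_as_latin1 = mixed.decode("iso-8859-1")

# Decoding each byte string from its own encoding first, then
# concatenating character strings, keeps the text intact.
safe = latin1.decode("iso-8859-1") + utf8.decode("utf-8")

print(ascii(mojibake), mixed_is_valid_utf8, ascii(mixed_as_latin1), ascii(safe))
```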
PHP 5 only supports byte strings. In the C language, “strings” are usually byte strings which are implemented as the
char* type (or const char*).
See also:
The char* type of the C language and the mojibake issue.
3.7 UTF-8 encoded strings and UTF-16 character strings
A UTF-8 string is a particular case, because UTF-8 is able to encode all Unicode characters [1]. But a UTF-8 string is
not a Unicode string because the string unit is the byte and not the character: you can get an individual byte of a
multibyte character.
[1] A UTF-8 encoder should not encode surrogate characters (U+D800—U+DFFF).
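The difference between the two kinds of units can be seen on a single character. This is a minimal sketch, assuming Python 3 and its strict built-in UTF-8 codec; the variable names are only illustrative.

```python
text = "\u00e9"               # one character: é
data = text.encode("utf-8")   # the same character as a UTF-8 string

# The unit of the character string is the character, the unit of the
# UTF-8 string is the byte.
char_length = len(text)
byte_length = len(data)
first_byte = data[0]          # a byte from the middle of a character

# A single byte of a multibyte character is not valid UTF-8 on its own.
try:
    data[:1].decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False

# Python's strict UTF-8 encoder refuses surrogate characters.
try:
    "\ud800".encode("utf-8")
    surrogate_is_encodable = True
except UnicodeEncodeError:
    surrogate_is_encodable = False

print(char_length, byte_length, hex(first_byte))
print(truncated_is_valid, surrogate_is_encodable)
```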