Programming with Unicode Documentation
Release 2011

Victor Stinner

August 22, 2015

Contents

1 About this book
  1.1 License
  1.2 Thanks to
  1.3 Notations
2 Unicode nightmare
3 Definitions
  3.1 Character
  3.2 Glyph
  3.3 Code point
  3.4 Character set (charset)
  3.5 Character string
  3.6 Byte string
  3.7 UTF-8 encoded strings and UTF-16 character strings
  3.8 Encoding
  3.9 Encode a character string
  3.10 Decode a byte string
  3.11 Mojibake
  3.12 Unicode: a Universal Character Set (UCS)
4 Unicode
  4.1 Unicode Character Set
  4.2 Categories
  4.3 Statistics
  4.4 Normalization
5 Charsets and encodings
  5.1 Encodings
  5.2 Popularity
  5.3 Encodings performances
  5.4 Examples
  5.5 Handle undecodable bytes and unencodable characters
  5.6 Other charsets and encodings
6 Historical charsets and encodings
  6.1 ASCII
  6.2 ISO 8859 family
  6.3 CJK: Asian encodings
  6.4 Cyrillic
7 Unicode encodings
  7.1 UTF-8
  7.2 UCS-2, UCS-4, UTF-16 and UTF-32
  7.3 UTF-7
  7.4 Byte order marks (BOM)
  7.5 UTF-16 surrogate pairs
8 How to guess the encoding of a document?
  8.1 Is ASCII?
  8.2 Check for BOM markers
  8.3 Is UTF-8?
  8.4 Libraries
9 Good practices
  9.1 Rules
  9.2 Unicode support levels
  9.3 Test the Unicode support of a program
  9.4 Get the encoding of your inputs
  9.5 Switch from byte strings to character strings
10 Operating systems
  10.1 Windows
  10.2 Mac OS X
  10.3 Locales
  10.4 Filesystems (filenames)
11 Programming languages
  11.1 C language
  11.2 C++
  11.3 Python
  11.4 PHP
  11.5 Perl
  11.6 Java
  11.7 Go and D
12 Libraries
  12.1 Qt library
  12.2 The glib library
  12.3 iconv library
  12.4 ICU libraries
  12.5 libunistring
13 Unicode issues
  13.1 Security vulnerabilities
14 See also

Chapter 1. About this book

The book is written in reStructuredText (reST) syntax and compiled by Sphinx.

I started writing it on 25 September 2010.

1.1 License

This book is distributed under the CC BY-SA 3.0 license.

1.2 Thanks to

Reviewers: Alexander Belopolsky, Antoine Pitrou, Feth Arezki and Nelle Varoquaux, Natal Ngétal.

1.3 Notations

• 0bBBBBBBBB: 8-bit unsigned number written in binary; the first digit is the most significant. For example, 0b10000000 is 128.
• 0xHHHH: number written in hexadecimal, e.g. 0xFFFF is 65535.
• 0xHH 0xHH ...: byte sequence with bytes written in hexadecimal, e.g. 0xC3 0xA9 (2 bytes) is the character é (U+00E9) encoded to UTF-8.
• U+HHHH: Unicode character with its code point written in hexadecimal. For example, U+20AC is the “euro sign” character, code point 8,364. Big code points are written with more than 4 hexadecimal digits, e.g. U+10FFFF is the biggest (unallocated) code point of Unicode Character Set 6.0: 1,114,111.
• A—B: range including start and end. Examples:
  – 0x00—0x7F is the range 0 through 127 (128 bytes)
  – U+0000—U+00FF is the range 0 through 255 (256 characters)
• {U+HHHH, U+HHHH, ...}: a character string. For example, {U+0041, U+0042, U+0043} is the string “ABC” (3 characters).

Chapter 2. Unicode nightmare

Unicode is the nightmare of many developers (and users) for different, and sometimes good, reasons.

In the 1980s, only a few people read documents in languages other than their mother tongue and English. A computer supported only a small number of languages; users configured their region to support the languages of neighboring countries. Memory and disks were expensive, so all applications were written to use byte strings with 8-bit encodings: one byte per character was a good compromise.

Today, with the Internet and globalization, we all read and exchange documents from everywhere around the world (even if we don’t understand everything). The problem is that documents rarely indicate their language (encoding), and displaying a document with the wrong encoding leads to a well-known problem: mojibake.

It is difficult to get, or worse, guess the encoding of a document. Except for encodings of the UTF family (coming from the Unicode standard), there is no reliable algorithm for that. We have to rely on statistics to guess the most probable encoding, which is what most Internet browsers do.

Unicode support by operating systems, programming languages and libraries varies a lot. In general, the support is basic or non-existent. Each operating system manages Unicode differently. For example, Windows stores filenames as Unicode, whereas UNIX and BSD operating systems use bytes.

Mixing documents stored as bytes is possible, even if they use different encodings, but leads to mojibake. Because libraries and programs also ignore encoding and decoding warnings or errors, writing a single character with a diacritic (any non-ASCII character) is sometimes enough to get an error.
Full Unicode support is complex because the Unicode charset is bigger than any other charset. For example, ISO 8859-1 contains 256 code points including 191 characters, whereas Unicode version 6.0 contains 248,966 assigned code points. The Unicode standard is also more than just a charset: it explains how to display characters (e.g. left-to-right for English and right-to-left for Persian), how to normalize a character string (e.g. precomposed characters versus the decomposed form), etc.

This book explains how to sympathize with Unicode, and how you should modify your program to avoid most, or all, issues related to encodings and Unicode.

Chapter 3. Definitions

3.1 Character

3.2 Glyph

3.3 Code point

A code point is an unsigned integer. The smallest code point is zero. Code points are usually written in hexadecimal, e.g. “0x20AC” (8,364 in decimal).

3.4 Character set (charset)

A character set, abbreviated charset, is a mapping between code points and characters. The mapping has a fixed size. For example, most 7-bit encodings have 128 entries, and most 8-bit encodings have 256 entries. The biggest charset is the Unicode Character Set 6.0 with 1,114,112 entries.

In some charsets, code points are not all contiguous. For example, the cp1252 charset maps code points from 0 through 255, but it has only 251 entries: the code points 0x81, 0x8D, 0x8F, 0x90 and 0x9D are not assigned.

Examples from the ASCII charset: the digit five (“5”, U+0035) is assigned to the code point 0x35 (53 in decimal), and the uppercase letter “A” (U+0041) to the code point 0x41 (65).

The biggest code point depends on the size of the charset. For example, the biggest code point of the ASCII charset is 127 (2^7 - 1).

Charset examples:

Charset                Code point   Character
ASCII                  0x35         5 (U+0035)
ASCII                  0x41         A (U+0041)
ISO-8859-15            0xA4         € (U+20AC)
Unicode Character Set  0x20AC       € (U+20AC)

3.5 Character string

A character string, or “Unicode string”, is a string where each unit is a character.
Depending on the implementation, each character can be any Unicode character, or only a character in the range U+0000—U+FFFF, a range called the Basic Multilingual Plane (BMP). There are 3 different implementations of character strings:

• array of 32-bit unsigned integers (the UCS-4 encoding): full Unicode range
• array of 16-bit unsigned integers (UCS-2): BMP only
• array of 16-bit unsigned integers with surrogate pairs (UTF-16): full Unicode range

UCS-4 uses twice as much memory as UCS-2, but it supports all Unicode characters. UTF-16 is a compromise between UCS-2 and UCS-4: characters in the BMP range use one UTF-16 unit (16 bits), and characters outside this range use two UTF-16 units (a surrogate pair, 32 bits). This variable width is both the advantage and the main disadvantage of this kind of character string.

The length of a character string implemented using UTF-16 is the number of UTF-16 units, and not the number of characters, which is confusing. For example, the U+10FFFF character is encoded as two UTF-16 units: {U+DBFF, U+DFFF}. If the character string only contains characters of the BMP range, the length is the number of characters. Getting the nth character or the length in characters using UTF-16 has a complexity of O(n), whereas it has a complexity of O(1) for UCS-2 and UCS-4 strings.

The Java language, the Qt library and Windows 2000 implement character strings with UTF-16. The C and Python languages use UTF-16 or UCS-4 depending on the size of the wchar_t type (16 or 32 bits) for C, and on the compilation mode (narrow or wide) for Python. Windows 95 uses UCS-2 strings.

See also: UCS-2, UCS-4 and UTF-16 encodings, and surrogate pairs.

3.6 Byte string

A byte string is a character string encoded to an encoding. It is implemented as an array of 8-bit unsigned integers. It can be named after its encoding. For example, a byte string encoded to ASCII is called an “ASCII encoded string”, or simply an “ASCII string”.

The character range supported by a byte string depends on its encoding, because an encoding is associated with a charset. For example, an ASCII string can only store characters in the range U+0000—U+007F.
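As a minimal sketch of how the encoding limits the character range of a byte string, the Python 3 snippet below tries to encode the same character with two encodings. Python is only the language chosen for this illustration, and try_encode is a hypothetical helper name, not something defined by the book.

```python
def try_encode(text, encoding):
    """Return the encoded byte string, or None if a character is not in the charset."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# Every character of "ABC" is in the range U+0000 through U+007F.
ascii_ok = try_encode("ABC", "ascii")

# é (U+00E9) is outside the ASCII range, so encoding to ASCII fails...
ascii_fail = try_encode("\u00e9", "ascii")

# ...but ISO 8859-1 has an entry for it (a single byte, 0xE9).
latin1_ok = try_encode("\u00e9", "iso-8859-1")

print(ascii_ok, ascii_fail, latin1_ok)
```

The same character string therefore may or may not be storable as a byte string, depending only on the encoding chosen.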
The encoding is not stored explicitly in a byte string. If the encoding is not documented or attached to the byte string, it has to be guessed, which is a difficult task. If a byte string is decoded from the wrong encoding, it will not be displayed correctly, leading to a well-known issue: mojibake.

The same problem occurs if two byte strings encoded to different encodings are concatenated. Never concatenate byte strings encoded to different encodings! Use character strings, instead of byte strings, to avoid mojibake issues.

PHP5 only supports byte strings. In the C language, “strings” are usually byte strings, implemented as the char* type (or const char*).

See also: The char* type of the C language and the mojibake issue.

3.7 UTF-8 encoded strings and UTF-16 character strings

A UTF-8 string is a particular case, because UTF-8 is able to encode all Unicode characters [1]. But a UTF-8 string is not a Unicode string, because the string unit is a byte and not a character: you can get an individual byte of a multibyte character.

[1] A UTF-8 encoder should not encode surrogate characters (U+D800—U+DFFF).
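The difference between string units and characters can be sketched in a few lines of Python 3 (the choice of Python and all variable names are assumptions made for this illustration). The snippet counts the units of one character encoded to UTF-8, then splits the UTF-16 encoding of U+10FFFF into its 16-bit units.

```python
text = "\u00e9"                  # é (U+00E9): one character
utf8 = text.encode("utf-8")      # byte string 0xC3 0xA9

char_count = len(text)           # length in characters
byte_count = len(utf8)           # length in bytes (the UTF-8 string unit)
first_byte = utf8[0]             # an individual byte of a multibyte character

# U+10FFFF is outside the BMP: UTF-16 stores it as a surrogate pair.
utf16 = "\U0010ffff".encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

print(char_count, byte_count, hex(first_byte))
print([hex(unit) for unit in units])
```

Indexing the byte string returns 0xC3, which is only half of the encoded character; likewise, the single character U+10FFFF yields the two UTF-16 units 0xDBFF and 0xDFFF.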
