Making Sense of Unicode
Published on 10 Sep 2018
*This is a summary of the article [What every programmer absolutely, positively needs to know about encodings and character sets to work with text](http://kunststube.net/encoding/), kept for quick future reference. The demonstration code is written in Python.*

### Introduction

A computer cannot store *letters*, *numbers*, *pictures* or anything else. The only thing it can store and work with is bits. Character encoding using internationally accepted standards permits worldwide interchange of text in electronic form. But there are many such standards, often conflicting with each other. Unicode was introduced to address this issue.

Some common questions for anyone dealing with text processing are: What is the difference between `unicode` and `utf-8`? Are they related? How are character encoding schemes such as `ISO-8859-6` and `Windows-1256` for Arabic related to Unicode? I try to make some sense out of what appears to be a crazy labyrinth of encodings.

### Unicode vs UTF-8

In very simplistic terms, Unicode maps every character (in every human language) to a unique number called a code point. UTF-8, UTF-16 and UTF-32 are standard character encodings that implement Unicode. UTF-32 always uses 4 bytes per code point, which wastes space, while UTF-16 uses 2 bytes most of the time and 4 bytes only when necessary. UTF-8 uses 1 byte for ASCII and more bytes when necessary, i.e. code points 128 and above are stored using 2, 3 or 4 bytes (the original design allowed up to 6 bytes, but the current standard, RFC 3629, limits UTF-8 to 4).

### Unicode, bytes and str in Python

In Python 3, strings are Unicode. So there are no separate `unicode` and `str` types as in Python 2. But in Python 3, `bytes` is a separate type from `str`, unlike in Python 2. To get the bytes representation of a Unicode character, encode it:

```python
'ع'.encode('utf-8')    # b'\xd8\xb9'
```

To get back the original character, decode it.
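
To make the encode/decode round trip and the byte counts of the three UTF encodings concrete, here is a minimal sketch. The helper name `encoded_lengths` and the list of sample characters are assumptions made for illustration, not part of the original article. The little-endian codec names (`utf-16-le`, `utf-32-le`) are used so that Python does not prepend a byte order mark, which would inflate the counts.

```python
# Sample characters chosen for illustration (hypothetical selection):
# an ASCII letter, an Arabic letter, the interrobang and an emoji.
samples = ['A', 'ع', '‽', '💪']


def encoded_lengths(ch):
    """Return the byte counts of ch in UTF-8, UTF-16 and UTF-32."""
    # The -le codecs add no byte order mark, so only the character is counted.
    return tuple(len(ch.encode(enc)) for enc in ('utf-8', 'utf-16-le', 'utf-32-le'))


for ch in samples:
    encoded = ch.encode('utf-8')           # str -> bytes
    assert encoded.decode('utf-8') == ch   # bytes -> str gives the original back
    print(f'U+{ord(ch):04X}', encoded_lengths(ch))

# Prints:
# U+0041 (1, 2, 4)
# U+0639 (2, 2, 4)
# U+203D (3, 2, 4)
# U+1F4AA (4, 4, 4)
```

The output shows UTF-8 growing from 1 to 4 bytes as the code point increases, UTF-16 switching from 2 to 4 bytes only for the emoji, and UTF-32 staying at 4 bytes throughout.
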
Calling `str()` on a bytes object doesn't work for this (without an encoding argument it returns the `repr` of the bytes); use `decode()`:

```python
import unicodedata

b'\xd8\xb9'.decode()     # 'ع' (the default encoding is utf-8)

ord('‽')                 # 8253
chr(8253)                # '‽'
hex(8253)                # '0x203d'
int('203d', 16)          # 8253
print(u'\u203d')         # ‽ (can also write '\u203d')
print(ascii('‽'))        # '\u203d'
len('\u203d')            # 1
unicodedata.name('‽')    # 'INTERROBANG'
print(u"\U0001F601")     # 😁 (\U takes 8 hex digits and reaches code points beyond the 4-digit \u)
```

The package `unidecode` tries to convert Unicode text to ASCII:

```python
from unidecode import unidecode

unidecode('Straßenüber ')        # 'Strassenuber '
'»'.encode('ascii', 'ignore')    # b'' (non-ASCII characters are dropped)
```

#### Another Example

```python
ord('💪')              # 128170
hex(128170)            # '0x1f4aa'
print('\U0001f4aa')    # 💪
```

### Malayalam Unicode

The Malayalam Unicode block runs from 0D00 to 0D7F. But the most useful characters are from 3330 (`\u0D02`) to 3405 (`\u0D4D`). The block occupies just 128 code points, but some of them are unused ൝.

#### Iterating over unicode

Iterating over a string yields one code point at a time, so vowel signs come out separately from the consonants they attach to:

```python
for c in 'ബിനു':
    print(c)    # ബ, ി, ന, ു (one per line)
```

### References

- Must read for every programmer: [The Absolute Minimum Every Software Developer.. by Joel Spolsky](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)
- [Understanding unicode in python](https://learn.rmotr.com/python/understanding-unicode-in-python/strings-and-unicode/unicode-in-python)