Encodings

Martin Ueding

2013-08-02

Code & Zahlen

One of the bad things about the computer is that it was invented by people who thought that their language was the only one there is. So plain ASCII has only 128 characters, which fit into seven bits. This is enough if you are only concerned with English.

But, and I realize that this might be an exception, there are people who do not use English exclusively and still want to use a computer, for reasons only they know. So these other people invented things like ISO Latin-1 to accommodate their weird "ä" or "ß", for example. A byte has 256 possible values and ASCII uses only the lower 128, so the seven special characters of German fit nicely into the upper half. But that upper half was already taken on many machines: the IBM PC character set, for example, filled it with fractions and pretty borders.

The problem is that there are only 256 slots, and they are just numbers, so everybody can interpret them as they like. This led to the horror we call code pages. You can tell the computer that this is a German text and it will interpret a given number as an "ö" instead of "¶", for example. This works only if everybody on the way knows that it is a German text. Say you write a function that capitalizes everything.

The easy approach is just the following (C code):

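char c = 'b';                 /* the character to convert */
if ('a' <= c && c <= 'z')     /* only the plain ASCII lowercase letters */
    c += 'A' - 'a';           /* shift it to the uppercase range */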

So it converts the letters a-z to A-Z. This may work with English words, but what happens to "büro", for example? It becomes "BüRO" instead of the desired "BÜRO". So you need to tell your function that if the text is German, then there are additional mappings from "ä" to "Ä", "ö" to "Ö", and even from "ß" to "SS".
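For comparison, here is a small Python 3 sketch of the same problem. The helper `ascii_upper` is a made-up name for this illustration and mimics the C snippet above; `str.upper()` is the built-in that knows the Unicode case mappings.

```python
def ascii_upper(text):
    """Uppercase only the letters a-z, like the C snippet does."""
    result = []
    for ch in text:
        if "a" <= ch <= "z":
            # Same arithmetic as in C: shift by the distance between 'A' and 'a'.
            ch = chr(ord(ch) + ord("A") - ord("a"))
        result.append(ch)
    return "".join(result)


print(ascii_upper("büro"))   # BüRO, the ü is left untouched
print("büro".upper())        # BÜRO
print("straße".upper())      # STRASSE, the ß turns into two letters
```

The last line is the reason why a one-character-in, one-character-out table is not enough for proper capitalization.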

If you forget to specify this option, or the function does not have this option, or it has the option, but does not support your specific code page, then you get wrong results. But this does not bother you at all, if you only have English text, since you cannot imagine anything that could go wrong with this. So there is no reason to fix anything.

This ambiguity becomes even more apparent when the text is displayed in the wrong encoding. So your "Büro" might become "B~ro", "BÃ¼ro" or something like that. Your readers will appreciate it.
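One way to produce such garbage on purpose, as a Python 3 sketch: write the text as UTF-8 and read it back as Latin-1. The variable names are only illustrative.

```python
text = "Büro"
utf8_bytes = text.encode("utf-8")        # b'B\xc3\xbcro', the ü takes two bytes
garbled = utf8_bytes.decode("latin-1")   # each byte is read as its own character

print(garbled)       # BÃ¼ro
print(len(text), len(garbled))
```

Nothing was lost here, the bytes are still the same. They are just looked up in the wrong table.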

And every language basically has its own code pages. Right, pages! And it is impossible to mix different languages in the same text because one and the same number might mean, say, "ø" in one code page and "ß" in another. So if you interpret the text in one language, you will get the "ø" for both, or the "ß" for both. This is not correct, of course, and therefore limits you a lot.

If functions, like the one mentioned above, try to do something with the content, they have to interpret it, which is hard since they need to know the encoding. And all of them default to something else, which makes it really hard to coordinate this mess.
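To see how one number means different things, here is a short Python 3 sketch that decodes the same single byte with three code pages. The byte value and the three code pages are an arbitrary pick for illustration; the codec names are the ones that ship with CPython.

```python
raw = bytes([0xE4])  # one byte, no encoding information attached

# Same number, three different tables.
for codec in ("latin-1", "cp437", "iso8859_5"):
    print(codec, "->", raw.decode(codec))
```

This prints a German "ä", a Greek "Σ" and a Cyrillic "ф", all from the very same byte.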

Unicode

Finally, somebody came to the realization that there are more than 256 unique characters out there. Back in the day, when every byte counted, I can see that people did not want to use two bytes per character.

If you just use two bytes per character, you get 65536 possible characters. This is a start. (This is UCS-2, the fixed-width predecessor of UTF-16, by the way.) But as we have seen above, settling on any arbitrary limit does not really work: there are so many characters, and new ones keep being added, like the €-sign.

As a side effect, every single text becomes twice as large, even if it is only English text where even 128 characters would be fine. But the idea is clear, there are many more characters than we can think of right away.
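A quick Python 3 sketch of both effects. Python has no plain two-byte codec, so UTF-16 stands in here; for everything within the first 65536 characters it also uses exactly two bytes.

```python
english = "Hello"
print(len(english.encode("ascii")))      # 5, one byte per character
print(len(english.encode("utf-16-le")))  # 10, two bytes per character

# U+1F600 lies beyond the 65536 values that two bytes can hold,
# so UTF-16 has to spend four bytes (a surrogate pair) on it.
emoji = "\U0001F600"
print(len(emoji.encode("utf-16-le")))    # 4
```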

The solution seems to be to have a huge table, the Unicode table, where each character gets a unique number. So there is no way to mix up "ø" with "ß", as always happened when your encodings were off.

Since the numbers grow pretty big, some are bigger than two bytes, it seems that you need lots of bytes to store a character. But this again is a fallacy. Say there are numbers that are up to three bytes long, and you decide that you use three bytes to represent a char. Not only will you store a lot of leading zeros that way, but what happens if the table grows up to four bytes worth of characters? Not going to happen? Yeah, just like the oil will last forever or so.

So there is a need for numbers that can get as big as needed but do not take up more space than needed. This is where the UTF-8 encoding comes into play.

Important: Unicode and UTF-8 are independent, UTF-8 is just a means to write the Unicode numbers down. That is why UTF stands for "Unicode Transformation Format".
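In Python 3 you can look up that number with the built-in `ord()`. A tiny sketch, with `code_points` being just an illustrative name:

```python
# Map a few characters to their Unicode numbers (code points).
code_points = {ch: ord(ch) for ch in "øß€"}

for ch, number in code_points.items():
    print(ch, f"U+{number:04X}")
```

Each character has its own number, no matter which language the surrounding text is in.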

It boils down to using one byte for the 128 ASCII characters that sat in the lower part of the encodings all along. Two bytes are needed for letters like "ä", and for Greek or Cyrillic ones. Characters that do not fit there either, like "€", need three bytes, and the rarest ones need four. But only that one character pays, not the whole text.
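The growing length is easy to check in Python 3. The sample characters are my own pick: an ASCII letter, an umlaut, the €-sign and an emoji from beyond the first 65536 characters.

```python
samples = ["a", "ä", "€", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Code point, number of bytes, and the bytes themselves in hex.
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex())
```

The byte counts come out as 1, 2, 3 and 4.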

German texts become a little bigger since every "ä" is now two bytes long and not just one, but it is unique and unambiguous. Since I use UTF-8 for this text, it is not much bigger than if I had used some latin1 or so, but I can casually store all the characters I want in it.

The downside is that people whose alphabet consists only of letters that are special from the English perspective pay the bill in file size, since all of their characters require two or three bytes. With their local encodings, they might have needed less space for all this.
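To put rough numbers on that bill, here is a Python 3 sketch comparing a German and a Russian sample sentence in a local encoding and in UTF-8. The sentences and the choice of KOI8-R as the Russian encoding are just assumptions for the example.

```python
german = "Schöne Grüße aus dem Büro"
russian = "Привет, мир"

# Local one-byte encoding versus UTF-8, in bytes.
print(len(german.encode("latin-1")), len(german.encode("utf-8")))    # 25 29
print(len(russian.encode("koi8-r")), len(russian.encode("utf-8")))   # 11 20
```

The German text grows by four bytes, one for each special character. The Russian text almost doubles.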

But I do think it is worth it.

Unicode Implementations in Real Life

PHP and Websites

Sadly, there is so much software out there that does not understand Unicode or does not use it as a default. To give you one example, I tried to set up a web application with PHP and MySQL in UTF-8. These are all the points where I need to specify that I want to use UTF-8:

MySQL administration interface connection (phpMyAdmin)
MySQL table
MySQL table text fields
MySQL <-> PHP connector
PHP functions like htmlentities()
HTTP header
HTML document

If you forget it in a single spot, everything is messed up since programs will default to latin1. I learned about the encoding of the MySQL-PHP connector the hard way, since it has to squeeze the tons of characters into 256 slots when not using UTF-8. So all my special chars just became question marks. If you forget to set it in the HTML document, the browser will think that it is "regular" text and display every byte in the encoding of its choice, which depends on your locale. That way, multi-byte characters become multiple characters, which is not what is wanted either.
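The question marks are easy to reproduce. The following is a Python 3 sketch of the effect only, not of what the MySQL connector does internally: squeezing a text into Latin-1 and replacing whatever has no slot there.

```python
text = "Büro: 5 €"

# Latin-1 has a slot for the ü but none for the €;
# errors="replace" substitutes a question mark for it.
squeezed = text.encode("latin-1", errors="replace")

print(squeezed.decode("latin-1"))   # Büro: 5 ?
```

Once the question mark is stored, the original character is gone for good.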

Python

In Python 3, they nailed it. There are basically two data types: str, which holds Unicode characters, and bytes.

In C, your strings are char*, i.e. arrays of chars, which are arrays of bytes. This is raw data. Without knowing the encoding, you cannot know any of the letters, in a strict sense. You need to decode that raw data into characters. Since ASCII has the lower 128 values fixed, there is a one-to-one correspondence for those. But again, this works only for the English language.

Python 3 does not let you treat raw bytes as text: bytes.upper() only touches the ASCII letters, and mixing bytes with str raises a TypeError. It forces you to call decode() with an encoding (UTF-8 by default) to convert the raw data into characters. Then you can change to uppercase properly.

Before saving that in a text file to disk, you need to encode() again, or let open() do it for you via its encoding parameter. Since a str holds characters, not bytes, you have to convert it.
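The whole round trip as a Python 3 sketch. The file name and the sample bytes are made up; the file lives in a temporary directory so that nothing is left behind.

```python
import os
import tempfile

raw = b"b\xc3\xbcro"                # bytes as they might come from a file or socket
text = raw.decode("utf-8")          # bytes -> str
shouted = text.upper()              # works on characters, so the ü is handled
encoded = shouted.encode("utf-8")   # str -> bytes

# Write the bytes to disk and read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "buero.txt")
    with open(path, "wb") as f:
        f.write(encoded)
    with open(path, "rb") as f:
        read_back = f.read()

print(raw.upper())   # b'B\xc3\xbcRO', only the ASCII letters changed
print(shouted)       # BÜRO
```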

Conclusion

My dream is that one day UTF-8 is the default for everything, and there is no need for encoding fields any more.

This is a great example of how a quick and dirty implementation led to a lot of hacks to fix the problem, only to create more problems that make using the solution even harder.

But, I would not have all this freaking fun with encodings if there was only one …