Unicode vs Encoding

In this post I am going to try to explain, as quickly and simply as possible, the difference between Unicode and Encoding.

But first, here is the problem these two concepts aim to solve: how do we represent different symbols / characters in the world of computers?

In the beginning there was ASCII, which represented each character with a number between 0 and 127. The printable characters start at code 32; characters with smaller codes were called “unprintable” / “control” characters (e.g. 10 is “line feed”, 8 is “backspace”, etc.). Thus, all ASCII characters could be stored in the first 7 bits of a byte, and everyone was so happy, thinking that a big problem was solved forever.
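
To make this concrete, here is a tiny Python sketch (my own illustration, not something the ASCII standard itself defines) showing how a few characters map to their ASCII codes and how each code fits in 7 bits:

    # Each ASCII character has a code between 0 and 127, so it fits in 7 bits.
    for ch in ["A", "a", "0", "\n"]:
        code = ord(ch)                      # the numeric ASCII code
        print(repr(ch), code, format(code, "08b"))
    # 'A' 65 01000001
    # 'a' 97 01100001
    # '0' 48 00110000
    # '\n' 10 00001010   <- "line feed", one of the control characters below 32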

The biggest problem with the ASCII standard was that it focused only on the requirements of the English language. Soon, A LOT of “standards” emerged that used the remaining 128 values of the byte (128 to 255) to encode specific characters / symbols required by other languages or by various application needs. Problems appeared when someone in England (for example) sent a byte with the value 130 that had a certain meaning according to one standard, but that byte was interpreted in China or France according to a different standard.

This mess is what the two concepts at the heart of this post address. And we need both of them because, as we will soon see, the big problem was split into two parts, one solved by Unicode and the other by Encoding.

Unicode

Unicode solves the “labeling” / “ID-ing” problem: its only purpose is to make sure that every symbol used in the world gets a unique ID (a number). Unicode does not care how this number will actually be encoded by computers (that is the job of another standard). Unicode just makes sure that any symbol receives a unique ID (again, a number); this ID is called a “code point”. There is practically no limit on the number of symbols Unicode can handle, because there will always be a number available to associate with a new symbol. Oh, and another smart decision was made to ease the transition from ASCII to Unicode: the first 128 Unicode code points (0 to 127) are assigned to the old ASCII characters.
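
To make the “code point” idea concrete, here is a small Python sketch (my own illustration; it just asks the interpreter for the code point of a few symbols):

    # Every symbol has a unique Unicode code point, no matter how it will
    # later be stored in memory or on disk.
    for ch in ["A", "é", "€", "中"]:
        print(ch, "-> U+%04X (decimal %d)" % (ord(ch), ord(ch)))
    # A -> U+0041 (decimal 65)    <- same value as in ASCII
    # é -> U+00E9 (decimal 233)
    # € -> U+20AC (decimal 8364)
    # 中 -> U+4E2D (decimal 20013)

Note how “A” keeps the value 65 it had in ASCII, which is exactly the transition-easing decision mentioned above.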

Encoding

So, Unicode assigned a number to a symbol. Now it’s time to encode that number in a way that allows computers to use it easily / efficiently. This is the job of the encoding standards. And, as usual when it comes to optimization / efficiency problems, there is more than one answer depending on the application. This is why several encoding standards have been defined to address different use cases: UTF-8, UTF-16, UTF-7, UCS-4, etc.
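
As a quick illustration of “one code point, several possible byte representations” (a Python sketch of my own, not taken from any particular standard), the same string produces different bytes under different encodings:

    # The Unicode code points are identical; only the byte-level encoding differs.
    text = "héllo"
    print(text.encode("utf-8"))      # b'h\xc3\xa9llo'                   (6 bytes)
    print(text.encode("utf-16-le"))  # b'h\x00\xe9\x00l\x00l\x00o\x00'   (10 bytes)
    print(text.encode("utf-7"))      # b'h+AOk-llo'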

The most popular is (arguably) UTF-8, as it provides a very smart / concise way of encoding the Unicode code points. Here are the rules:

  • As mentioned above, the first 128 Unicode code points are exactly the original 128 ASCII characters. These are encoded as in the old ASCII days: on one byte. This has a very important consequence: documents that contain mostly old-ASCII characters keep a very efficient encoding under this standard. Another nice thing is that an old document containing only old-ASCII characters will be correctly interpreted by this standard (thus, you have backward compatibility with old ASCII). So the first rule is: if we encounter a byte whose first (most significant) bit is “0”, we have a Unicode symbol encoded in a single byte: 0xxxxxxx (the 7 bits represented as “x” contain the Unicode ID).
  • If the Unicode ID does not fit in 7 bits, the next step is to use 2 bytes. The pattern for 2 bytes is: 110xxxxx 10xxxxxx. In other words, the first byte starts with two “1” bits (signifying that we use 2 bytes to encode the Unicode symbol), followed by a “0” bit and then the bits actually used to encode the Unicode ID. The second byte starts with the bits “10”, which signifies that it is a continuation byte. In summary, we have 5 bits in the first byte + 6 bits in the second byte, thus 11 bits available to encode Unicode IDs on 2 bytes.
  • If 2 bytes are not enough, we can use 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx. The first byte starts with three “1” bits (signifying that we use 3 bytes to encode the Unicode symbol), followed by a “0” bit and then the bits actually used to encode the Unicode ID. The second and third bytes start with the bits “10”, which signifies that they are continuation bytes. In summary, we have 4 bits in the first byte + 6 bits in the second byte + 6 bits in the third byte, thus 16 bits available to encode Unicode IDs on 3 bytes.
  • The same rule can be applied to encode on 4, 5 and 6 bytes (6 was the upper limit in the original UTF-8 design; the current standard stops at 4 bytes, but the scheme is the same). The pattern for the 6-byte encoding is: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx. As before, the first byte tells us how many bytes are used to encode this Unicode symbol (six “1” bits, thus we’ll use 6 bytes), and the remaining 5 bytes are all continuation bytes (starting with “10”). In the 6-byte representation you have 1 + 5 × 6 = 31 bits available to hold a Unicode code point.
  • Thus, with the original 6-byte form of UTF-8 you can encode up to 2^31, i.e. 2,147,483,648, Unicode symbols; quite impressive, right? (A small encoder sketch following these bit rules appears right after this list.)
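
To show the bit rules above in action, here is a minimal Python sketch of a UTF-8 encoder (my own illustration; it covers only the 1- to 4-byte forms, which are enough for every code point Unicode currently assigns), checked against Python’s built-in encoder:

    # Minimal sketch of the UTF-8 bit rules described above.
    def utf8_encode(code_point: int) -> bytes:
        if code_point < 0x80:          # fits in 7 bits  -> 0xxxxxxx
            return bytes([code_point])
        if code_point < 0x800:         # fits in 11 bits -> 110xxxxx 10xxxxxx
            return bytes([0b11000000 | (code_point >> 6),
                          0b10000000 | (code_point & 0x3F)])
        if code_point < 0x10000:       # fits in 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0b11100000 | (code_point >> 12),
                          0b10000000 | ((code_point >> 6) & 0x3F),
                          0b10000000 | (code_point & 0x3F)])
        # fits in 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0x3F),
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])

    for ch in ["A", "é", "中", "😀"]:
        assert utf8_encode(ord(ch)) == ch.encode("utf-8")
        print(ch, utf8_encode(ord(ch)))
    # A b'A'
    # é b'\xc3\xa9'
    # 中 b'\xe4\xb8\xad'
    # 😀 b'\xf0\x9f\x98\x80'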

Another very important property of UTF-8 is that it allows you to resynchronize when moving forward or backwards. What does this mean? Let’s say that you perform a SEEK operation in a file encoded with UTF-8. The byte that you read tells you whether it is a continuation byte (it starts with “10”) or not. If it is, you can move backward (or forward) in the file, byte by byte, until you reach the first byte that is not a continuation byte. That is the place from where you can safely start interpreting UTF-8 codes.
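
Here is a small Python sketch of that resynchronization idea (my own illustration): starting from an arbitrary byte offset, step backwards while the current byte starts with the bits “10”, then decode from there.

    # Resynchronizing after a SEEK into the middle of UTF-8 encoded data.
    data = "naïve café".encode("utf-8")

    def start_of_character(buf: bytes, pos: int) -> int:
        # Move backwards while the current byte is a continuation byte (10xxxxxx).
        while pos > 0 and (buf[pos] & 0b11000000) == 0b10000000:
            pos -= 1
        return pos

    pos = 3                              # pretend a SEEK landed us inside the "ï"
    safe = start_of_character(data, pos)
    print(data[safe:].decode("utf-8"))   # "ïve café" - decoding restarts cleanly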

As mentioned before, UTF-8 is just one of the available encoding standards. To make sure that the decoder uses the right algorithm, whenever you send a document (e.g. an email, a web page) you need to send the encoding information along with it:

         email: Content-Type: text/plain; charset="UTF-8"
         web page: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    

Obviously, this works under the assumption that a commonly agreed default encoding scheme (usually ASCII) is used until the encoding information is encountered. After that, the declared encoding is used to interpret the rest of the document (email or web page), which is why the encoding information needs to be provided as early as possible in the document.
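
As a small sketch of why that declaration matters (again a Python illustration of my own), the same bytes decode very differently depending on which charset the decoder assumes:

    raw = "café".encode("utf-8")   # the sender produced UTF-8 bytes: b'caf\xc3\xa9'

    print(raw.decode("utf-8"))     # café     <- decoded with the declared charset
    print(raw.decode("latin-1"))   # cafÃ©    <- decoded with a wrong legacy default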
