2010-02-04

Background

UTF-32 is a character encoding that uses fixed-size 32-bit elements to represent each code point. A code point is not equivalent to a character; a single character can be made up of more than one code point combined together, such as a base letter plus a combining accent.
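
As a concrete illustration, here is a minimal Python 3 sketch (the language and the example string are my own choices, not anything from the Unicode spec itself):

    # One user-perceived character built from two code points:
    # LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
    s = "e\u0301"
    print(len(s))                            # 2 code points
    print(len(s.encode("utf-32-be")) // 4)   # still 2 fixed-size UTF-32 elements
    print(s)                                 # displays as a single accented "e"

The "-be" codec is used only so Python doesn't prepend a byte order mark to the encoded output.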

UTF-16 is another character encoding that encodes the most commonly used code points as single 16-bit elements. The remaining code points are encoded with two 16-bit elements (a surrogate pair).
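
For example (another Python 3 sketch; U+1D11E, MUSICAL SYMBOL G CLEF, is just a convenient code point outside the 16-bit range):

    # A code point in the 16-bit range vs. one that needs a surrogate pair.
    bmp    = "A"            # U+0041
    clef   = "\U0001D11E"   # U+1D11E, MUSICAL SYMBOL G CLEF
    print(len(bmp.encode("utf-16-be")))    # 2 bytes -> one 16-bit element
    print(len(clef.encode("utf-16-be")))   # 4 bytes -> two 16-bit elements (surrogate pair)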

UTF-8 is another character encoding that encodes the most commonly used code points as single 8-bit elements. The remaining code points are encoded with two, three, or four 8-bit elements. All ASCII characters are represented in UTF-8 by the same single bytes as in plain ASCII.
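
The same sort of throwaway Python 3 sketch shows the element counts (the particular sample characters are arbitrary):

    # UTF-8 element count grows with the code point value.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
        print("U+%04X" % ord(ch), len(ch.encode("utf-8")), "byte(s)")
    # U+0041 -> 1, U+00E9 -> 2, U+20AC -> 3, U+1D11E -> 4
    print("ASCII".encode("utf-8") == "ASCII".encode("ascii"))   # True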

There are additional encodings, such as UTF-7 and UTF-EBCDIC, which are not described here because they aren't relevant to this rant.

Rant

I've often read that UTF-16 is used by many APIs because it is a "good compromise" between the space efficiency of UTF-8 and the simplicity of UTF-32, where every code point fits in a single fixed-size element (a marginal simplification anyway, since UTF-32 can still require more than one code point to represent one character). This is bullshit. UTF-16 only exists because 16-bit encodings were already in use before UTF-32 and UTF-8 were defined. The sole disadvantage of UTF-8 compared to UTF-32 is that software must handle variable-length encodings of code points. Software must handle variable-length encodings for UTF-16 too. And almost all UTF-16 strings require more space than the equivalent UTF-8 strings. So UTF-16 combines the disadvantage of variable-length handling relative to UTF-32 with the disadvantage of increased storage space relative to UTF-8.
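
To put rough numbers on the space argument, here is one more Python 3 sketch; the sample string is arbitrary ASCII text of my own choosing:

    # Encoded size of the same (all-ASCII) string under each encoding.
    s = "Almost any ASCII-heavy string will do."
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        print(enc, len(s.encode(enc)), "bytes")
    # utf-8 uses one byte per character here; utf-16-be uses twice
    # as many bytes and utf-32-be four times as many.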

allanh@kallisti.com