Lithoglyph

On Unicode

UTF-8 is a truly great invention. In an oversimplified sense, Ken Thompson's insight saved the entire Unicode project. The beauty of UTF-8 derives from two observations (among many others): (1) most C code works in an encoding-agnostic manner, so long as it doesn't fiddle with sign semantics (i.e. it doesn't do the if (character < 0) thing); (2) modern C compilers have kept char one byte wide, and no one bothers (or dares) to change the type's fundamental definition anymore, no matter how many unnecessary CHAR, Ch, UINT8, and int8_t typedefs there have been.
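To make the first observation concrete, here is a minimal sketch of byte-oriented code that handles UTF-8 without knowing anything about the encoding. The function names split_on_ascii and count_code_points are ours, made up for illustration; the sketch assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split a UTF-8 string on an ASCII delimiter by scanning raw bytes.
// Every byte of a multi-byte UTF-8 sequence has its high bit set, so it can
// never compare equal to an ASCII delimiter.
std::vector<std::string> split_on_ascii(const std::string& utf8, char delim) {
    // The guarantee above only holds for ASCII delimiters (0x00-0x7F).
    assert(static_cast<unsigned char>(delim) < 0x80);
    std::vector<std::string> parts;
    std::string current;
    for (char c : utf8) {
        if (c == delim) {
            parts.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    parts.push_back(current);
    return parts;
}

// Count code points by counting every byte that is not a continuation byte
// (10xxxxxx). The cast to unsigned char avoids the sign trap: with a signed
// char, bytes >= 0x80 would be negative.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        const unsigned char b = static_cast<unsigned char>(c);
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

The only place the code looks at bit patterns is count_code_points, and there it goes through unsigned char first. The splitter never needs to know it is handling UTF-8 at all.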

At Lithoglyph we follow this insight and build our internal code on the principle of encoding agnosticism. Even when we use char, we use it as if it were unsigned char. We rely only on byte semantics, and we use UTF-8.

As a developer of Mac OS X software, we are lucky to be able to use UTF-8 most of the time. Cocoa's NSString uses UTF-16 internally, but it comes with two nifty and apropos methods for working with UTF-8. The class method +stringWithUTF8String: and the instance method -UTF8String are all you need to connect your working parts (say, string processing code written in C++) to user interface code written with Cocoa. Another good thing is that the compiler always translates @"string" into an NSString (which is UTF-16 internally). But don't ever put non-ASCII characters in such a literal; use [NSString stringWithUTF8String:"ああ"] instead.

We later developed a similar strategy on Windows. The rule is simple: we use UTF-8 throughout the internal workings and interface with Windows only in UTF-16, or more precisely wchar_t and wstring.

It's a known (and, according to some, unfortunate) fact that the Windows platform has a plethora of string libraries (think of Microsoft's own Standard C library, the Platform SDK [Win32], MFC/ATL, then the CLR, and we're not even counting the libraries in VB and OLE), and many of them come in two (or three) variants: ASCII (actually, "whatever conforms to the code page of the running Windows") and Unicode (UTF-16LE). There is also MBCS (multi-byte character sets, variable-width like UTF-8), which is often counted as ASCII but sometimes stands on its own, sui generis. From a historical perspective, each of those libraries has existed for its own reasons. It's easy to deride Microsoft engineers' nearsightedness, but let's not forget that a culture like Windows's has its own burden to bear (one that is not easy to shake off), and therefore there is always compromise (read: backward compatibility for all!) in the design.
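To make the rule from the start of this section concrete (UTF-8 inside, UTF-16 only at the boundary), here is a portable sketch of the conversion step. It is not our production code: on Windows you would more likely call MultiByteToWideChar with CP_UTF8 and get wchar_t back. The names utf8_to_utf16 and append_utf16 are hypothetical, and the sketch uses char16_t so that it compiles anywhere. Malformed input becomes U+FFFD; overlong forms and encoded surrogates are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one code point as UTF-16, using a surrogate pair above U+FFFF.
void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);  // caller must pass a value in Unicode range
    if (cp >= 0x10000) {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    } else {
        out.push_back(static_cast<char16_t>(cp));
    }
}

// Hypothetical boundary helper: decode UTF-8 bytes into UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in) {
    const char16_t replacement = static_cast<char16_t>(0xFFFD);
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes expected after b0
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else { out.push_back(replacement); ++i; continue; }  // invalid lead byte

        bool ok = (i + extra < in.size());  // sequence must not be truncated
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) { out.push_back(replacement); ++i; continue; }

        append_utf16(out, cp);
        i += extra + 1;
    }
    return out;
}
```

The point of keeping this in one small function is that everything above it stays in UTF-8; only the thin layer that actually calls into the OS ever sees 16-bit code units.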

Excellent books like J. M. Hart's Windows System Programming offer programmers a few choices, or "Unicode strategies", one of which they should pick at the very beginning when they start on the platform. We take the advice of Hart's book and choose to use Unicode throughout.

In fact, we often call the W-suffixed API functions explicitly in low-level code, to tell others not to expect it to work in the so-called ASCII environment anymore. I think that's a very important point to make. We live in 2007, almost 2008 now. There's no point in using "ASCII"-based APIs anymore.

So to sum up: UTF-8 throughout the code, UTF-16 when interfacing with OS X and Windows.

But what about locale-specific processing like sorting names? Our observation is that it's best to leave that part to the UI, and fortunately in 99% of cases it's the UI that needs to be locale-aware. We try to design the internal workings to be as locale-neutral as possible. This not only fits our belief in encoding agnosticism, it can also yield efficient code, because we can still use the C standard library's qsort or C++'s generic sort without much worry. We let the UI code deal with the locale (both OS X and Windows have their own excellent libraries for that anyway).
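As a rough illustration of the locale-neutral sorting mentioned above, the sketch below sorts UTF-8 strings by raw bytes. The names bytewise_less and sort_locale_neutral are made up for this example. For valid UTF-8, byte order happens to equal code point order, which gives a stable, deterministic order for internal use; it is not any locale's collation, so user-visible lists should still be sorted by the UI layer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Byte-wise "less than". memcmp compares bytes as unsigned values, so bytes
// >= 0x80 sort after ASCII regardless of whether char is signed.
bool bytewise_less(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    const int c = std::memcmp(a.data(), b.data(), n);
    if (c != 0) return c < 0;
    return a.size() < b.size();  // on a common prefix, the shorter string first
}

// Sort names for internal use only (lookup tables, deduplication, and so on).
void sort_locale_neutral(std::vector<std::string>& names) {
    std::sort(names.begin(), names.end(), bytewise_less);
    // Sanity check in debug builds.
    assert(std::is_sorted(names.begin(), names.end(), bytewise_less));
}
```

With qsort the idea is the same: the comparison callback would boil down to a memcmp over the two byte buffers plus a length tiebreak.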

Copyright © 2007-2008 Lithoglyph Inc.
All rights reserved.