Unicode: One character set to rule them all

It’s time for a rant about Unicode. This, as some of us know (and nobody should have to know), is a way of encoding characters such that every character in every language in the world (and lots more besides) has its own unique computer name, and thus, no matter what written languages I think my computer knows, or what written languages the person reading my web page or receiving my email thinks they know, the computer can seamlessly display what I wrote, as I wrote it. I could be talking about the café I visited last night, or about a place on the California coast, Año Nuevo, which would not be translated as “New Year’s Day” without that squiggle over the first “n.” I could be using an entirely different alphabet—Tamil or Devanagari or Arabic. Doesn’t matter. Unicode means that I no longer have to pay attention to anything but what I meant to say and meant to write. My words don’t get squiggles when read by Mac users or Windows users or people using the opposite of whatever I was using that morning. All that is past.

Or should be.

Back in December, folks at Google’s blog noted that a majority of the world’s web pages were now encoded using Unicode. Sadly, that means that almost half of the world’s web pages, and an uncounted number of spreadsheets, databases, and other repositories of written knowledge, still are not.

Get with it, people. We are almost at the end of an entire decade past the second millennium of the common era. Every major modern operating system supports Unicode out of the box—you have to tell applications to ignore that wisdom and use a more limited character set. So, if your collections ever refer to a person who has an accent in his or her name, oops. If your collections ever require characters not in ASCII or some other country-specific character set, oops. If you ever need to share information with people in another country whose default character set is different from yours, and you force your applications to ignore Unicode, oops.

This is even true at my own institution, where I had to demand that our webmaster start storing things in databases that use Unicode (UTF-8, usually), and start putting the correct character set in our page headers. We’re getting there.
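For the curious, here is a minimal sketch of what “the correct character set in our page headers” looks like in practice. I’m using Python’s standard http.server purely as an illustration (this is not what our webmaster actually runs): the page is stored and sent as UTF-8 bytes, and both the HTTP Content-Type header and the HTML meta tag say so.

```python
# Minimal sketch: serve a page that declares UTF-8 in both the HTTP header
# and the HTML <meta> tag. Names and content here are illustrative only.
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Año Nuevo</title></head>
<body><p>café, Año Nuevo, תודה רבה</p></body>
</html>"""

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")  # store and send the page as UTF-8 bytes
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Utf8Handler).serve_forever()
```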

And this is how I get bitten by encoding illiteracy with regularity. I was pasting some text from Word into my trusty HTML editor a few weeks ago and got gibberish. Oops. The application doesn’t understand Unicode. A friend of mine wanted to blog in Yiddish while on a jaunt to the old country. Oops. MySpace.com actually embeds code on its web pages that says, in effect, “English only” (charset=iso-8859-1). Today, I noticed a message from someone on a Macintosh support list asking what Unicode was, and someone else carefully gave him a wrong wrong wrong answer, suggesting that it was best to simply code everything in “text.” The illiterate one meant “unformatted text supporting only those characters that English-speaking computer programmers use,” which is not the “text” those of us who use Unicode mean.
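To see how that gibberish happens, here is a minimal sketch in Python (my illustration, not anyone’s actual code) that encodes a string as UTF-8 and then decodes the same bytes as ISO-8859-1, which is roughly what an application that ignores Unicode, or a page header that declares charset=iso-8859-1, does to your text.

```python
# Minimal sketch: what happens when UTF-8 bytes are read as ISO-8859-1 (Latin-1).
text = "café at Año Nuevo"          # non-ASCII characters from the examples above

utf8_bytes = text.encode("utf-8")    # how a Unicode-aware editor saves the text

# An application that assumes Latin-1 reinterprets those same bytes...
mangled = utf8_bytes.decode("iso-8859-1")
print(mangled)                       # cafÃ© at AÃ±o Nuevo  -- the familiar gibberish

# Text outside Latin-1 entirely, such as Hebrew, cannot even be stored:
hebrew = "תודה רבה"
try:
    hebrew.encode("iso-8859-1")
except UnicodeEncodeError as err:
    print(err)                       # those characters simply do not exist in that charset
```

Run it and you get “cafÃ©” instead of “café,” and an outright error for anything outside the Latin-1 repertoire.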

This is an issue that matters a lot to cultural institutions. We now have a way to be alphabet and character set agnostic, and to abide by current international standards by doing so. Let’s not see any more websites or database applications that assume illiteracy or special, obsolete codesets. (For more on this subject, see Joel Spolsky’s excellent rant, written five years ago.)

תודה רבה

Thank you
