These days, more and more of our web apps have to be ready and able to support international characters. It's a non-trivial problem, and one that causes many furrowed brows, because usually by the time you notice it, you've already screwed up some data. I've dealt with it many times, and found that it's generally much easier to prepare for the problem before it occurs, rather than hack a solution together after the fact.
This is not a post about i18n-ing your display templates; that's a whole topic in itself, even though some standard mechanisms are pretty well-defined by now. It's about the issues involved in storing and displaying content with non-English characters.
A full discussion of character encodings, and the headaches thereof, is WAAAY beyond the scope of this post. It could fill a pretty weighty multi-volume book all by itself. So I'll just refer you to Joel Spolsky's article on the topic - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - and say that the character set that's most commonly used for international characters is UTF-8.
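To make the "variable number of bytes" idea a bit more concrete, here's a minimal sketch in Python (not CFML, purely for illustration). The sample characters and the word "café" are my own picks, not anything special. It shows how many bytes UTF-8 uses for a few characters, and what happens when UTF-8 bytes get read back with the wrong character set.

```python
# Each character is one Unicode code point, but UTF-8 spends 1 to 4 bytes on it.
samples = ["a", "\u00e9", "\u20ac", "\U0001d11e"]  # a, e-acute, euro sign, musical G clef
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# Reading UTF-8 bytes back with the wrong character set garbles the text.
original = "caf\u00e9"                         # "café"
stored_bytes = original.encode("utf-8")
misread = stored_bytes.decode("latin-1")       # wrong charset: accented letter is mangled
round_trip = stored_bytes.decode("utf-8")      # right charset: text comes back intact

print(byte_lengths)
print(repr(misread), repr(round_trip))
```

The bytes themselves are fine in the misread case; it's only the interpretation that's wrong. That's the recoverable kind of mistake. The unrecoverable kind is when the garbled version gets saved back over the original.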
So here's a quickie list of things to check for and bear in mind:
String Processing Issues
- Use Unicode strings
Internally, CFMX is entirely based on Java, and so it "should" use Unicode strings by default. The main things that you need to worry about are when data goes in - into the database - and when it comes out - gets presented to the user. Of these two, the most important is the first - so long as data is being stored correctly, you'll always be able to get it out again. If it's being stored incorrectly, you're a bit screwed :)
- If you're using any Regular Expressions for processing or validating strings, BE CAREFUL!
It's very common to use expressions such as [a-zA-Z] to check for letters, or [a-zA-Z0-9] to check for alphanumerics. What do you think is going to happen if you pass an accented character such as à or é through this regex? Yup - é is NOT between a and z, so it will not match. How best to handle this is up to you - the POSIX regex elements such as [[:alpha:]] are good enough for some situations, but not others. For instance, POSIX does not allow more than 20 characters to be categorized as digits, whereas there are many more than 20 digit characters in Unicode. There's more detail on unicode.org.
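Here's a rough sketch of the regex problem, again in Python rather than CFML, so treat it as an illustration of the idea and not a drop-in fix. The function names are made up for the example. It compares the classic [a-zA-Z] check with a Unicode-aware one, and normalises the input first so that "e plus combining accent" is treated the same as a single "é".

```python
import re
import unicodedata

ASCII_LETTERS = re.compile(r"[a-zA-Z]+")
# [^\W\d_] means "a word character that is not a decimal digit or underscore",
# which is close to "a letter in any script". Python's re has no \p{L} shorthand.
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")

def is_ascii_word(text):
    """True only if text consists entirely of unaccented Latin letters."""
    return ASCII_LETTERS.fullmatch(text) is not None

def is_unicode_word(text):
    """True if text consists entirely of letters, in any script."""
    # NFC folds "e" + combining acute accent into the single character e-acute.
    normalised = unicodedata.normalize("NFC", text)
    return UNICODE_LETTERS.fullmatch(normalised) is not None

print(is_ascii_word("caf\u00e9"), is_unicode_word("caf\u00e9"))
# The same trap exists for digits: [0-9] misses Arabic-Indic digits, \d does not.
print(re.fullmatch(r"[0-9]+", "\u0663\u0664"), re.fullmatch(r"\d+", "\u0663\u0664"))
```

Whether you actually want to accept every script is a design decision, not a technical one. The point is to make that decision on purpose rather than have [a-zA-Z] make it for you.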
Data Storage
- Store everything in UTF-8
Even if you're designing an app that "won't ever need anything except English", a little thought and effort at the design stage will mean that next time you have to write a similar app that does need international characters, you can re-use the code. Besides, it's just good practice, and the extra storage overhead of storing up-to-4-bytes-per-character is relatively easy to deal with in these days of 750GB disks for £250.
- Check the collation on your databases, tables, and columns
Collations can be set at all levels of specificity, from the server right down to individual columns. Make sure that they're ALL UTF-8. The easiest way to do this is to generate a CREATE... script for your database (SQL Server) or a mysqldump file (MySQL), open it up in a text editor and search for COLLATE.
- In SQL Server, make sure that any character-based field that can be populated from user-entered data in any way is specified as a Unicode field
i.e. the type starts with an N. varchar fields should be nvarchar. text fields should be ntext (or nvarchar(max) from SQL Server 2005 onwards, where ntext is deprecated).
- Plan ahead!
Plan to cope with multiple character sets up front, and you know your database can handle just about anything you're likely to throw at it. Sweeping it under the carpet and saying "we'll worry about that when it happens" is likely to end up with you taking your app offline while you rebuild tables and text indices. This can easily take several hours, for anything above a couple of thousand rows. Long downtimes make unhappy users.
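If the dump file is too big to eyeball, the "search for COLLATE" step can be scripted. This is a minimal sketch in Python, assuming a MySQL-style dump where the character set and collation names start with utf8. The sample dump, the table and the function name are all made up for the example, and it won't make sense of SQL Server collation names, which follow a different scheme.

```python
import re

# Hypothetical mysqldump fragment, invented for this example.
SAMPLE_DUMP = """\
CREATE TABLE `users` (
  `id` int NOT NULL,
  `nickname` varchar(50) CHARACTER SET latin1 COLLATE latin1_swedish_ci,
  `bio` text COLLATE utf8_unicode_ci
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
"""

# Matches "COLLATE name", "COLLATE=name", "CHARSET=name" and "CHARACTER SET name".
DECLARATION = re.compile(
    r"\b(COLLATE|CHARSET|CHARACTER SET)\s*=?\s*(\w+)", re.IGNORECASE
)

def find_non_utf8(dump_text):
    """Return (line_number, keyword, name) for each declaration that is not utf8*."""
    problems = []
    for line_number, line in enumerate(dump_text.splitlines(), start=1):
        for keyword, name in DECLARATION.findall(line):
            if not name.lower().startswith("utf8"):
                problems.append((line_number, keyword.upper(), name))
    return problems

for problem in find_non_utf8(SAMPLE_DUMP):
    print(problem)
```

In the sample, only the nickname column gets flagged, which is exactly the kind of stray column-level override that's easy to miss when the table default looks fine.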
Presentation
- Explicitly declare the UTF-8 character set on every page
Make sure that every page has a meta tag like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
...and also make sure that it's the FIRST tag in the Head of your doc, because as soon as a browser encounters a charset declaration, it starts re-parsing from the top.
- You "should" specify the primary language in the HTML tag
In ordinary HTML, this is just:
<html lang="en-GB">
In XHTML 1.0, it's slightly different:
<html lang="en-GB" xml:lang="en-GB" xmlns="http://www.w3.org/1999/xhtml">
And in XHTML 1.1, you don't need the lang attribute:
<html xml:lang="en-GB" xmlns="http://www.w3.org/1999/xhtml">
This makes it much easier for text readers, translation software, and - crucially - search engines to recognise the language and take appropriate action. More details at W3.org
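- For multi-language markup, declare the language of each part
The lang and xml:lang attributes take a single language tag, not a list; a comma-separated list of languages is only valid in the Content-Language HTTP header (or its meta equivalent), which describes the intended audience of the page as a whole. Within the page, you can (and should) specify the language on any element that is in a different language to the primary language. If your document structure doesn't break down to a logical tag that encompasses the different-language part, then use a span tag: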
<span lang="fr-CA" xml:lang="fr-CA">....</span>
- Language is a CSS pseudo-class
This means that you can specify different styling for different languages, like so:
/* smaller font for documents in German */
html:lang(de) { font-size: 90%; }
/* italicise any bits of French in any document */
:lang(fr) { font-style: italic; }
/* change the quotation marks for any q tag that is a DIRECT CHILD of a French element */
:lang(fr) > q { quotes: '« ' ' »'; }
More details at W3.org
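The two page-level checks above (charset declaration first in the head, language on the html tag) are easy to forget on page number 57, so they're worth automating. Here's a minimal sketch using Python's standard html.parser. The class and function names are my own, and it only understands the http-equiv style of meta tag shown earlier.

```python
from html.parser import HTMLParser

class HeadCheck(HTMLParser):
    """Records the html tag's language and the first tag seen inside <head>."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.in_head = False
        self.first_head_tag = None  # (tag name, attribute dict)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.lang = attrs.get("lang") or attrs.get("xml:lang")
        elif tag == "head":
            self.in_head = True
        elif self.in_head and self.first_head_tag is None:
            self.first_head_tag = (tag, attrs)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

def check_page(html_text):
    """Report the declared language and whether a UTF-8 meta tag leads the head."""
    parser = HeadCheck()
    parser.feed(html_text)
    tag, attrs = parser.first_head_tag or (None, {})
    content = (attrs.get("content") or "").lower().replace(" ", "")
    return {
        "lang": parser.lang,
        "charset_first": tag == "meta" and "charset=utf-8" in content,
    }

page = (
    '<html lang="en-GB"><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    "<title>Test</title></head><body></body></html>"
)
print(check_page(page))
```

Run something like this over your rendered templates as part of a build or smoke test, and a page that puts the title before the charset declaration shows up straight away.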
There are many many more things to think about, and this list is by no means exhaustive, but it should be enough to give you a starting point. Lots of the issues that are thrown up by this tend to be general "take-a-step-back" kind of issues that make you question your workflows and assumptions, rather than your programming expertise - e.g. if we're letting people enter their own nickname, and then using that nickname as part of the URL, then what's going to happen if someone enters a nickname entirely in Russian? How are we going to handle generated emails to them? What about if they're text-only emails? (Hint: Content-type: text/plain; charset=utf-8 !)
But there are also deeper issues involved - if we start accepting and labelling content in different languages, what facilities do we need to provide to our users in order to filter out - or focus exclusively on - particular languages and character sets? Do we need to create a separate site for each language? Or do we accommodate all the content in one site, with filters?
Answers to those kinds of questions, I leave up to you :)
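For the Russian-nickname question, here's one possible answer sketched in Python, with the usual caveat that it's an illustration and not CFML. The nickname, the example.com addresses and the URL layout are all invented. It percent-encodes the nickname's UTF-8 bytes for the URL, and builds a text-only email that declares its charset, as per the hint above.

```python
from email.message import EmailMessage
from urllib.parse import quote, unquote

nickname = "\u0418\u0432\u0430\u043d"  # the Russian name "Иван"

# Percent-encode the UTF-8 bytes of the nickname so it is safe in a URL path.
profile_url = "https://example.com/users/" + quote(nickname)

# Build a text-only email whose body is explicitly declared as UTF-8.
message = EmailMessage()
message["Subject"] = "Welcome, " + nickname
message["From"] = "noreply@example.com"
message["To"] = "user@example.com"
message.set_content(
    "Hello " + nickname + "!\nYour profile: " + profile_url + "\n",
    charset="utf-8",
)

print(profile_url)
print(unquote(profile_url))
print(message["Content-Type"])
```

The encoded URL is ugly but unambiguous, and decodes back to the original nickname. Whether you show users the encoded form, or generate a separate ASCII "slug" for URLs, is one of those take-a-step-back decisions.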