Instant Badger: 11/01/2006

Monday, November 20, 2006

Integrating Applets With AJAX

For our new SONAR product, as demo'ed in the Enron Explorer, we needed a way of visualising the social network of up to 80,000 people. Our first thought was, of course, Flash, but we found that it just wasn't up to the job* of visualising large numbers of nodes. If only there were a way of combining the scalability of a Java applet with a slick, whizz-bang AJAX interface... well, as luck would have it, there is!

There are certain problems you have to get round, particularly the applet showing through anything that you place over it, and the browser reloading the VM if you hide the applet in any way, but these are fairly straightforward to solve once you've figured out how (hint: try moving it off screen!) - m'colleague Jan Berkel blogs about the technique in more detail here.

It would be overstepping the mark to suggest that we invented the technique, as speculative articles have been written on this subject before, but the feedback we've been getting from the Enron Explorer shows that a lot of people are taken most of all by the interface, and as far as we know we're the first to use it in a production application.

All we need now for the meme to take hold, of course, is a snappy acronym....
- APAX?
- APJAX?
- JAPAX?

Hmmm..... there HAS to be an amusing acronym we can tease out of this - all suggestions gratefully received!

* OK, I'm fairly sure that, given enough time, we could probably have found a way of getting round the scalability issues associated with doing it in Flash, but as Jan mentions in the article, we had a large amount of pre-written Java code that it would have been a shame to waste, and what you also get with Java is a vast library of free code and APIs out there, plus complete flexibility over how you use them.

Monday, November 13, 2006

10 Things to Check for Supporting International Characters In Your Web App

These days, more and more of our web apps have to be ready and able to support international characters. It's a non-trivial problem, and one that causes many furrowed brows, because usually by the time you notice it, you've already screwed up some data. I've dealt with it many times, and found that it's generally much easier to prepare for the problem before it occurs, rather than hack a solution together after the fact. This is not a post about i18n-ing your display templates - that's a whole topic in itself, even though some standard mechanisms are pretty well-defined by now. It's about the issues involved in storing and displaying content with non-English characters.

A full discussion of character encodings, and the headaches thereof, is WAAAY beyond the scope of this post. It could fill a pretty weighty multi-volume book all by itself. So I'll just refer you to Joel Spolsky's article on the topic - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - and say that the character set that's most commonly used for international characters is UTF-8.

So here's a quickie list of things to check for and bear in mind:

String Processing Issues

Use Unicode strings
Internally, CFMX is entirely based on Java, and so it "should" use Unicode strings by default. The main things that you need to worry about are when data goes in - into the database - and when it comes out - gets presented to the user. Of these two, the most important is the first - so long as data is being stored correctly, you'll always be able to get it out again. If it's being stored incorrectly, you're a bit screwed :)
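To make the "going in" and "coming out" point concrete, here's a minimal sketch in Python (my choice for illustration only - nothing to do with CFMX itself). The sample string is made up. It shows the difference between a mistake on the way out, which is usually reversible, and a mistake on the way in, which is not.

```python
# Round-trip a string through bytes, with the right and the wrong charset.
original = "café"

utf8_bytes = original.encode("utf-8")   # what should go into storage

good = utf8_bytes.decode("utf-8")       # read back with the matching charset
bad = utf8_bytes.decode("latin-1")      # read back with the wrong charset

print(good)  # café
print(bad)   # cafÃ©

# The garbled version is still recoverable, because the stored bytes are intact.
recovered = bad.encode("latin-1").decode("utf-8")

# Forcing the text into a charset that cannot hold it destroys the data:
# once the "?" is in the database, the original character is gone for good.
lossy = original.encode("ascii", errors="replace")
print(lossy)  # b'caf?'
```

If the bytes in storage are right, a wrong decode on the way out can be fixed later. If the bytes themselves were mangled on the way in, there is nothing left to fix.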

If you're using any Regular Expressions for processing or validating strings, BE CAREFUL!
It's very common to use expressions such as [a-zA-Z] to check for letters, or [a-zA-Z0-9] to check for alphanumerics. What do you think is going to happen if you pass an accented character such as à or é through this regex? Yup - é is NOT between a and z, so it will not match. How best to handle this is up to you - the POSIX regex elements such as [[:alpha:]] are good enough for some situations, but not others. For instance, POSIX does not allow more than 20 characters to be categorized as digits, whereas there are many more than 20 digit characters in Unicode. There's more detail on unicode.org.
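Here's a quick sketch of the problem in Python, assuming Python 3, where string patterns are Unicode-aware by default. The helper names are made up for this example, and the second pattern is only an approximation of "letters" (it is the word-character class minus digits and underscore), so treat it as a starting point rather than a rule.

```python
import re

# The classic ASCII-only check.
ASCII_LETTERS = re.compile(r"[a-zA-Z]+")

# Unicode-aware approximation: any word character that is not a digit
# or an underscore. In Python 3, \W and \d are Unicode-aware for str patterns.
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")


def is_ascii_word(text):
    """True if text consists only of the letters a-z and A-Z."""
    return ASCII_LETTERS.fullmatch(text) is not None


def is_unicode_word(text):
    """True if text consists only of (approximately) Unicode letters."""
    return UNICODE_LETTERS.fullmatch(text) is not None


for word in ["cafe", "café", "Привет", "abc123"]:
    print(word, is_ascii_word(word), is_unicode_word(word))
```

The ASCII pattern rejects "café" and "Привет" outright, while the Unicode-aware one accepts both and still rejects "abc123". One thing to watch: this assumes the accented characters arrive in precomposed form. A letter followed by a separate combining accent will not match.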

Data Storage

Store everything in UTF-8
Even if you're designing an app that "won't ever need anything except English", a little thought and effort at the design stage will mean that next time you have to write a similar app that does need international characters, you can re-use the code. Besides, it's just good practice, and the extra storage overhead of storing up-to-4-bytes-per-character is relatively easy to deal with in these days of 750GB disks for £250.
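To put some numbers on that "up-to-4-bytes-per-character" overhead, here's a small Python sketch. The helper name and the sample characters are mine, picked to hit each of the four possible lengths.

```python
def utf8_len(text):
    """Number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


# One sample character for each possible UTF-8 length.
samples = ["a", "é", "€", "😀"]

for ch in samples:
    print(repr(ch), utf8_len(ch))

# Plain English text costs nothing extra: one byte per character.
print(utf8_len("hello"))
```

So an all-English app pays no storage penalty at all for being UTF-8 - the extra bytes only appear when the extra characters do.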

Check the collation on your databases, tables, and columns
Collations can be set at all levels of specificity, from the server right down to individual columns. Make sure that they're ALL UTF-8. The easiest way to do this is to generate a CREATE... script for your database (SQL Server) or a mysqldump file (MySQL), open it up in a text editor and search for COLLATE.
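If the dump is too big to eyeball, the search can be scripted. This is a rough sketch only, assuming a MySQL-style dump where UTF-8 character sets and collations have names starting with "utf8". The function name, the pattern, and the sample table are all made up for illustration, and SQL Server collation names follow a different scheme that this does not cover.

```python
import re

# Matches clauses like "COLLATE utf8_general_ci", "CHARSET=latin1"
# and "CHARACTER SET latin1" (assumed MySQL-style syntax).
CLAUSE = re.compile(
    r"\b(COLLATE|CHARSET|CHARACTER SET)\s*=?\s*([A-Za-z0-9_]+)",
    re.IGNORECASE,
)


def non_utf8_clauses(sql_text):
    """Return (line number, clause, name) for every non-utf8 setting found."""
    findings = []
    for lineno, line in enumerate(sql_text.splitlines(), start=1):
        for match in CLAUSE.finditer(line):
            name = match.group(2)
            if not name.lower().startswith("utf8"):
                findings.append((lineno, match.group(1).upper(), name))
    return findings


# Hypothetical dump fragment, for demonstration only.
sample_dump = """CREATE TABLE users (
  nickname varchar(50) CHARACTER SET latin1 COLLATE latin1_swedish_ci,
  bio text COLLATE utf8_general_ci
) ENGINE=MyISAM DEFAULT CHARSET=utf8;"""

for finding in non_utf8_clauses(sample_dump):
    print(finding)
```

In the sample, only the nickname column on line 2 gets flagged - exactly the kind of stray column-level setting that is easy to miss when the table default looks fine.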

In SQL Server, make sure that any character-based field that can be populated from user-entered data in any way, is specified as a Unicode field
i.e. the type starts with an N. varchar fields should be nvarchar. text fields should be ntext.

Plan ahead!
Plan to cope with multiple character sets up front, and you know your database can handle just about anything you're likely to throw at it. Sweeping it under the carpet and saying "we'll worry about that when it happens" is likely to end up with you taking your app offline while you rebuild tables and text indices. This can easily take several hours, for anything above a couple of thousand rows. Long downtimes make unhappy users.

Presentation

Explicitly declare the UTF-8 character set on every page
Make sure that every page has a meta tag like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

...and also make sure that it's the FIRST tag in the Head of your doc, because as soon as a browser encounters a charset declaration, it starts re-parsing from the top.

You "should" specify the primary language in the HTML tag
In ordinary HTML, this is just:
<html lang="en-GB">

In XHTML 1.0, it's slightly different:
<html lang="en-GB" xml:lang="en-GB" xmlns="http://www.w3.org/1999/xhtml">

And in XHTML 1.1, you don't need the lang attribute:
<html xml:lang="en-GB" xmlns="http://www.w3.org/1999/xhtml">

This makes it much easier for text readers, translation software, and - crucially - search engines to recognise the language and take appropriate action. More details at W3.org

For multi-language markup, you can provide a comma-separated list of languages
You can (and should) also specify the language on any element within the page that is in a different language to the primary language. If your document structure doesn't break down to a logical tag that encompasses the different language part, then use a span tag:
<span xml:lang="fr-CA" >....</span>

Language is a CSS pseudo-class
This means that you can specify different styling for different languages, like so:
/* smaller font for documents in German */
HTML:lang(de){ font-size:90%; }
/* italicise any bits of French in any document */
:lang(fr){ font-style : italic; }
/* change the quotation marks for any Q tag INSIDE a French element */
:lang(fr) > Q { quotes: '« ' ' »' }

More details at W3.org

There are many many more things to think about, and this list is by no means exhaustive, but it should be enough to give you a starting point. Lots of the issues that are thrown up by this tend to be general "take-a-step-back" kind of issues that make you question your workflows and assumptions, rather than your programming expertise - e.g. if we're letting people enter their own nickname, and then using that nickname as part of the url, then what's going to happen if someone enters a nickname entirely in Russian? How are we going to handle generated emails to them? What about if they're text-only emails? (Hint: Content-type: text/plain; charset=utf-8 !)
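As a taster for the nickname questions, here's a minimal Python sketch using only the standard library. The nickname, URL path and email address are made-up examples, and this is just one way of doing it, not a recommendation for your particular stack.

```python
from email.message import EmailMessage
from urllib.parse import quote, unquote

nickname = "Привет"  # hypothetical nickname entered entirely in Russian

# In a URL: percent-encode the UTF-8 bytes of the nickname.
url_path = "/users/" + quote(nickname)
print(url_path)

# ...and decode it again when the request comes back in.
print(unquote(url_path))

# In a text-only email: set_content() on a string produces a text/plain
# part, with the charset recorded in the Content-Type header.
msg = EmailMessage()
msg["To"] = "user@example.com"
msg["Subject"] = "Welcome"
msg.set_content("Hello " + nickname + "!")

print(msg.get_content_type())     # text/plain
print(msg.get_content_charset())  # utf-8
```

The URL ends up as pure ASCII on the wire, and the email declares its charset explicitly, so neither relies on the receiving end guessing correctly.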

But there are also deeper issues involved - if we start accepting and labelling content in different languages, what facilities do we need to provide to our users in order to filter out - or focus exclusively on - particular languages and character sets? Do we need to create a separate site for each language? Or do we accommodate all the content in one site, with filters?

Answers to those kinds of questions, I leave up to you :)

Tuesday, November 07, 2006

Alas, poor Smartgroups, I knew it well....

It's not without a tinge of sadness and a wistful sigh that I noticed that, in possibly the least-surprising announcement I've seen for a long time, Orange is to finally pull the plug on Smartgroups.

Smartgroups was, in its day, the Daddy of first-generation community applications, and I worked on it for just over two years just after it got bought out by Freeserve, who became wholly-owned by Wanadoo, then France Telecom, then finally Orange UK.

(Remember those "Ten signs you are in a dotcom company" emails that went round about the turn of the millennium? One of them particularly stuck in my mind - "You've sat at the same desk for two years, and worked for four different companies")

I learnt a lot about managing large-scale systems from that job - SG handled more than 50 million emails per month - and a good percentage of the stories I'll come out with after several late-night-beers-with-other-techies hail from that time. I also learned a lot about human nature, and the myriad ways in which people will never fail to surprise you, even when you think you've seen it all. And I'm not just talking about the users there...

But ultimately, it was a failure to evolve and keep up with the competition (e.g. MySpace) that gradually put the nails in its coffin, and after about four years of knowing full well that its days were numbered, SG has finally been taken out the back and given a nice sunny wall to stand against, and a roll-up to smoke. And a blindfold.

Farewell Smutgropes - you will live on in the memories of all those who worked on you. (Despite some quite determined scrubbing with Mind Bleach in some cases) And if anyone manages to track down the new homes of the Stereo Stimming group, the Brent Spiner Data Lovers group, or my own personal favourite - Hairy Bearded Scotsmen In Kilts ("ONLY pictures of hairy bearded scotsmen in kilts will be accepted - any pictures of hairy bearded men in kilts who are NOT Scottish will be deleted...") then be sure to let me know. Such gems of the longest of long tails are surely too fine to be lost forever :)
