Monday, November 19, 2007

Beware Case-Insensitive Comparisons on Oracle!

Let's say you have a large database with lots of People in it. Each Person can have many EmailAddresses. Let's say that your code needs to be portable across multiple DB systems (specifically, MySQL and Oracle).

Let's also say that you process up to a million emails per day through your system, and for each email you check the sender and recipient addresses against your big DB. So that's potentially an awful lot of reads, and a potential performance bottleneck.

"No Problem!" you say, "I'll just stick an index on the EmailAddress.address field, and I'm sorted"

...but that's not quite it. EmailAddresses are case-insensitive, so that FOO@BAR.COM is the same as foo@bar.com. MySQL does a case-insensitive comparison by default, but Oracle doesn't.

So how do you get Oracle to do a case-insensitive comparison? Well, to cut a long story short, there are two parameters you can set (so long as you're on at least 10gR2):
alter session set NLS_COMP=LINGUISTIC;
alter session set NLS_SORT=BINARY_CI;


Setting NLS_COMP to LINGUISTIC tells Oracle to perform strict case insensitivity operations, and NLS_SORT governs the sort order. The _CI suffix tells it to be case-insensitive with respect to sorting.

Lovely jubbly, smashin' sorted and great - well, actually, no. Not sorted and great at all, because there's a lovely implementation gem there - "Setting NLS_SORT to anything other than BINARY causes a sort to use a full table scan, regardless of the path chosen by the optimizer.".

So Oracle's implementation of a case-insensitive comparison is, basically, "i'm going to go through every single row in the table one-by-one, checking to see if it matches".

As Mel Smith used to say - "Aw, it's marvellous, innit?"

So what can you do? Obviously a full table scan is BAAAAD, so we need to find a way to make it use an index. There are two things you can do**:

1) Add a explicit lower-case-address field, that duplicates the address field and explicitly lower-cases the address before writing. You can then add an index on this field, and check against that. This means you duplicate info (which is bad) and use more storage than you need (which is also bad) but it's totally portable (which is good)

2) Add a functional index to the address field:
CREATE INDEX ix_my_index_name ON email_addresses( LOWER(address) )
This means you gain the vastly superior performance of an index range scan as opposed to a full table scan, but guess what - MySQL doesn't have functional indexes yet.

GRAAAAAAAAAAAAAAAAAAAARGH!


**OK, so there is a third and possibly fourth option in this particular case, to just lower-case the address field anyway, but the point of this post was meant to be general and it does have a downside too.

You potentially lose capitalisation of the personal part of an email address. For instance:
"Alistair Davidson" <apdavidson@some-domain-im-not-going-to-tell-you.com>
would get flattened to "alistair davidson" <apdavidson@some-domain-im-not-going-to-tell-you.com>

So you want to try to keep the capitalisation if possible.

Probably the best thing to do here is actually to split the personal part from the address, and have the personal part stored as-is, with the address part lower-cased.