First, some context. (Yeah, yeah, skip the context and go to the question.) I'm writing the DAOs for the system as a layer of abstract interfaces, with a default implementation based on Hibernate, and using the everything-is-an-item pattern that I've blogged about before.
Aside: This pattern, in conjunction with Java Generics - a Java 5 mechanism that's kind of like C++'s templating system, but without the horrors that infest the Standard Template Library (STL) - has led to a really nice way of getting lots of basic operations (e.g. CRUD) "for free", which deserves a post of its own and I'll blog about it next time I get the chance.
In this design, for every type of item, there's a corresponding DAO. In the DAO, there's a save() method which either adds the passed object to the DB, or updates it if it already exists.
This save() method is also responsible for throwing an exception if the given object can't be saved because saving it would break the application's business rules for uniqueness. So each implementation of the save() method calls an isDuplicate() method, which is defined by default on the abstract ItemDAO, and can be overridden as appropriate on the subclass DAOs. For instance, it's not acceptable to have two Email records with the same messageID, but it's perfectly fine to have two Person records called John Smith - and this is where the interesting question arises...
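To make the save()/isDuplicate() arrangement concrete, here's a rough sketch of how it might hang together. It's not the real code: the Item, Email, Person and DuplicateItemException classes are made up for illustration, an in-memory list stands in for the Hibernate-backed store, and for brevity save() lives once on the abstract ItemDAO rather than in each implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical item types, just enough to show the uniqueness rules.
abstract class Item {
    private final String title;
    Item(String title) { this.title = title; }
    String getTitle() { return title; }
}

class Email extends Item {
    private final String messageId;
    Email(String title, String messageId) { super(title); this.messageId = messageId; }
    String getMessageId() { return messageId; }
}

class Person extends Item {
    Person(String name) { super(name); }
}

// Hypothetical exception thrown when a uniqueness rule would be broken.
class DuplicateItemException extends RuntimeException {
    DuplicateItemException(String message) { super(message); }
}

abstract class ItemDAO<T extends Item> {
    // Assumption: an in-memory list stands in for the database.
    protected final List<T> store = new ArrayList<>();

    // Default rule: nothing counts as a duplicate. Subclasses override as needed.
    protected boolean isDuplicate(T candidate) { return false; }

    public void save(T item) {
        if (store.contains(item)) {
            return; // already stored: a real DAO would issue an update here
        }
        if (isDuplicate(item)) {
            throw new DuplicateItemException("Duplicate item: " + item.getTitle());
        }
        store.add(item);
    }

    public int count() { return store.size(); }
}

class EmailDAO extends ItemDAO<Email> {
    // Two Email records may not share a messageID.
    @Override
    protected boolean isDuplicate(Email candidate) {
        for (Email existing : store) {
            if (existing.getMessageId().equals(candidate.getMessageId())) {
                return true;
            }
        }
        return false;
    }
}

// Person keeps the default rule, so two John Smiths are fine.
class PersonDAO extends ItemDAO<Person> { }
```

With this in place, saving a second Email with an existing messageID throws, while saving two Person objects called John Smith goes through without complaint.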
In our model, a Theme is also an Item. A Theme, in this case, being a word or phrase that has been extracted from the content of an Item due to it being "potentially interesting". There's a whole load of extreme Lisp cleverness being worked on by M'Colleague Craig McMillan, he of the prodigious beard and piratical proclivities, regarding how you determine a word or group of words is potentially interesting, but in the words of Frank Drebin, that's not important right now.
The question is, what makes a Theme unique? On a superficial level, you can say that a Theme is a group of characters that make up words, so that group of characters must be unique. In other words, no two Theme records may exist with the same String in the "Title" field. But if you think about it a bit more, that might not actually be the case.
For instance, homonymy (words that share a pronunciation and spelling but have different meanings - e.g. "bat" the animal and "bat" in cricket) and polysemy (the capacity of a word or phrase to have multiple related meanings that derive from the same etymology - e.g. "bank on it", with bank meaning to rely upon something, which derives from the reputation of bank-the-financial-institution for reliability) are perennial problems for Natural Language Processing. Should we take that into account in the model? Can we?
In the above cases, I can probably take the reasonable position that because the words are spelled the same, I can treat them as one and the same Theme: if two emails refer frequently to "bush", they should be pointing to the same DB record for the "bush" theme, regardless of whether they were talking about a US President, a shrub, or the Australian outback - and homonymy and polysemy can just be swept under the carpet. (Hmm, starting to get a lot of bulges under this carpet in the office here...where's my hammer?)
However, when you start to think about the possibility of the system dealing with multiple locales and even multiple languages, which may well be the case in large multi-national corporates, another, related, linguistic term starts to rear its ugly head - the Heterologue.
A Heterologue is a word that occurs in multiple languages, possibly with completely different meanings. For instance, the syllable "bat" in Cantonese means the number eight (at least, when pronounced with a high inflection), and having seen the way the lovely Lisa flits between English and Cantonese sometimes several times a sentence, even in emails where she types out the Cantonese word phonetically with English letters, this may well occur.
It's not just languages that show heterologues either - the same word or phrase in the same language in a different locale can have completely different meanings - the web is strewn with plenty of examples of British / American English ambiguities, the classic one being at Rocom when my newly-immigrated American colleague Stuart asked what Martin and I were doing for lunch, and I replied that we were driving into town so that Martin could pick up some fags, and did he want to come too?
So it boils down to this - should I include the locale of a Theme in the check for uniqueness along with the title (i.e. the actual string) or not? Is a given string of characters unique just within a particular locale, or globally? I'm tempted to say globally, but I have a nagging feeling that at some point in the not too distant future, that choice may turn round and bite me.
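To see what the two options would mean in the isDuplicate() check, here's a small sketch. The ThemeKey class and its factory methods are hypothetical, invented purely to illustrate the choice: a key built with global() ignores locale, while one built with perLocale() treats the same title in two locales as two different Themes.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical uniqueness key for a Theme. Not part of the real model.
final class ThemeKey {
    private final String title;
    private final Locale locale; // null means "unique by title alone"

    private ThemeKey(String title, Locale locale) {
        this.title = Objects.requireNonNull(title);
        this.locale = locale;
    }

    // Option 1: a given string of characters is unique globally.
    static ThemeKey global(String title) {
        return new ThemeKey(title, null);
    }

    // Option 2: a given string of characters is unique only within a locale.
    static ThemeKey perLocale(String title, Locale locale) {
        return new ThemeKey(title, Objects.requireNonNull(locale));
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof ThemeKey)) return false;
        ThemeKey that = (ThemeKey) other;
        return title.equals(that.title) && Objects.equals(locale, that.locale);
    }

    @Override
    public int hashCode() {
        return Objects.hash(title, locale);
    }
}
```

Under the global option, "bat" from an English email and "bat" from a Cantonese one collapse into a single key. Under the per-locale option, ThemeKey.perLocale("bat", Locale.UK) and ThemeKey.perLocale("bat", Locale.forLanguageTag("zh-HK")) stay distinct. Hiding the decision behind one key type like this would at least mean that changing my mind later touches one class rather than every DAO.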