Friday, December 08, 2006

Friday Brainf**k : How Unique Is A Phrase?

Eep - posts have been a bit thin on the ground recently, due to big crunch time on SONAR, but here's an interesting question that's just cropped up. I'm not sure of the answer, and blogging the question might just help me get my own thoughts on it straight....

First, some context. (Yeah, yeah, skip the context and go to the question) I'm writing the DAOs for the system as a layer of abstract interfaces, with a default implementation based on Hibernate, and using the everything-is-an-item pattern that I've blogged about before.

Aside: This pattern, in conjunction with Java Generics - a Java 5 mechanism that's kind of like C++'s templating system, but without the horrors that infest the Standard Template Library (STL) - has led to a really nice way of getting lots of basic operations (e.g. CRUD) "for free", which deserves a post of its own and I'll blog about it next time I get the chance.

In this design, for every type of item, there's a corresponding DAO. In the DAO, there's a save() method which either adds the passed object to the DB, or updates it if it already exists.

This save() method is also responsible for throwing an exception if the given object can't be saved, as it would break the application's business rules for uniqueness. So each implementation of the save() method calls an isDuplicate() method, which is defined by default on the abstract ItemDAO, and can be overridden as appropriate on the subclass DAOs. For instance, it's not acceptable to have two Email records with the same messageID, but it's perfectly fine to have two Person records called John Smith - and this is where the interesting question arises...
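To make the shape of that concrete, here's a minimal sketch of the save() / isDuplicate() arrangement. It is not the real SONAR code: the in-memory list stands in for Hibernate, and the class names Item, Person, Email, PersonDAO, EmailDAO and DuplicateItemException, along with the "never a duplicate" default, are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal item types; the real model is not shown in this post.
abstract class Item {
    final String title;
    Item(String title) { this.title = title; }
}

class Person extends Item {
    Person(String name) { super(name); }
}

class Email extends Item {
    final String messageId;
    Email(String title, String messageId) {
        super(title);
        this.messageId = messageId;
    }
}

// Assumed exception type for a broken uniqueness rule.
class DuplicateItemException extends RuntimeException {
    DuplicateItemException(String message) { super(message); }
}

// In-memory stand-in for the Hibernate-backed default implementation.
abstract class ItemDAO<T extends Item> {
    protected final List<T> store = new ArrayList<>();

    // Assumed default rule: nothing counts as a duplicate.
    protected boolean isDuplicate(T candidate) {
        return false;
    }

    public void save(T item) {
        if (store.contains(item)) {
            return; // already stored: a real DAO would issue an update here
        }
        if (isDuplicate(item)) {
            throw new DuplicateItemException("Duplicate item: " + item.title);
        }
        store.add(item);
    }

    public int count() {
        return store.size();
    }
}

// Two people called John Smith are fine, so the default rule is kept.
class PersonDAO extends ItemDAO<Person> { }

// Two emails with the same messageID are not, so the rule is overridden.
class EmailDAO extends ItemDAO<Email> {
    @Override
    protected boolean isDuplicate(Email candidate) {
        for (Email existing : store) {
            if (existing.messageId.equals(candidate.messageId)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, saving two different Person objects named "John Smith" stores both, while saving a second Email with an already-used messageId throws the exception.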

In our model, a Theme is also an Item. A Theme, in this case, being a word or phrase that has been extracted from the content of an Item due to it being "potentially interesting". There's a whole load of extreme Lisp cleverness being worked on by M'Colleague Craig McMillan, he of the prodigious beard and piratical proclivities, regarding how you determine a word or group of words is potentially interesting, but in the words of Frank Drebin, that's not important right now.

The question is, what makes a Theme unique? On a superficial level, you can say that a Theme is a group of characters that make up words, so that group of characters must be unique. In other words, no two Theme records must exist with the same String in the "Title" field. But if you think about it a bit more, that might not actually be the case.

For instance, homonymy (words that share the same pronunciation and spelling but have different meanings - e.g. "bat" the animal and "bat" in cricket) and polysemy (a word or phrase having multiple, related meanings that derive from the same etymology - e.g. "bank on it" with bank meaning to rely upon something, which derives from the reputation of bank-the-financial-institution for reliability) are perennial problems for Natural Language Processing. Should we take that into account in the model? Can we?

In the above cases, I can probably take the reasonable position that because the words are spelled the same, I can assume that they are actually related in this sense, and if two emails refer frequently to "bush", they should be pointing to the same DB record for the "bush" theme, regardless of whether they were talking about a US President, a shrub, or the Australian outback - and homonymy and polysemy can just be swept under the carpet. (Hmm, starting to get a lot of bulges under this carpet in the office here...where's my hammer?)

However, when you start to think about the possibility of the system dealing with multiple locales and even multiple languages, which may well be the case in large multi-national corporates, another, related, linguistic term starts to rear its ugly head - the Heterologue.

A Heterologue is a word that occurs in multiple languages, possibly with completely different meanings. For instance, the syllable "bat" in Cantonese means the number eight (at least, when pronounced with a high inflection), and having seen the way the lovely Lisa flits between English and Cantonese sometimes several times a sentence, even in emails where she types out the Cantonese word phonetically with English letters, this may well occur.

It's not just languages that show heterologues either - the same word or phrase in the same language in a different locale can have completely different meanings - the web is strewn with plenty of examples of British / American English ambiguities, the classic one being at Rocom when my newly-immigrated American colleague Stuart asked what Martin and I were doing for lunch, and I replied that we were driving into town so that Martin could pick up some fags, and did he want to come too?

So it boils down to this - should I include the locale of a Theme in the check for uniqueness along with the title (i.e. the actual string) or not? Is a given string of characters unique just within a particular locale, or globally? I'm tempted to say globally, but I have a nagging feeling that at some point in the not too distant future, that choice may turn round and bite me.
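The two options really just differ in what goes into the uniqueness key. Here's a rough sketch of that difference, with the caveat that Theme here is a cut-down, hypothetical class with only a title and a locale, and ThemeUniqueness and its methods are names invented for the example rather than anything in the real system. It also compares titles exactly as given, with no case or whitespace normalisation.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical Theme carrying only the two fields discussed here.
final class Theme {
    final String title;
    final Locale locale;

    Theme(String title, Locale locale) {
        this.title = title;
        this.locale = locale;
    }
}

final class ThemeUniqueness {
    // Option 1: the string alone identifies the theme, whatever its locale.
    static String globalKey(Theme theme) {
        return theme.title;
    }

    // Option 2: the same string may exist once per locale.
    static String perLocaleKey(Theme theme) {
        return theme.locale.toLanguageTag() + "|" + theme.title;
    }

    // Counts how many distinct Theme records would exist under each policy.
    static int distinctCount(Iterable<Theme> themes, boolean includeLocale) {
        Set<String> keys = new HashSet<>();
        for (Theme theme : themes) {
            keys.add(includeLocale ? perLocaleKey(theme) : globalKey(theme));
        }
        return keys.size();
    }
}
```

Given "bush" extracted from both a US English and an Australian English source, the global key collapses them into one record and the per-locale key keeps two, which is exactly the choice being weighed above.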


Anonymous said...

Two words: Semantic Web.

Alistair Davidson said...

The Semantic Web would make my life much easier in some respects, but unfortunately
a) it's not here yet
b) it has a few inherent problems (e.g. trustworthiness) even at this conceptual stage
c) it's only useful in this context if EVERYTHING is semantically marked-up.

The Natural Language Processing in SONAR is all about extracting potentially interesting words and phrases from arbitrary sources - emails, documents in a document management store or attached to emails, RSS feeds and even Instant Messenger conversations - in essence, it's almost an automatic semantic-marker-upper.

I know that the fundamental issues of how the information is stored are going to have deep knock-on consequences further down the line, and that's what's turning my brain into oozing melted cheese at the moment - trying to think through what these consequences will be....

Barney said...

I certainly can't say that I understand all aspects of what you're doing, but it seems to me that the better way would be to have locale-neutral themes and then when viewing a theme (in whatever manifestation that takes) you have the ability to constrain the results either by allied themes or by the locale of the resources tagged with the theme.

A view of the "bush" theme would render everything with the string "bush" in it, but then you can say you only want US resources or Australian resources. Or combine two themes: I want "bush" and "president" or perhaps "bush" and not "australia".

Alistair Davidson said...

Hi Barney,

Thanks for the suggestion. I'm still going to stick with locales on themes for now, though, for a number of reasons:

1) it allows simpler filtering of available themes - e.g. most people's default language will be English, and there's no reason for them to be shown a theme they may not understand (e.g. "烤肉") unless they explicitly choose to

2) we're not just extracting single-word themes (i.e. tags) but multi-word themes - "digrams" and "trigrams" in NLP-speak - which vastly reduces the scope for contextual ambiguities (at the cost of vastly increasing the complexity of the algorithms)

but most importantly of all -

3) it's too late to change that and still meet our deadlines! :)

Basically, Themes are a subclass of Items so that they can be subscribed to using the same mechanism that's used to subscribe to Groups and People. All Items have a locale, so that they can be filtered. It's just a question of whether two locales can have the same string of characters as a Theme or not, and at the moment I can semi-convince myself either way.

Good luck with the job move!

PS - I miss CFQUERY too, but not as much as CFOUTPUT

iRocket said...

ok..not that it means a whole lot, but have you ever asked yourself "Am I over-thinking this?" I like the KISS rule...Keep it Simple Seymore!

It may be just me, but it sounds like you are trying to find a way to inflate a tire with a nuclear-powered, triple redundant, auto-extinguishing, self-propelled widget.

Alistair Davidson said...

Hi Scott,

"KISS" - Wise words indeed, and ones that I try to follow whenever I can. Unfortunately, this is where the "core cleverness" of our system will be built, and we have to get it right.

In the end, with KISS in mind, I went for just basing the uniqueness check on the theme title itself - a given string of characters is unique, whatever language and locale it occurs in.