Wednesday, December 20, 2006

(NSFW) Taking "User Generated Content" a bit too far

The Register is reporting on a heroic altruist's solution to a sticky situation regarding copyright-infringing contributions to Wikipedia:

Wikipedia semen shortage filled by User Generated Content

....well, it's certainly very public-spirited of the honourable member.

Wednesday, December 13, 2006

Going loco with Mo'MoSoSo

Nice to see that m'former'colleague Rik Abel is valiantly continuing his one-man crusade to get the gloriously-silly acronym MoSoSo more widely adopted, whilst simultaneously pointing out its majestic silliness in an ironic, post-modernist, self-referential way. Or something....

Don't get me wrong, I think Mobile Social Software could well be the killer app of Web 2-point-whatever-we're-up-to-now, but seriously folks - MoSoSo? That's Well Jackson!*

Make sure you don't target your MoSoSo applications at the Small office / Home Office market, 'cause that would be SoHoMoSoSo.

And if your users wore black eyeshadow and jaggedy-cut hairstyles, would that make it SoHoEmoMoSoSo ?

And if it was in a Bohemian New Romantic style...? (BohoRomoSoHoEmoMoSoSo)

Yikes, this one could run...

* - look here if you don't get the reference

Tuesday, December 12, 2006

Trusting The Magic Pixies - Hibernate HQL

I'm having trouble trusting the Magic Pixies. I admit, it's a common theme of mine, and maybe it's just NIH syndrome-by-proxy, but Magic Pixies - able and willing though they are - can only do your job for you if you ask them in just the right way. And sometimes they seem like they're being deliberately dumb and obstinate little buggers that just won't do as they're told, dammit!

(ahem) Perhaps I should elaborate here...

The Magic Pixies in question here are the ones that do the "hard" work for you in Hibernate, the near-as-dammit de facto standard ORM system for Java. Or Persistence Management. Or whatever term you prefer. Pete Bell has been blogging in great detail about the process of writing his own cf-based ORM framework, so if you're not familiar with ORM systems then you're probably best going to check out his blog, because this is a pure and simple straightforward bit of spleen venting.

The basic idea of Hibernate is that you can write your object model just as you choose, and then map your objects to persistent entities using (surprise surprise) an XML config file - or, if you're using Java 5 and EJB 3 and Hibernate 3, then you can dispense with the XML config file (yay!) and use annotations instead (double yay!). So long as you construct your mappings correctly, then you "don't need to worry about" the SQL - the Magic Pixies of Hibernate will auto-generate your db schema and SQL queries, and magically do your CRUD for you with a sprinkling of their Magic Pixie Dust™.

"But what if I want to do something a bit more complex than basic CRUD?" I cried.

"Like what?" said the Magic Pixies

"Like a left outer join?" I replied

"Oh, you don't need to worry about all that nasty SQL" said the Magic Pixies, " because that would tie your code to your database system, and that's a BAD THING! BAD Al! BAAAAAD!"

"Gosh, sorry Magic Pixies," I said, rather sheepishly, "I promise to use that nice abstracted Criteria API that you so generously provided in future"

"And so you should!" said the Magic Pixies, "Remember, you do the code, we do the data, otherwise we'll have harsh words with our union rep, OK?"

"Ok! Ok!" I said. "Now could you stop hitting me with that rolled-up newspaper please?"

"So long as you promise to be good"

"I do! I do!"

"Ok then"

"But what if I want to do something more complex than that?" I asked.

The Magic Pixies looked a bit puzzled.

"What on Earth could you want to do that's more complex than that?" they replied.

"Well, what if I wanted to find a set of objects of type X that didn't have any corresponding objects of type Y that match certain criteria?"

"Pfffft!" said the Magic Pixies, " that's easy, you just create Criteria along the association paths!"

"Er, huh?" I said.

"You just create a Criteria object for class X, and then create another Criteria object using that Criteria object by passing the name of the property of class X which refers to the encapsulated class Y contained within class X!"

"Huh?" I said.

"Or if you really want to, you can use HQL"


"Yes, Hibernate Query Language. It's almost like SQL, but not quite. Because we wouldn't want you using SQL - SQL's tied to database platforms, and that's BAAAAAD"

"So how do I do it in HQL?"

"You create a query that queries along the association paths and properties of the objects"

"Oh, right, OK" I said. "But what if class X doesn't have class Y as a property?"

"Er..... huh?" said the Magic Pixies.

"Class X has no property that refers to class Y"

"Well then, you won't need to query for it, will you?" said the Magic Pixies, a touch too smugly for my liking.

"But I do!" I insisted.

"Er.... huh?" said the Magic Pixies.

"Well, say if I had a table of ItemLinks...." I began.

"A WHAT of ItemLinks????"

"Sorrysorrysorry! I mean an ItemLink object..."

"That's better!"

"...that represented a weighted link between two Items, such as might be calculated by some very complicated fuzzy logic and Natural Language Processing"


"...and a separate LinkPreference object that represented a preference expressed by a Person as to whether their ItemLink to a particular object would be public or not"

"Erm... can you give us an example?"

"Sure - this clever NLP stuff might detect that Bob from SysAdmin has been talking a lot about clustering database servers, and he might want to share that link so that he is known as an expert in that field."

"OK, with you so far..."

"But it might detect that the boss has been talking with his secretary about a dirty weekend in Brighton, and they really wouldn't want that shared at all, would they?"

"Erm, isn't that an outmoded stereotype that just reinforces age-old gender-typecast notions of sycophantic star-crossed secretaries as prey for the equally-stereotypical notion of amoral boss-as-alpha-male-predator?"

"Alright, alright, but you get the idea!" (that Magic Pixie was really starting to get on my titty ends)

"Yes, I follow you"

"So what if I want to query for all ItemLinks that have been created in the last, say, two weeks, and that don't have a corresponding LinkPreference?"

"Well, you could query for ItemLinks that have LinkPreference set to null"

At this point I was starting to snort quite heavily.

"But I told you, ItemLink doesn't have a property that refers to LinkPreference! The two are completely independent!"

"Well then you shouldn't want to query for them" said the Magic Pixies

"But I DO!"

"Well, can't you follow the association path up from ItemLink to Item and then down to LinkPreference?"

"Well, yes, I could, but wouldn't that result in the Items table being read in the query when there's absolutely no need for it?"

The Magic Pixies looked down at their feet

"...might do..."

"And isn't that horribly inefficient?"

They started fiddling with their shorts

"...might be..."

"And it's not that simple anyway, because it's a compound join on TWO properties!"


"WHAT was that?"

"yes!" said the Magic Pixies, with bottom lip sticking out.

"So can you perform this raw SQL query in your own way?"

SELECT item_links.*
FROM item_links
LEFT OUTER JOIN link_prefs
ON item_links.item_id = link_prefs.owner_item_id
AND item_links.other_item_id = link_prefs.linked_item_id
WHERE item_links.created_at > ?
AND link_prefs.shared IS NULL

"...might do, if you ask us nicely..."

(sigh) "OK, can you pleeeeeeease do it?"

They conferred for a moment in hushed whispers, and then turned back with a very smug-looking smile, and said

"Yes, we can - but we're not going to"


"You have to ask us in the right way"

Steam was starting to emerge from my ears

"And what IS the right way to ask you?"

They grinned even wider

"We're not going to tell you!"

And I stormed out of the room.

You see, the trouble I have with ORM systems is that they're all well and good as far as they go, and yes, they can save large amounts of "donkey work". But sooner or later you nearly always come up against something that would be almost trivially easy to do with raw SQL, but that the nice insulated ORM abstraction just can't deal with. I know that I'm probably looking at this from the "wrong" direction, thinking about the data rather than the objects, but until the Magic Pixies start to play a bit more nicely, I'm always going to be a bit suspicious of them.
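For the record, here's what I was actually asking for, stripped of any database at all: a rough plain-Java sketch of "recent ItemLinks with no matching LinkPreference", matched on both item ids at once. The class and field names are made up for illustration (they just mirror the columns in the query I showed the Pixies), it assumes a LinkPreference row always has a non-null shared flag, and it says nothing about how Hibernate would or wouldn't do it.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a row of item_links.
class ItemLink {
    final long itemId;
    final long otherItemId;
    final Date createdAt;

    ItemLink(long itemId, long otherItemId, Date createdAt) {
        this.itemId = itemId;
        this.otherItemId = otherItemId;
        this.createdAt = createdAt;
    }
}

// Hypothetical stand-in for a row of link_prefs.
class LinkPreference {
    final long ownerItemId;
    final long linkedItemId;
    final boolean shared;

    LinkPreference(long ownerItemId, long linkedItemId, boolean shared) {
        this.ownerItemId = ownerItemId;
        this.linkedItemId = linkedItemId;
        this.shared = shared;
    }
}

class UnsharedLinkFinder {
    // Compound key on BOTH ids, in order, like the two-column join condition.
    private static String key(long first, long second) {
        return first + ":" + second;
    }

    // Returns the links created after 'since' that have no preference at all.
    static List<ItemLink> linksWithoutPreference(List<ItemLink> links,
                                                 List<LinkPreference> prefs,
                                                 Date since) {
        Set<String> prefKeys = new HashSet<String>();
        for (LinkPreference pref : prefs) {
            prefKeys.add(key(pref.ownerItemId, pref.linkedItemId));
        }
        List<ItemLink> result = new ArrayList<ItemLink>();
        for (ItemLink link : links) {
            boolean recent = link.createdAt.after(since);
            boolean hasPreference = prefKeys.contains(key(link.itemId, link.otherItemId));
            if (recent && !hasPreference) {
                result.add(link);
            }
        }
        return result;
    }
}
```

Note that the Items themselves never get touched, which was rather the point.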

(deep breaths..... calm.... happy thoughts..... nearly Christmas....)

Friday, December 08, 2006

Friday Brainf**k : How Unique Is A Phrase?

Eep - posts have been a bit thin on the ground recently, due to big crunch time on SONAR, but here's an interesting question that's just cropped up, and I'm not sure of the answer to, and blogging the question might just help me get my own thoughts on it straight....

First, some context. (Yeah, yeah, skip the context and go to the question) I'm writing the DAOs for the system as a layer of abstract interfaces, with a default implementation based on Hibernate, and using the everything-is-an-item pattern that I've blogged about before.

Aside: This pattern, in conjunction with Java Generics - a Java 5 mechanism that's kind of like C++'s templating system, but without the horrors that infest the STL (Standard Template Library) - has led to a really nice way of getting lots of basic operations (e.g. CRUD) "for free", which deserves a post of its own and I'll blog about it next time I get chance.

In this design, for every type of item, there's a corresponding DAO. In the DAO, there's a save() method which either adds the passed object to the DB, or updates it if it already exists.

This save() method is also responsible for throwing an exception if the given object can't be saved, as it would break the application's business rules for uniqueness. So each implementation of the save() method calls an isDuplicate() method, which is defined by default on the abstract ItemDAO, and can be overridden as appropriate on the subclass DAOs. For instance, it's not acceptable to have two Email records with the same messageID, but it's perfectly fine to have two Person records called John Smith - and this is where the interesting question arises...
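To make that a bit more concrete, here's a minimal sketch of the shape of it. It is not the real SONAR code: the in-memory list stands in for the Hibernate-backed implementation, and apart from save(), isDuplicate() and ItemDAO, every name here (DuplicateItemException, the id field, count()) is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base type: everything is an Item.
abstract class Item {
    Long id; // null until the item has been saved
}

class Email extends Item {
    final String messageId;
    Email(String messageId) { this.messageId = messageId; }
}

class Person extends Item {
    final String name;
    Person(String name) { this.name = name; }
}

// Hypothetical exception for a broken uniqueness rule.
class DuplicateItemException extends RuntimeException {
    DuplicateItemException(String message) { super(message); }
}

// In-memory stand-in for the default DAO implementation.
abstract class ItemDAO<T extends Item> {
    protected final List<T> store = new ArrayList<T>();
    private long nextId = 1;

    // Default rule: nothing counts as a duplicate.
    protected boolean isDuplicate(T candidate) {
        return false;
    }

    // Adds the item if it is new; an item that already has an id is an update.
    public void save(T item) {
        if (isDuplicate(item)) {
            throw new DuplicateItemException(
                "Breaks uniqueness rule: " + item.getClass().getSimpleName());
        }
        if (item.id == null) {
            item.id = nextId++;
            store.add(item);
        }
        // Otherwise the object is already in the store, so nothing more to do here.
    }

    public int count() {
        return store.size();
    }
}

// Two John Smiths are fine, so the default rule is kept.
class PersonDAO extends ItemDAO<Person> { }

class EmailDAO extends ItemDAO<Email> {
    @Override
    protected boolean isDuplicate(Email candidate) {
        for (Email existing : store) {
            // A different record with the same messageID breaks the rule.
            if (existing != candidate && existing.messageId.equals(candidate.messageId)) {
                return true;
            }
        }
        return false;
    }
}
```

The nice part is that each subclass DAO only has to say what "duplicate" means for its own type, and save() stays the same everywhere.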

In our model, a Theme is also an Item. A Theme, in this case, being a word or phrase that has been extracted from the content of an Item due to it being "potentially interesting". There's a whole load of extreme Lisp cleverness being worked on by M'Colleague Craig McMillan, he of the prodigious beard and piratical proclivities, regarding how you determine a word or group of words is potentially interesting, but in the words of Frank Drebin, that's not important right now.

The question is, what makes a Theme unique? On a superficial level, you can say that a Theme is a group of characters that make up words, so that group of characters must be unique. In other words, no two Theme records must exist with the same String in the "Title" field. But if you think about it a bit more, that might not actually be the case.

For instance, the problems of homonyms (a word that has the same pronunciation and spelling as another word, but a different meaning - e.g. "bat" the animal and "bat" in cricket) and polysemy (the capacity of a word or phrase to have multiple, related meanings that derive from the same etymology - e.g. "bank on it" with bank meaning to rely upon something, which derives from the reputation of bank-the-financial-institution for reliability) are perennial problems for Natural Language Processing. Should we take that into account in the model? Can we?

In the above cases, I can probably take the reasonable position that because the words are spelled the same, I can assume that they are actually related in this sense, and if two emails refer frequently to "bush", they should be pointing to the same DB record for the "bush" theme, regardless of whether they were talking about a US President, a shrub, or the Australian outback - and homonymy and polysemy can just be swept under the carpet. (Hmm, starting to get a lot of bulges under this carpet in the office here...where's my hammer?)

However, when you start to think about the possibility for the system to be dealing with multiple locales and even multiple languages, which may well be the case in large multi-national corporates, another, related, linguistic term starts to rear its ugly head - the Heterologue.

A Heterologue is a word that occurs in multiple languages, possibly with completely different meanings. For instance, the syllable "bat" in Cantonese means the number eight (at least, when pronounced with a high inflection), and having seen the way the lovely Lisa flits between English and Cantonese sometimes several times a sentence, even in emails where she types out the Cantonese word phonetically with English letters, this may well occur.

It's not just languages that show heterologues either - the same word or phrase in the same language in a different locale can have completely different meanings - the web is strewn with plenty of examples of British / American English ambiguities, the classic one being at Rocom when my newly-immigrated American colleague Stuart asked what Martin and I were doing for lunch, and I replied that we were driving into town so that Martin could pick up some fags, and did he want to come too?

So it boils down to this - should I include the locale of a Theme in the check for uniqueness along with the title (i.e. the actual string) or not? Is a given string of characters unique just within a particular locale, or globally? I'm tempted to say globally, but I have a nagging feeling that at some point in the not too distant future, that choice may turn round and bite me.
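One way to stop that choice biting too hard would be to keep the rule swappable. Here's a rough sketch of the two options side by side, purely as an illustration: ThemeUniqueness, GloballyUnique, UniquePerLocale and ThemeRegistry are all hypothetical names, the Set stands in for the database, and titles are compared exactly as typed (no case folding or other normalisation).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Cut-down, hypothetical Theme: just the string and the locale it came from.
class Theme {
    final String title;
    final Locale locale;

    Theme(String title, Locale locale) {
        this.title = title;
        this.locale = locale;
    }
}

// Hypothetical strategy: turns a Theme into the key that must be unique.
interface ThemeUniqueness {
    String keyFor(Theme theme);
}

// Option 1: the string alone must be unique, whatever the locale.
class GloballyUnique implements ThemeUniqueness {
    public String keyFor(Theme theme) {
        return theme.title;
    }
}

// Option 2: the string only has to be unique within its locale.
class UniquePerLocale implements ThemeUniqueness {
    public String keyFor(Theme theme) {
        return theme.locale.toString() + "|" + theme.title;
    }
}

// In-memory stand-in for the duplicate check a Theme DAO would make.
class ThemeRegistry {
    private final ThemeUniqueness rule;
    private final Set<String> keys = new HashSet<String>();

    ThemeRegistry(ThemeUniqueness rule) {
        this.rule = rule;
    }

    // Returns true if the theme is new under the chosen rule, false if it is a duplicate.
    boolean register(Theme theme) {
        return keys.add(rule.keyFor(theme));
    }
}
```

With GloballyUnique, an English "bat" and a Cantonese "bat" collapse into one record; with UniquePerLocale they stay separate. Going from the first rule to the second later on would still mean splitting existing records, so it doesn't make the decision go away, it just keeps it in one place.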