(captured this afternoon at 16:00 UTC.)
Sorry MM/Adobe guys, I just had to Print Screen and save this one, so that next time one of our servers goes down, I have ammunition :)
Rants, raves and random thoughts on Ruby, Rails and Rabbit, plus Java, CFMX, methodologies, and development in general. And too much alliteration.
Tuesday, January 31, 2006
Wednesday, January 25, 2006
UKCFUG: Advanced Text Search & Processing in ColdFusion
Just a quick plug for the presentation I'll be giving tomorrow night in tandem with Matt Perdeaux at the London CFUG - Advanced Text Searching and Processing Methods in ColdFusion.
Matt is going to go through the theory and maths behind an N-Dimensional Vector Space text engine, and then I'll be showing how we actually implemented it in CF. I'll also be delving into some of the lesser-known SQL Server text functions, and showing how you can use them in tandem with Full-Text searching to achieve a more intuitive search.
Should be a goodie.
Thursday, January 19, 2006
Architecture Question: Is An Item Collection An Item?
I'm planning out the architecture of the Big Project and I keep circling around one issue, with potentially huge consequences for the app - should a Collection of Items be an Item in itself?
In previous similar projects the answer has been "no", but that has led to extreme code bloat in some areas - search, for instance, where you can end up with a monster 500-line-plus query that UNIONs together lots of different collections of items from different tables so that they can "pretend" to be an Item and be returned as a search result.
So I'm tempted to go for it this time - but conversely, I want to avoid the situation where queries end up joining onto the same generic table many times under 5 or 6 different aliases. That way lies spaghetti code, and a massively steep learning curve for anyone to pick up and maintain someone else's code further down the line.
Hmmm...
To give an example -
Say we have Groups and Photosets. Groups have members, Photosets have photos.
A Group has many Members, and a User can be in many Groups.
Logically you would then have :
- a table for Group
- a table for User
- a link table to join them together
( i.e. a many-to-many relationship done via an intermediary table )
So far, so sensible.
Now we introduce Photosets. Photosets have many Photos, and a Photo can be in many Photosets.
You can see where I'm going with this, right?
You would have :
- a table for Photoset
- a table for Photo
- a link table to join them together
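As a rough sketch of that "old" model, here is the Group/User half as three tables, run against an in-memory SQLite database from Python so it can be tried without a SQL Server to hand. The table and column names follow the query style used later in this post; the sample data and the two helper functions are made up for illustration. The Photoset/Photo half would look exactly the same with different names.

```python
import sqlite3

# In-memory database; the schema mirrors the naming used elsewhere in the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblGroup (
    intGroupID INTEGER PRIMARY KEY,
    vcTitle    TEXT NOT NULL
);
CREATE TABLE tblUser (
    intItemID   INTEGER PRIMARY KEY,
    vcFirstName TEXT NOT NULL,
    vcLastName  TEXT NOT NULL
);
-- Link table: one row per membership, giving the many-to-many relationship.
CREATE TABLE tblGroupMember (
    intGroupID    INTEGER NOT NULL REFERENCES tblGroup(intGroupID),
    intUserItemID INTEGER NOT NULL REFERENCES tblUser(intItemID),
    cStatus       TEXT NOT NULL DEFAULT 'A',
    PRIMARY KEY (intGroupID, intUserItemID)
);
""")

# Illustrative sample data (not from any real system).
conn.executemany("INSERT INTO tblGroup VALUES (?, ?)",
                 [(1, "Climbers"), (2, "Coders")])
conn.executemany("INSERT INTO tblUser VALUES (?, ?, ?)",
                 [(1, "Ann", "Lee"), (2, "Bob", "Ray")])
conn.executemany(
    "INSERT INTO tblGroupMember (intGroupID, intUserItemID) VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1)],
)


def groups_for_user(user_id):
    """Titles of every group the given user belongs to."""
    rows = conn.execute(
        """SELECT g.vcTitle
           FROM tblGroup g
           JOIN tblGroupMember gm ON gm.intGroupID = g.intGroupID
           WHERE gm.intUserItemID = ?
           ORDER BY g.vcTitle""",
        (user_id,),
    )
    return [row[0] for row in rows]


def members_of_group(group_id):
    """First names of every member of the given group."""
    rows = conn.execute(
        """SELECT u.vcFirstName
           FROM tblUser u
           JOIN tblGroupMember gm ON gm.intUserItemID = u.intItemID
           WHERE gm.intGroupID = ?
           ORDER BY u.vcFirstName""",
        (group_id,),
    )
    return [row[0] for row in rows]
```

The same link table answers both questions, "which groups is this user in?" and "who is in this group?", which is the whole point of the intermediary table.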
Do we stick with the "old" model of separate entities/CFCs/DB tables for Group, GroupMember, Photoset, PhotoSetPhoto?
Or....
Do we abstract the common elements and have an ItemCollection table, and a link table called, say, ItemCollectionItem?
So then a Group-IS-A-ItemCollection, and so is a PhotoSet.
A GroupMember-IS-A-ItemCollectionItem, and so is a PhotoSetPhoto.
This makes sense so far. (Well, it does to me, anyway...)
Next step - if both a Group and a Photoset are subclasses of an ItemCollection, what do they have in common, apart from the fact that they both have items which "belong" to them?
Well, what properties do Groups have?
Groups will have a title and a description, a date they were created, and an owner. (Plus some other stuff - whether people have to be invited, etc.)
But a Photoset will also have a title, a description, a date it was created, and an owner. Plus some other stuff :)
And hang on, everything on the system will also have these properties - those are our common fields for all our content items.
So surely it makes sense to have an ItemCollection "being" a subclass of Item?
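To make the idea concrete, here is a minimal sketch of that hierarchy in Python rather than CFCs. The class names come from the discussion above; the field names, the `invite_only` flag, the `Photo` class and the `search` helper are my own assumptions for illustration, not a finished design.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Item:
    """The common fields that every content item on the system shares."""
    title: str
    description: str
    owner: str
    created: datetime = field(default_factory=datetime.now)


@dataclass
class ItemCollection(Item):
    """A collection is itself an Item that also holds other Items."""
    items: list = field(default_factory=list)

    def add(self, item: Item) -> None:
        self.items.append(item)


@dataclass
class Photo(Item):
    pass


@dataclass
class Photoset(ItemCollection):
    pass


@dataclass
class Group(ItemCollection):
    invite_only: bool = False  # stand-in for the "other stuff"


def search(items, term):
    """One search over plain Items; collections need no special casing."""
    term = term.lower()
    return [
        item for item in items
        if term in item.title.lower() or term in item.description.lower()
    ]


photo = Photo("Sunset", "Beach at dusk", "ann")
photoset = Photoset("Holiday sunsets", "Summer trip", "ann")
photoset.add(photo)
group = Group("Beach walkers", "We like sunsets", "bob", invite_only=True)

hits = search([photo, photoset, group], "sunset")
print([hit.title for hit in hits])
```

Because a Photoset and a Group are both Items here, the search returns them alongside the Photo without any of the "pretend to be an Item" work that the UNION approach needs.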
The drawbacks of this, as I mentioned above, are that if we go down this route, you quickly end up with a few huge monolithic tables (Item, ItemCollection, ItemCollectionItem) which can cause performance problems later on. If virtually every query on the system needs to join to one of those tables, then virtually every query can get queued up due to table/index locks when those tables are being written to or updated/deleted from. This could prove a major bottleneck, especially when large back-end batch jobs are being run.
You also quickly end up with unreadable queries.
For instance, to get a list of groups plus their member names and statuses, a nice readable query like this:
    SELECT
        tblGroup.vcTitle,
        tblUser.vcFirstName,
        tblUser.vcLastName,
        tblGroupMember.cStatus
    FROM
        tblGroup
        LEFT OUTER JOIN tblGroupMember
            ON tblGroupMember.intGroupID = tblGroup.intGroupID
        LEFT OUTER JOIN tblUser
            ON tblUser.intItemID = tblGroupMember.intUserItemID
would have to change to something like this:
    SELECT
        _Group.vcTitle,
        tblUser.vcFirstName,
        tblUser.vcLastName,
        tblGroupMembers.cStatus
    FROM
        -- get group title from tblItems, aliased as _Group
        tblItems _Group
        -- get group members by joining onto generic tblItemCollectionItems
        LEFT OUTER JOIN tblItemCollectionItems
            ON tblItemCollectionItems.intParentItemID = _Group.intItemID
        LEFT OUTER JOIN tblGroupMembers
            ON tblGroupMembers.intItemCollectionItemID = tblItemCollectionItems.intItemCollectionItemID
        -- now get the user records for each member
        LEFT OUTER JOIN tblUser
            ON tblUser.intItemID = tblItemCollectionItems.intChildItemID
- ick. I had to comment that as I was writing it to keep track of which table was which and where I was going with it. And that's just a simple query - you can imagine the nastiness involved once we start, say, listing group tags as an aggregation of tags applied by members of the group to other items.
Bleh.
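For anyone who wants to poke at the generic version, here is a rough runnable sketch against an in-memory SQLite database from Python. The table names match the generic query above; the column types, the sample rows and in particular the `cItemType` column are assumptions of mine. Some kind of type filter seems unavoidable, because tblItems holds every kind of item and the query as written would otherwise return users and photos as if they were groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblItems (
    intItemID INTEGER PRIMARY KEY,
    vcTitle   TEXT,
    cItemType TEXT NOT NULL   -- assumed discriminator: 'G' = group, 'U' = user
);
CREATE TABLE tblUser (
    intItemID   INTEGER PRIMARY KEY REFERENCES tblItems(intItemID),
    vcFirstName TEXT,
    vcLastName  TEXT
);
CREATE TABLE tblItemCollectionItems (
    intItemCollectionItemID INTEGER PRIMARY KEY,
    intParentItemID INTEGER NOT NULL REFERENCES tblItems(intItemID),
    intChildItemID  INTEGER NOT NULL REFERENCES tblItems(intItemID)
);
CREATE TABLE tblGroupMembers (
    intItemCollectionItemID INTEGER PRIMARY KEY
        REFERENCES tblItemCollectionItems(intItemCollectionItemID),
    cStatus TEXT
);

-- Illustrative data: one group with two members, plus one empty group.
INSERT INTO tblItems VALUES
    (1, 'Climbers', 'G'), (2, NULL, 'U'), (3, NULL, 'U'), (4, 'Empty group', 'G');
INSERT INTO tblUser VALUES (2, 'Ann', 'Lee'), (3, 'Bob', 'Ray');
INSERT INTO tblItemCollectionItems VALUES (10, 1, 2), (11, 1, 3);
INSERT INTO tblGroupMembers VALUES (10, 'A'), (11, 'P');
""")

GROUP_MEMBERS_SQL = """
SELECT
    _Group.vcTitle,
    tblUser.vcFirstName,
    tblUser.vcLastName,
    tblGroupMembers.cStatus
FROM
    tblItems _Group
    LEFT OUTER JOIN tblItemCollectionItems
        ON tblItemCollectionItems.intParentItemID = _Group.intItemID
    LEFT OUTER JOIN tblGroupMembers
        ON tblGroupMembers.intItemCollectionItemID =
           tblItemCollectionItems.intItemCollectionItemID
    LEFT OUTER JOIN tblUser
        ON tblUser.intItemID = tblItemCollectionItems.intChildItemID
WHERE
    _Group.cItemType = 'G'   -- assumed filter, not in the query above
ORDER BY
    _Group.vcTitle, tblUser.vcFirstName
"""

rows = conn.execute(GROUP_MEMBERS_SQL).fetchall()
for row in rows:
    print(row)
```

Four tables and an extra filter to answer a question that took three tables before, and the empty group still comes back with NULLs for its member columns thanks to the outer joins.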
So I'd be interested to hear anyone's experiences and thoughts on which way to go - sacrifice readability and hence maintainability of database code for a conceptually neat object model? Or vice versa?
Wednesday, January 18, 2006
A Question of Controllers
So I'm nearing the end of the planning stage of a pretty big project. By big, I mean it's for a big blue chip company that you will certainly have heard of, and the project itself is big in terms of code. I can't go into too much detail about the specifics, but if you think of Flickr, del.icio.us, Smartgroups, Craigslist, plus the NIMHE KC all rolled into one, you're not far off. And the timescale, as ever, is extremely tight.
We've almost got enough of the functionality nailed down to start working out the entity relationship diagram (ERD) and object model that we'll be using, and decide on what framework to use.
I'm pretty sold on the idea of using a ValueObject/TransferObject + DAO + Controller pattern for the entity objects, and keeping all data members private - I've used this on a previous project with diabolical levels of complexity in the data, and it was one of the best design decisions I ever took. The extra tedium of writing getters and setters over and over again paid off tenfold in having the complexity encapsulated. Plus it meant we could do neat value-adds like having the object itself keep track of when it had been modified, and enable/disable "save" buttons appropriately, to make the app feel more like a desktop app - which is, after all, The Holy Grail of t'interweb.
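As a minimal sketch of what I mean by the object tracking its own modifications, here it is in Python rather than CFCs. The `GroupVO` and `InMemoryGroupDAO` names, their fields, and the dictionary standing in for the database are all hypothetical; a real DAO would obviously talk to the database.

```python
class GroupVO:
    """Value object: private data, accessors, and a modified flag."""

    def __init__(self, title, description):
        self._title = title
        self._description = description
        self._modified = False

    @property
    def title(self):
        return self._title

    @title.setter
    def title(self, value):
        # Only flag a change when the value really differs.
        if value != self._title:
            self._title = value
            self._modified = True

    @property
    def description(self):
        return self._description

    @description.setter
    def description(self, value):
        if value != self._description:
            self._description = value
            self._modified = True

    @property
    def is_modified(self):
        return self._modified

    def mark_clean(self):
        """Called by the DAO once the current state has been persisted."""
        self._modified = False


class InMemoryGroupDAO:
    """Stand-in persistence layer; a dict plays the part of the database."""

    def __init__(self):
        self._rows = {}

    def save(self, group_id, vo):
        self._rows[group_id] = (vo.title, vo.description)
        vo.mark_clean()

    def load(self, group_id):
        title, description = self._rows[group_id]
        return GroupVO(title, description)


dao = InMemoryGroupDAO()
group = GroupVO("Climbers", "People who like walls")

group.title = "Climbers"            # same value, so nothing to save
unchanged = group.is_modified

group.title = "London Climbers"     # real change, "save" button lights up
changed = group.is_modified

dao.save(1, group)                  # persisted, "save" button goes grey again
after_save = group.is_modified
```

The view only ever has to ask `is_modified` to decide whether the "save" button should be enabled; nothing outside the object needs to compare old and new values.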
So that's the Model part of MVC pretty much sorted, and the persistence layer should be encapsulated nicely within the Model itself. The View part is pretty transferable whatever framework we end up using - a display template is a display template and should never be anything other than a display template, whether in Fusebox, Mach-II or Model-Glue, or whatever.
The decision I am still pondering is which framework to use for the Controller part of the app. We've tended to use a home-brewed hybrid style up till now - a Fusebox 3 app with CFCs instead of act_ files, which we tentatively called HOOF - Hybrid Object-Oriented Fusebox. However, in some recent projects of comparable scale, there have been several times when I've found myself thinking "damn, I'm copying-and-pasting this bit of code AGAIN, if I could just broadcast an event this could be so much easier..."
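To show the kind of thing I mean by "broadcast an event", here is a bare-bones sketch in Python. This is not the Mach-II or Model-Glue API, just the general idea; the `EventBus` class, the `photo.uploaded` event name and the listeners are invented for illustration.

```python
class EventBus:
    """Tiny publish/subscribe hub: listeners register by event name."""

    def __init__(self):
        self._listeners = {}

    def subscribe(self, event_name, listener):
        self._listeners.setdefault(event_name, []).append(listener)

    def broadcast(self, event_name, **payload):
        # Call every listener registered for this event, in order.
        for listener in self._listeners.get(event_name, []):
            listener(**payload)


log = []
bus = EventBus()

# Each concern registers once, instead of being pasted into every action.
bus.subscribe("photo.uploaded",
              lambda photo_id, owner: log.append(f"index photo {photo_id}"))
bus.subscribe("photo.uploaded",
              lambda photo_id, owner: log.append(f"notify friends of {owner}"))

# The code handling the upload only announces what happened.
bus.broadcast("photo.uploaded", photo_id=42, owner="ann")
bus.broadcast("photo.deleted", photo_id=42)   # no listeners, nothing happens

print(log)
```

The upload code no longer knows or cares who reacts to it, which is exactly the bit that currently gets copied and pasted, and also exactly the bit that makes the flow harder to follow when you are used to reading it top to bottom.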
I'd like to make the leap to Mach-II, or Model-Glue, but the trouble is that tight timescale, and the rest of the team. Everyone here is very comfortable with our HOOF methodology, and from my previous tinkering with small Mach-II apps, making the leap to truly event-driven architecture is a subtle but significant shift in thinking. You can find yourself feeling lost without the safety of a procedural code section explicitly tying the disparate bits together, like the first time you try lead-climbing a large wall without the reassuring sight of the rope stretching out above you.
So when the deadline is tight, the pressure is high, and the app is fiendishly complicated, is it worth taking the risk of moving to a methodology which promises to help manage the complexity, but in which the team has little or no experience? Or should we stick with an approach which has its limitations, but we can code in our sleep?
Known knowns vs. Unknown Unknowns - Decisions, decisions.....
Tuesday, January 10, 2006
London Web 2.0 Summit
I just registered for the London Web 2.0 "Summit". I put "Summit" in quotes, as that's what they call it, but for me that word is inextricably associated with the grey days of the Cold War, with representatives of two monolithic superpowers squaring off over a conference table and deciding how they were going to carve up Europe between them.
Mind you, there's Tom Coates of Yahoo! and Steven Meschkat of Google on the same bill, so maybe that analogy wasn't too far off...
Looks like an interesting list of speakers though - people from 37signals, del.icio.us, Flickr, Feedburner, etc etc etc. I'll be particularly interested to hear how they coped with sudden explosive growth - what re-engineering did they have to do and how did they handle it when their little pet prototype app became a poster child for the Web 2.0 crowd? If RSS Ate My Server, then how did they get round these issues on, say, del.icio.us? OK, so they're not using CF, but to some extent I think it's a generic issue.