Thursday, January 19, 2006

Architecture Question: Is An Item Collection An Item?

I'm planning out the architecture of the Big Project and I keep circling around one issue, with potentially huge consequences for the app - should a Collection of Items be an Item in itself?

In previous similar projects the answer has been "no", but that has led to extreme code bloat in some areas - search, for instance, where you can end up doing a monster 500-line-plus query that UNION's onto lots of different collections of items in different tables in order to have them "pretend" to be an Item which can be returned as a search result.

So I'm tempted to go for it this time - but conversely, I want to avoid the situation where queries end up joining onto the same generic table many times under 5 or 6 different aliases. That way lies spaghetti code, and a massively steep learning curve for anyone to pick up and maintain someone else's code further down the line.

Hmmm...

To give an example -

Say we have Groups and Photosets. Groups have members, Photosets have photos.
A Group has many Members, and a User can be in many Groups.
Logically you would then have :

  • a table for Group

  • a table for User

  • a link table to join them together
    ( i.e. a many-to-many relationship done via an intermediary table )



So far, so sensible.

Now we introduce Photosets. Photosets have many Photos, and a Photo can be in many Photosets.

You can see where I'm going with this, right?

You would have :

  • a table for Photoset

  • a table for Photo

  • a link table to join them together



Do we stick with the "old" model of seperate entities/CFCs/DB tables for Group, Group Member, Photoset, PhotoSetPhoto ?

Or....

Do we abstract the common elements and have an ItemCollection table, and a link table called, say, ItemCollectionItem ?

So then a Group-IS-A-ItemCollection, and so is a PhotoSet.
A GroupMember-IS-A-ItemCollectionItem, and so is a PhotoSetPhoto.

This makes sense so far. (Well, it does to me, anyway...)

Next step - if both a Group and a Photoset are subclasses of an ItemCollection, what do they have in common, apart from the fact that they both have items which "belong" to them?

Well, what properties do Groups have?
Groups will have a title and a description, a date they were created, and an owner. (Plus some other stuff - whther people have to be invited, etc.)
But a Photoset will also have a title, a description, a date it was created, and an owner. Plus some other stuff :)

And hang on, everything on the system will also have these properties - those are our commmon fields for all our content items.

So surely it makes sense to have an ItemCollection "being" a subclass of Item?

The drawbacks of this, as I mentioned above, are that if we go down this route, you quickly end up with a few huge monolithic tables (Item, ItemCollection, ItemCollectionItem) which can cause performance problems later on. If virtually every query on the system needs to join to one of those tables, then virtually every query can get queued up due to table/index locks when those tables are being written to or updated/deleted from. This could prove a major bottleneck, especially when large back-end batch jobs are being run.

You also quickly end up with unreadable queries.
For instance, to get a list of groups plus their member names and statuses, a nice readable query like this:

SELECT
tblGroup.vcTitle,
tblUser.vcFirstName,
tblUser.vcLastName,
tblGroupMember.cStatus

FROM
tblGroup
LEFT OUTER JOIN tblGroupMember
ON tblGroupMember.intGroupID = tblGroup.intGroupID
LEFT OUTER JOIN tblUser
ON tblUser.intItemID = tblGroupMember.intUserItemID

would have to change to something like this :

SELECT
_Group.vcTitle,
tblUser.vcFirstName,
tblUser.vcLastName,
tblGroupMember.cStatus

FROM
-- get group title from tblItems, aliased as _Group
tblItems _Group
-- get group members by joining onto generic tblItemCollectionItems
LEFT OUTER JOIN tblItemCollectionItems
ON tblItemCollectionItems.intParentItemID = _Group.intItemID
LEFT OUTER JOIN tblGroupMembers
ON tblGroupMembers.intItemCollectionItemID = tblItemCollectionItem.intItemCollectionItemID
-- now get the user records for each member
LEFT OUTER JOIN tblUser
ON tblUser.intItemID = tblItemCollectionItem.intChildItemID


- ick. I had to comment that as I was writing it to keep track of which table was which and where I was going with it. And that's just a simple query - you can imagine the nastiness involved once we start, say, listing group tags as an aggregation of tags applied by members of the group to other items.

Bleh.

So I'd be interested to hear anyone's experiences and thoughts on which way to go - sacrifice readability and hence maintainability of database code for a conceptually neat object model? Or vice versa?

4 comments:

cosmin said...

me too :)

Anonymous said...

Create views to simplify the queries.

Most database will optimize the view, which will give you some speed increases.

If you're using MS SQL 2k+, you can even index a view--making them much more useful.

Alistair Davidson said...

Dan

That's a nice idea about the views, thanks for that. I've always tried to keep the actual database layer as thin as possible, on the basis that that would make it as portable as possible. On the other hand, I've only ever had to port database storage engine for a system twice in my whole career, and I don't see it happening here in the forseeable future....maybe I'll give it a try.

Al

Anonymous said...

You can only index views in the more expensive versions of SQL 2K. I assume the same is true for SQL2K5 The indexing also has quite a high storage space hit as essentially your view (every row, all the columns) is stored as a discrete item on disk at the time you index the view (SQL2000 here not sure if 2005 behaves better) thus doubling your disk storage requirement as you now need to keep the data in the underlying tables and once again in the view. Its very rare to have to index a view as the underlying indexes in the tables that make up the view should provide you with the query performance you are looking for.

I've seen them used for providing aggregations that are processing intensive, such as a sum of all values for a day as opposed to all the individual transactions for a day but I'd be surprised if an indexed view offered you any significant gain in the circumstance you've outlined. A good set of indexes on the underlying tables would do you as much good and cost less in storage.

The question regarding the use one big collection or use lots of small ones. I'd always go for the small ones. The more complex you make it the easier it is to break. I also believe in the 'do one thing' approach. Each component should do one thing that way its easier for the poor bod who follows you to unravel whats going on.

Just because you can doesn't mean you should. This stuff is complicated enough without setting the bar any higher.

A good way of using views would be to have a view for each different type of collection if you do decide to use one big generic set of tables. That would abstract out some of the complexity and cost very little resource wise.


Steve
steve@powell.net