Here's something to make you stroke your chin and stare into the distance for a moment or two -
A customer asked us about crunching a large amount of email. Some rough back-of-an-envelope calculations lead us to expect up to about 150GB of it in one go. Our experiments with the Enron emails on MySQL produced about a 1:10 ratio of data-in to database size, a ratio of roughly 1:1 on data-in to language model size and full-text index, and a roughly linear (not 1:1 though!) increase in processing time per email with elapsed cumulative crunching time.
So that means we can expect a database size of up to 1.5 Terabytes, plus another 300GB of language models and full-text index...
(We'd be using Oracle for the DB, as it comes with some very handy out of the box management and monitoring tools, performance advisor alerts, and all that kaboodle, and it may be that the 1:10 ratio is different on Oracle - we're looking at that at the moment)
So how do you go about planning for a database of that size?
We can't even defer the question and scale-as-we-go, as it's going to be growing to that size very quickly, within days of kicking off. It's got to be right, straight from the word go. I've worked with very large email datasets before, on Smartgroups, but in Freeserve we had a big team of specialist UNIX engineers to manage it.
How do you spec up the disk configuration, knowing that the database files are going to be that huge? We've tended to go for RAID 1 (mirrored) by default, as an until-now acceptable balance between simplicity and resilience. But this means that if you have, say, 2 x 500GB disks, you only actually get 500GB of storage. Dell provide up to 1TB disks on some servers, but damn it's pricey... we'll no doubt end up going for a combination of mirrored, striped configs, but that means more complexity of course.
On that subject, how do you organise the file system? And, crucially, how do you go about arranging back-ups for all that data? The last thing you want is for 1.5TB of data to get lost with no backup, and have to be re-calculated from scratch. The backups have to be stored somewhere, and even just the process of shifting that amount of data from one storage device to another is non-trivial. I mean, shifting 1.5x1013 bits, even over a dedicated, max-ed out Gigabit ethernet is going to take at least 1.5x104 seconds - or just over 4hrs...
Hmm, questions, questions, questions, chin stroke, thousand-yard-stare, tap mouth absently... I'll be pondering this for a while, I think.
Mind you, my old university mate Dan once emailed me back in his PhD days, saying "Hey, you know about databases, don't you? Could you give me a quick intro? I've got to write one to handle data from the SLAC particle accelerator - it's got to handle terabytes of data per day...."
- and that was way back in about 1997 or so, when a 4GB drive would cost you nearly $500. Maybe time to give him a ping on Facebook....