Saturday, October 13, 2007

Planning for a 1.5 Terabyte Database

Here's something to make you stroke your chin and stare into the distance for a moment or two -

A customer asked us about crunching a large amount of email. Some rough back-of-an-envelope calculations led us to expect up to about 150GB of it in one go. Our experiments with the Enron emails on MySQL produced roughly a 1:10 ratio of data-in to database size, roughly 1:1 ratios of data-in to language model size and to full-text index size, and a roughly linear (not 1:1, though!) increase in per-email processing time with elapsed cumulative crunching time.

So that means we can expect a database size of up to 1.5 Terabytes, plus another 300GB of language models and full-text index...
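The arithmetic behind that estimate can be sketched in a few lines of Python - a minimal sketch, assuming the Enron/MySQL ratios hold at the 150GB scale, and treating the 1:1 ratio as applying separately to the language models and the full-text index (which is how the 300GB figure comes out):

```python
# Back-of-envelope sizing from the ratios observed in the Enron/MySQL tests.
# Assumption: the same ratios hold at 150GB scale (and on Oracle).
raw_email_gb = 150            # expected data-in, in one go

db_gb = raw_email_gb * 10     # data-in : database size, roughly 1:10
lm_gb = raw_email_gb * 1      # data-in : language models, roughly 1:1
index_gb = raw_email_gb * 1   # data-in : full-text index, roughly 1:1

print(f"database: {db_gb} GB")                    # 1500 GB = 1.5 TB
print(f"models + index: {lm_gb + index_gb} GB")   # 300 GB
```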

(We'd be using Oracle for the DB, as it comes with some very handy out of the box management and monitoring tools, performance advisor alerts, and all that kaboodle, and it may be that the 1:10 ratio is different on Oracle - we're looking at that at the moment)

So how do you go about planning for a database of that size?

We can't even defer the question and scale as we go, as it's going to be growing to that size very quickly, within days of kicking off. It's got to be right, straight from the word go. I've worked with very large email datasets before, on Smartgroups, but at Freeserve we had a big team of specialist UNIX engineers to manage it.

How do you spec up the disk configuration, knowing that the database files are going to be that huge? We've tended to go for RAID 1 (mirrored) by default, as an until-now acceptable balance between simplicity and resilience. But this means that if you have, say, 2 x 500GB disks, you only actually get 500GB of storage. Dell provide up to 1TB disks on some servers, but damn it's pricey... we'll no doubt end up going for a combination of mirrored and striped configs, but that means more complexity, of course.
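To make that trade-off concrete, here's a toy usable-capacity calculation for the layouts under discussion - the disk counts and sizes are hypothetical, just for illustration:

```python
# Usable capacity under different RAID layouts (hypothetical disk counts/sizes).
def usable_gb(n_disks: int, disk_gb: int, raid: str) -> int:
    if raid == "raid1":    # simple mirror: half the raw capacity survives
        return n_disks * disk_gb // 2
    if raid == "raid10":   # stripe of mirrors: also half, but spread over more spindles
        return n_disks * disk_gb // 2
    if raid == "raid0":    # pure stripe: full capacity, no redundancy at all
        return n_disks * disk_gb
    raise ValueError(f"unknown layout: {raid}")

print(usable_gb(2, 500, "raid1"))    # 500 GB - the post's 2 x 500GB example
print(usable_gb(8, 500, "raid10"))   # 2000 GB usable from 4 TB raw
```

The point the numbers make: any mirrored scheme costs you half the raw capacity, so hitting 1.5TB usable means buying 3TB of disk before you even think about backups.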

On that subject, how do you organise the file system? And, crucially, how do you go about arranging back-ups for all that data? The last thing you want is for 1.5TB of data to get lost with no backup, and have to be re-calculated from scratch. The backups have to be stored somewhere, and even just the process of shifting that amount of data from one storage device to another is non-trivial. I mean, shifting 1.5x10^13 bits, even over a dedicated, maxed-out Gigabit ethernet is going to take at least 1.5x10^4 seconds - or just over 4hrs...
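That transfer-time figure works out like this - a sketch using the rounded 1.5x10^13-bit figure, and assuming a perfectly saturated dedicated link with zero protocol overhead (real-world throughput would be lower, so this is a best case):

```python
# Best-case time to shift the full dataset over a dedicated Gigabit link.
# Assumes zero protocol overhead and a fully saturated pipe.
data_bits = 1.5e13        # ~1.5 TB, rounded up to a tidy figure in bits
link_bps = 1e9            # Gigabit Ethernet, maxed out

seconds = data_bits / link_bps    # 15000 s
hours = seconds / 3600            # ~4.2 hours

print(f"{seconds:.0f} s ~= {hours:.1f} hours")
```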

Hmm, questions, questions, questions, chin stroke, thousand-yard-stare, tap mouth absently... I'll be pondering this for a while, I think.

Mind you, my old university mate Dan once emailed me back in his PhD days, saying "Hey, you know about databases, don't you? Could you give me a quick intro? I've got to write one to handle data from the SLAC particle accelerator - it's got to handle terabytes of data per day...."
- and that was way back in 1997 or so, when a 4GB drive would cost you nearly $500. Maybe time to give him a ping on Facebook....


Zac Spitzer / Fastmovingtarget said...

RAID 10, which is stripe and mirror everything, is good speed-wise and very redundant. Oracle recommends this.

Stick with SATA; SAS and SCSI are expensive. Make sure you buy a good RAID controller as well.

Brixon said...

I would start searching for the term "VLDB" (Very Large Database).

Alistair Davidson said...

Thanks for the references Zac & Brixon, I'll do some reading up.

Nathan Strutz said...

Depending on your funding, I would recommend a SAN - an external disk array that attaches to the server like a local HDD. SANs these days can automate backups (even database backups with differentials), can use multiple SCSI RAID configurations on the same partition, can attach to multiple servers (with different or the same partitions), can be nearly infinitely expandable, can use SCSI and SATA drives (some even mixed in the same partition), are much, much faster than any local drive array can be, have live hot-swappable drives, can function as a NAS (shared network drive), and more.

Ok maybe it's overkill, but it's THE way to go for large databases and busy systems.

Alistair Davidson said...

Thanks Nathan. We're also looking at the feasibility of farming the whole thing out to servers rented from Amazon :-)

eric said...

Have you seen this PDF from Google titled, "Failure Trends in a Large Disk Drive Population"? I'm pretty sure it is the Most Comprehensive Study Of Its Kind™. Also worthwhile is Carnegie Mellon's similar project.

Both got me looking at what exactly is Safe Backup™, and if memory serves: SATA drives are no less reliable; multiple redundancy is critical—tripled mirrors, with a determinate replacement policy.

Of course, the one thing both studies seemed to indicate was that the Purchasing Recommendations From Dell™ are expensive and bureaucratically-safe solutions first and foremost, but of negligible performance and reliability advantage over dramatically more inexpensive solutions.

Good luck.

Brixon said...

Here is one more place. I have started adding to my list of places to search.

A del.icio.us search is slow and does not work well for complex queries, but if you can guess what it might have been tagged with, then it usually finds some interesting resources.

Zac Spitzer / Fastmovingtarget said...

SAS and SCSI are meant to be a bit more reliable, but, relatively speaking, you are more likely to be able to afford a RAID 10 config with SATA.

With SAS and SCSI the disks are so damn small and expensive you are either going to spend way more money or go for a less redundant solution.

This type of solution is interesting.

External SATA enclosures... I would avoid the multi-port channel stuff and connect each drive via a cable directly to the RAID controller.