In blatant defiance of all known precedent, someone has not only read my climbing/mountaineering/great-outdoors-in-general blog, but actually thinks it's worth advertising on!
They emailed me to ask how much I'd charge per month for text ads. Trouble is, I don't really know: like 99% of the rest of the blogosphere, I just shoved AdSense on there and forgot about it. Really forgot about it, in fact; I haven't even had my first Google cheque yet.
So, er.... hmmm.... I don't want to undersell myself, but equally I don't want to give them a price that instantly marks me out as a no-hoper. Ponder, ponder, stick finger in the air...
But I'm a physicist by training, dammit, I should be able to work this out, surely. So let's do a quick calculation.....
It's a pretty low-traffic blog (a couple of hundred hits a day), but it's quite targeted, and AdSense shows an effective CPM of around $3 with a clickthrough ratio of just over 1%.
So, if I'm getting H hits per day, and the CPM (cost per thousand, I think?) is C, then the price per month should be (avg number of days in a month) * (H / 1000) * C, right?
That gives me a baseline price of around $30 / month. Obviously I can adjust that depending on page placement, as ads just below the banner get more clicks than in-page, but does that sound about right for a low-traffic, specialist enthusiast blog?
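For anyone who wants to plug in their own numbers, here's a minimal sketch of that formula in Python. The function names are mine, and reading "a couple of hundred" as exactly 200 hits a day is an assumption. On that reading the formula comes out nearer $18 a month; a $30 baseline corresponds to roughly 330 hits a day at a $3 CPM.

```python
# Rough monthly ad price from daily hits and effective CPM.
# Assumption: every hit counts as one ad impression.

AVG_DAYS_PER_MONTH = 365.25 / 12  # about 30.44


def monthly_ad_price(hits_per_day: float, cpm_dollars: float) -> float:
    """Baseline price per month: days * (H / 1000) * C."""
    return AVG_DAYS_PER_MONTH * (hits_per_day / 1000) * cpm_dollars


def hits_needed(target_price: float, cpm_dollars: float) -> float:
    """Daily hits required for the formula to reach a target monthly price."""
    return target_price * 1000 / (AVG_DAYS_PER_MONTH * cpm_dollars)


if __name__ == "__main__":
    for hits in (200, 300, 350):
        price = monthly_ad_price(hits, 3.0)
        print(f"{hits} hits/day at $3 CPM: ${price:.2f}/month")
    print(f"Hits/day needed for $30/month: {hits_needed(30, 3.0):.0f}")
```

Placement adjustments would just be a multiplier on top of whatever baseline this gives.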
Rants, raves and random thoughts on Ruby, Rails and Rabbit, plus Java, CFMX, methodologies, and development in general. And too much alliteration.
Monday, October 29, 2007
Saturday, October 13, 2007
Planning for a 1.5 Terabyte Database
Here's something to make you stroke your chin and stare into the distance for a moment or two -
A customer asked us about crunching a large amount of email. Some rough back-of-an-envelope calculations lead us to expect up to about 150GB of it in one go. Our experiments with the Enron emails on MySQL produced about a 1:10 ratio of data-in to database size, a ratio of roughly 1:1 on data-in to language model size and full-text index, and a roughly linear (not 1:1 though!) increase in processing time per email with elapsed cumulative crunching time.
So that means we can expect a database size of up to 1.5 Terabytes, plus another 300GB of language models and full-text index...
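As a sanity check, here's the same estimate as a small Python sketch. The function and parameter names are hypothetical, and I'm assuming the 1:1 ratio applies separately to the language models and to the full-text index, which is the reading that gives 300GB.

```python
# Back-of-an-envelope storage estimate from the ratios measured on MySQL.
# Assumption: the ratios stay constant as the input grows, and the 1:1
# ratio applies to the language models and the full-text index separately.


def estimate_storage_gb(raw_gb: float,
                        db_ratio: float = 10.0,
                        model_ratio: float = 1.0,
                        index_ratio: float = 1.0) -> dict:
    """Return the estimated size in GB of each component, plus the total."""
    sizes = {
        "database": raw_gb * db_ratio,
        "language_models": raw_gb * model_ratio,
        "full_text_index": raw_gb * index_ratio,
    }
    sizes["total"] = sum(sizes.values())
    return sizes


if __name__ == "__main__":
    for name, gb in estimate_storage_gb(150).items():
        print(f"{name}: {gb:.0f} GB")
```

If the ratio turns out to be different on Oracle, only db_ratio needs to change.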
(We'd be using Oracle for the DB, as it comes with some very handy out of the box management and monitoring tools, performance advisor alerts, and all that kaboodle, and it may be that the 1:10 ratio is different on Oracle - we're looking at that at the moment)
So how do you go about planning for a database of that size?
We can't even defer the question and scale-as-we-go, as it's going to be growing to that size very quickly, within days of kicking off. It's got to be right, straight from the word go. I've worked with very large email datasets before, on Smartgroups, but in Freeserve we had a big team of specialist UNIX engineers to manage it.
How do you spec up the disk configuration, knowing that the database files are going to be that huge? We've tended to go for RAID 1 (mirrored) by default, as an until-now acceptable balance between simplicity and resilience. But this means that if you have, say, 2 x 500GB disks, you only actually get 500GB of storage. Dell provide up to 1TB disks on some servers, but damn it's pricey... we'll no doubt end up going for a combination of mirrored, striped configs, but that means more complexity of course.
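To put rough numbers on the mirrored-versus-striped trade-off, here's a hedged sketch of usable capacity for the common RAID levels. It's idealised: it ignores filesystem overhead, hot spares and the decimal-versus-binary gigabyte gap, and the function names are made up for illustration.

```python
import math

# Idealised usable capacity for a few RAID levels, assuming identical disks.


def usable_gb(level: str, disks: int, disk_gb: float) -> float:
    """Usable capacity in GB for raid0, raid1 or raid10."""
    if level == "raid0":   # striping only, no redundancy
        return disks * disk_gb
    if level == "raid1":   # every disk holds the same data
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return disk_gb
    if level == "raid10":  # stripe across mirrored pairs
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, 4 or more")
        return (disks // 2) * disk_gb
    raise ValueError(f"unknown RAID level: {level}")


def raid10_disks_needed(target_gb: float, disk_gb: float) -> int:
    """Smallest RAID 10 array (in disks) with at least target_gb usable."""
    pairs = max(2, math.ceil(target_gb / disk_gb))
    return 2 * pairs


if __name__ == "__main__":
    print(usable_gb("raid1", 2, 500))       # the 2 x 500GB mirror above
    print(raid10_disks_needed(1800, 500))   # database plus models and index
    print(raid10_disks_needed(1800, 1000))
```

On these assumptions, holding about 1.8TB on mirrored, striped 500GB disks takes eight of them, or four of the 1TB ones, before leaving any headroom for growth or backups.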
On that subject, how do you organise the file system? And, crucially, how do you go about arranging back-ups for all that data? The last thing you want is for 1.5TB of data to get lost with no backup, and have to be re-calculated from scratch. The backups have to be stored somewhere, and even just the process of shifting that amount of data from one storage device to another is non-trivial. I mean, shifting 1.5 x 10^13 bits, even over a dedicated, maxed-out Gigabit Ethernet is going to take at least 1.5 x 10^4 seconds - or just over 4hrs...
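Here's that transfer-time arithmetic as a small sketch, with hypothetical names. One caveat: the bit count quoted above works out to about 10 bits per byte, which I'm treating as a rule-of-thumb allowance for wire overhead. At a strict 8 bits per byte, 1.5TB is 1.2 x 10^13 bits and the theoretical floor is nearer 3.3 hours. Either way it's a best case, and real disks and protocols will be slower.

```python
# Theoretical minimum time to move a dataset over a network link.
# Assumptions: decimal units (1 TB = 1e12 bytes), a dedicated link,
# and no disk or protocol bottleneck beyond the stated efficiency.


def transfer_seconds(data_bytes: float,
                     link_bits_per_sec: float,
                     bits_per_byte: float = 8.0,
                     efficiency: float = 1.0) -> float:
    """Seconds needed to push data_bytes through the link."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return data_bytes * bits_per_byte / (link_bits_per_sec * efficiency)


if __name__ == "__main__":
    data = 1.5e12   # 1.5 TB
    gigabit = 1e9   # 1 Gbit/s
    for bits in (8.0, 10.0):
        secs = transfer_seconds(data, gigabit, bits_per_byte=bits)
        print(f"{bits:.0f} bits/byte: {secs:.0f} s = {secs / 3600:.2f} hours")
```

Dropping efficiency to something like 0.5 gives a feel for how quickly a backup window stretches once the link isn't actually maxed out.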
Hmm, questions, questions, questions, chin stroke, thousand-yard-stare, tap mouth absently... I'll be pondering this for a while, I think.
Mind you, my old university mate Dan once emailed me back in his PhD days, saying "Hey, you know about databases, don't you? Could you give me a quick intro? I've got to write one to handle data from the SLAC particle accelerator - it's got to handle terabytes of data per day...."
- and that was way back in about 1997 or so, when a 4GB drive would cost you nearly $500. Maybe time to give him a ping on Facebook....
Labels: database, disk, filesystem, oracle, raid, scalability, size, storage, terabyte
Friday, October 12, 2007
It Lives..!
OK, so posts have been a bit thin on the ground lately, thanks to being very busy at work, mostly on pre-sales or client-confidential stuff.... and in general, just being so damn busy all the time.
I've been to the States on a client visit, where we also went to see the supremely-smashing Nellie McKay play at The Birchmere. I'd never heard her before, and Peter just said "She's kind of hard to describe..." - I ended up describing the gig as like Frank Zappa and Paul Simon jamming with Suzanne Vega and Phoebe from Friends, and I think that's about as close as I can get.
I've been sport-climbing in the magnificent El Chorro in Andalucia, which I'll post more about on Dynamove (which I've also been neglecting lately)
I've been studying the Social Dynamics of Werewolves and Lynch Mobs...and if you want an atmospheric bar in which to do so, you couldn't get better than Shunt - it was like walking into the bar in An American Werewolf In Paris
I've also been coding more in the last couple of weeks, working on LDAP importers and such, and I've come up against some interesting questions of scale, which I'll post more about later....