Friday, December 21, 2007

SQL QOTW - Solution!

To anyone who has been bashing their head against the desk with the SQL brain-buggerer I posted earlier, here is the solution I came up with. There may well be other solutions, possibly more elegant ones, but here's the one that occurred to me...

OK, so you want to get all the documents up to a given cumulative total byte size.
A straightforward get-me-all-the-docs query looks like this:
SELECT id, filename, bytesize
FROM documents


We want to calculate the cumulative total size for all the records we've got so far, and then stop once we're about to pass the total. So the first problem is how do we work out the cumulative total?

Well, we're essentially dealing with an ordered list of records. We effectively want to work down the ordered list, calculating the cumulative total as we go.

So first let's impose an order on the list:

SELECT id, filename, bytesize
FROM documents
ORDER BY id

Now for each record, we want to work out the cumulative total. I had many thoughts on how to keep a running total, but I eventually realised that you don't actually need to. Following the old physics student's guiding principle of "if you find yourself with a problem you can't solve, rewrite it into a problem that you can solve"**, here's the trick - we don't actually need to calculate this as "x + (next x)" - as long as we use the same imposed order, we can get this with a rather cunning subquery:

SELECT id, filename, bytesize,
       ( SELECT SUM(bytesize)
         FROM documents d2
         WHERE d2.id <= documents.id
       ) AS cumulative_total
FROM documents
ORDER BY id
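If you want to watch the cumulative column take shape, here's a quick sanity check using Python's built-in sqlite3 module. (SQLite isn't one of the target databases, and the filenames and sizes below are made up, but the SQL is plain enough to run there unchanged.)

```python
# Sketch: verify the correlated-subquery running total on a toy table.
# SQLite stands in for Oracle/MySQL here; the data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents "
             "(id INTEGER PRIMARY KEY, filename TEXT, bytesize INTEGER)")
conn.executemany("INSERT INTO documents VALUES (?, ?, ?)",
                 [(1000, "a.doc", 300),
                  (1001, "b.doc", 500),
                  (1002, "c.doc", 200)])

rows = conn.execute("""
    SELECT id, filename, bytesize,
           ( SELECT SUM(bytesize)
             FROM documents d2
             WHERE d2.id <= documents.id
           ) AS cumulative_total
    FROM documents
    ORDER BY id""").fetchall()

for row in rows:
    print(row)
# (1000, 'a.doc', 300, 300)
# (1001, 'b.doc', 500, 800)
# (1002, 'c.doc', 200, 1000)
```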


You see how it works? If we're working down an ordered list of ids-

1000
1001
1002
1003
1004
...
etc



- then keeping a running total is exactly equivalent to querying for the sum of all rows up-to-and-including the current row. So we can now move the subquery from the SELECT clause into the WHERE clause, and it gives us a sufficient criterion for accepting or rejecting rows:


SELECT id, filename, bytesize
FROM documents
WHERE ( SELECT SUM(bytesize)
        FROM documents d2
        WHERE d2.id <= documents.id
      ) < @max_cumulative_size * 1024 * 1024
ORDER BY id

( where @max_cumulative_size is the maximum total size in megabytes, obviously )
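And here's the whole thing in action, again using in-memory SQLite as a stand-in for the real databases, with a toy 900-byte cap in place of @max_cumulative_size * 1024 * 1024:

```python
# Sketch: the WHERE-clause version selecting rows up to a byte cap.
# SQLite stands in for Oracle/MySQL; the data and cap are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents "
             "(id INTEGER PRIMARY KEY, filename TEXT, bytesize INTEGER)")
conn.executemany("INSERT INTO documents VALUES (?, ?, ?)",
                 [(1000, "a.doc", 300),
                  (1001, "b.doc", 500),
                  (1002, "c.doc", 200)])

max_bytes = 900  # toy stand-in for @max_cumulative_size * 1024 * 1024
batch = conn.execute("""
    SELECT id, filename, bytesize
    FROM documents
    WHERE ( SELECT SUM(bytesize)
            FROM documents d2
            WHERE d2.id <= documents.id
          ) < ?
    ORDER BY id""", (max_bytes,)).fetchall()

# ids 1000 and 1001 fit (300 + 500 = 800 < 900); adding id 1002
# would take the running total to 1000, so it is left out.
print([r[0] for r in batch])
# [1000, 1001]
```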

It all hinges on the ordering. The one proviso with this query is that the final ORDER BY only controls how the results come back - which rows are accepted is fixed entirely by the subquery's comparison, which implicitly accumulates in ascending id order. If you want the running total to accumulate in some other order, the subquery's WHERE clause has to change to match.
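The original problem wanted successive X MB batches, not just the first one, so here's one way to pick up where the last batch left off. This is my own sketch, not something from the post - the last_id parameter is an invention, holding the highest id the previous batch processed - but the idea is just to restrict both the outer query and the subquery to ids beyond the last one seen, so the running total restarts from zero:

```python
# Sketch: fetching successive batches by excluding already-processed ids.
# The next_batch helper and its parameters are hypothetical; SQLite
# stands in for Oracle/MySQL and the data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents "
             "(id INTEGER PRIMARY KEY, filename TEXT, bytesize INTEGER)")
conn.executemany("INSERT INTO documents VALUES (?, ?, ?)",
                 [(1000, "a.doc", 300), (1001, "b.doc", 500),
                  (1002, "c.doc", 200), (1003, "d.doc", 400)])

def next_batch(last_id, max_bytes):
    # Both id filters use last_id, so the cumulative total only
    # counts rows belonging to the current batch.
    return conn.execute("""
        SELECT id, filename, bytesize
        FROM documents
        WHERE id > ?
          AND ( SELECT SUM(bytesize)
                FROM documents d2
                WHERE d2.id <= documents.id
                  AND d2.id > ?
              ) < ?
        ORDER BY id""", (last_id, last_id, max_bytes)).fetchall()

first = next_batch(0, 900)      # ids 1000, 1001 (300 + 500 = 800)
second = next_batch(1001, 900)  # ids 1002, 1003 (200 + 400 = 600)
```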

And - shock, horror, step back in amazement - that SQL works as-is across MySQL and Oracle!

Yay!

So I don't know about you, but after that I feel the need to go hit the climbing wall and have a recuperative pint or two afterwards.

Have a great Christmas everyone!




**This is a really useful skill to have, and it's not just applicable to Physics exams either. In fact, taken to extremes, it's perfectly possible, with a bit of cunning, to change a question about something you haven't revised, into a question about something you *have* revised....

I first realised the power of this at school, in English -
Q2.4 : Is Hamlet really mad? What is the purpose of his madness in the play and by the play? Discuss in not less than 3,000 words.

Answer : In any discussion concerning Hamlet, it is important to note that had he been Scottish, his situation would have been very similar to that of Macbeth....


:-)

SQL QOTW - SELECT all rows up to a cumulative total

Here's a little SQL test that arose here the other day. Let's say you have a DB table called documents, in which you have the following fields:

id
filename
bytesize
date_created
full_content

full_content is a BLOB, containing all the text extracted from whatever document the record represents.
bytesize is, cunningly enough, the size of the content in bytes.

You want to get all the documents, and do something to them - what you want to do is not important here, just that it involves the full_content. However, a moment of pondering will show the nasty bit of this problem... Let's say you have tens of millions of documents, and the average bytesize is around a meg. Clearly attempting to read the full table into memory is not a good idea (at least until we get terabytes of RAM in our servers).

A far better idea is to limit the maximum memory usage to some value, and only process up to X MB of documents in one batch. The next time round, you'll get the next X MB of docs, and so on.

We turned this problem over and over for a while, and we thought of a few ways of doing it, but they all involved multiple queries, and/or things like temporary tables... and, dammitall, it SHOULD be do-able in one query, dammit man! (cue much indignant harrumphing and twizzling of large mutton chop side whiskers, in a Victorian man-of-letters kind of way) Oh yeah, and here's the killer - it has to be portable across Oracle 10g and MySQL.

So the problem is this - given this table, can you construct a single query that will read all the documents up to a given maximum number of megabytes, that will run on Oracle 10g AND MySQL?

Well, after pondering this in several isolated attempts for a few days, I went back to it yesterday, and the answer just popped out straight away - and it's actually deceptively simple. I was going to just give you the answer, but I thought it might be a bit more fun to leave it as a challenge, so a big fat Brucie Bonus goes to the first person who gets it - I'll put the answer up here later on today, if no-one gets it in the meantime.