Friday, December 21, 2007

SQL QOTW - Solution!

To anyone who has been bashing their head against the desk with the SQL brain-buggerer I posted earlier, here is the solution I came up with. There may well be others, that may be more elegant, but here's the one that occurred to me...

OK, so you want to get all the documents up to a given cumulative total byte size.
A straightforward get-me-all-the-docs query looks like this:
SELECT id, filename, bytesize
FROM documents


We want to calculate the cumulative total size for all the records we've got so far, and then stop once we're about to pass the total. So the first problem is how do we work out the cumulative total?

Well, we're essentially dealing with an ordered list of records. We effectively want to work down the ordered list, calculating the cumulative total as we go.

So first lets impose an order on the list:

SELECT id, filename, bytesize
FROM documents
ORDER BY id

Now for each record, we want to work out the cumulative total. I had many thoughts on how to keep a running total, but I eventually realised that you don't actually need to. Following the old physics student's guiding principle of "if you find yourself with a problem you can't solve, rewrite it into a problem that you can solve"**, here's the trick - we don't actually need to calculate this as "x + (next x)" - as long as we use the same imposed order, we can get this with a rather cunning subquery:

SELECT id, filename, bytesize,
( SELECT SUM(bytesize)
FROM documents d2
WHERE d2.id <= documents.id
) AS cumulative_total
FROM documents
ORDER BY id


You see how it works? If we're working down an ordered list of ids-

1000
1001
1002
1003
1004
...
etc



- then keeping a running total is exactly equivalent to querying for the sum of all rows up-to-and-including the current row. So we can now move the subquery from the SELECT clause into the WHERE clause, and it gives us a sufficient criteria for accepting or rejecting rows:


SELECT id, filename, bytesize
FROM documents
WHERE ( SELECT SUM(bytesize)
FROM documents d2
WHERE d2.id <= documents.id
) < @max_cumulative_size * 1024 * 1024
ORDER BY id

( where @max_cumulative_size is the maximum total size in megabytes, obviously )

It all hinges on the ordering. The one proviso with this query is that, although you are free to order the results by whatever you choose, the ordering must be the same as the implicit ordering in the subquery.

And - shock, horror, step back in amazement - that sql works as-is across MySQL and Oracle!

Yay!

So I don't know about you, but after that I feel the need to go hit the climbing wall and have a recuperative pint or two afterwards.

Have a great Christmas everyone!




**This is a really useful skill to have, and it's not just applicable to Physics exams either. In fact, taken to extremes, it's perfectly possible, with a bit of cunning, to change a question about something you haven't revised, into a question about something you *have* revised....

I first realised the power of this at school, in English -
Q2.4 : Is Hamlet really mad? What is the purpose of his madness in the play and by the play? Discuss in not less than 3,000 words.

Answer : In any discussion concerning Hamlet, it is important to note that had he been Scottish, his situation would have been very similar to that of Macbeth....


:-)

1 comment:

Unknown said...

That's a pretty cool technique. Thanks for new weapon in the belt.