Thursday, June 28, 2007

A Four-Way Grudge Match: Oracle vs. MySQL vs. Views vs. Derived Tables, eek..!

I like Views. SQL Views are great. Unless, that is, you have to support certainly two (MySQL and Oracle) and ideally three (plus SQLServer) of the major DB systems with the same code. Yesterday I was forced into a dastardly hack that made me wince with it's ickiness, but was actually the most practical solution given the constraints.

The Situation

We have a view which combines a LEFT OUTER JOIN with an AVG and a GROUP BY, to "collapse" what may be many related records down in to one. The AVG gets an average link weighting, to give you an effective measure of how closely two items are related by all the various different linking mechanisms.

This view references a table - item_links - that is expected to grow very rapidly, to the size of many millions of records.

We INNER JOIN onto this view in application code, to get item records related to the "current" item.


The Problem

On Oracle, this view works really well. Nice and fast, at most a tenth of a second or so per query. Doing an EXPLAIN in Oracle's SQL Developer shows that the execution plan uses all the expected indexes. Luvvly jubbly, bosh, sorted, etc.

On MySQL, however, view support is much less mature. The query optimiser can only use the underlying indexes if the view is created with CREATE ALGORITHM=MERGE and the docs say:
"The MERGE algorithm requires a one-to-one relationship between the rows in the view and the rows in the underlying table. If this relationship does not hold, a temporary table must be used instead. Lack of a one-to-one relationship occurs if the view contains any of a number of constructs:
  • Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
  • DISTINCT
  • GROUP BY

...

... so instead, what MySQL does is select all rows into a temporary table, which is then used for executing the statement against.

Doing an EXPLAIN in MySQL shows that for the crucial operation of selecting the required rows from this potentially-huge table, there are no indexes used AT ALL. It is in fact selecting EVERY row into the temp table.


All of which leaves me - and, I must say, in blatant defiance of all known precedent - in the truly strange position of cursing MySQL whilst praising Oracle.....

Which worries me - I've never been here before.... :-)


The Solutions That Won't Work

  1. Get rid of the view and just put the sql into each query,
    i.e. change INNER JOIN vw_item_links to INNER JOIN item_links ON (blah) LEFT OUTER JOIN link_prefs ON (blah) etc etc
    Hmmm.... it's possible, but I would say it's a last resort, because that would mean that we'd then have to GROUP BY every single field in every query, not just the fields in the view. This can get majorly cumbersome, because although MySQL allows you to just group by just-enough-fields-to-get-uniqueness (and thus gets a yay for ease-of-use), Oracle insists on grouping by every single field that's not an aggregate (and just about gets a begrudging yay for insisting on standards compliance)

  2. Eliminate the view and the GROUP BY through using subqueries
    i.e. SELECT (item fields), (SELECT AVG(weight)  FROM item_links WHERE ...) AS weight, (SELECT other_item_id)
    This is a bit of a null option, unfortunately, as we'd still have to INNER JOIN onto item_links in the FROM clause to get the related records, so we're not really gaining anything, and in fact we'd be introducing more overhead through a correlated subquery executed for every row.

So what else can we try? Anyone? Ah yes, you boy, there at the back....


The Solution That Did Work

You can, in fact, use subqueries in the FROM clause to create DERIVED TABLES, like so:
SELECT (fields) 
FROM items
INNER JOIN ( your view definition SQL ) inline_vw_item_links
ON inline_vw_item_links.other_item_id = items.id
WHERE inline_vw_item_links.item_id = ...
ORDER BY inline_vw_item_links.weight DESC

in effect, putting your view SQL into the query as if it were a regular table

This does mean that you lose some of the nice encapsulation of the view, and if the view definition is complex it makes your queries look ugly, but at least it works across both DB platforms. You can also get round the ickiness problem by holding the view definition SQL in a variable :

SELECT (fields)
FROM items
INNER JOIN ( #getViewDefinitionSql()# ) inline_vw_item_links
ON inline_vw_item_links.other_item_id = items.id
WHERE inline_vw_item_links.item_id = ...
ORDER BY inline_vw_item_links.weight DESC


But hang on, there's one more subtlety - (and this is the icky bit)

An EXPLAIN on Oracle shows that this query still uses all the appropriate indexes, as it should. Smashin'.

But MySQL, although it's now recognising that there are indexes there to be used, is still selecting ALL rows from item_links. Which, to be fair, is what the query is *strictly* telling it to do.

So here's the icky hack:

We put the item_links WHERE clause restrictions into the inline view definition SQL, so that it looks like this:

SELECT (fields) 
FROM items
INNER JOIN (
SELECT AVG(weight) AS weight, other_item_id, (etc...)
FROM item_links LEFT OUTER JOIN (blah) ON (whatever)
WHERE item_links.item_id = #my_item_id# AND shared=1
) inline_vw_item_links
ON inline_vw_item_links.other_item_id = items.id
WHERE inline_vw_item_links.item_id = ...
ORDER BY inline_vw_item_links.weight DESC


which you can then prettify a bit for encapsulation like this:

SELECT (fields) 
FROM items
INNER JOIN (
#getViewDefinitionSql( "WHERE item_links.item_id = #my_item_id# AND shared=1" )
) inline_vw_item_links
ON inline_vw_item_links.other_item_id = items.id
WHERE inline_vw_item_links.item_id = ...
ORDER BY inline_vw_item_links.weight DESC


Now this makes me feel icky. It's (effectively) mixing your selection criteria in with your join criteria, which violates my rules for "good" sql, and goes against what I've always thought of as best practise - letting the DB engine's query optimiser work out the execution plan for itself. And it's just...wrong...

...But it works. An EXPLAIN shows that MySQL is now using the indexes that it should be using, and only selecting the rows that it needs from the potentially-many-millions-of-rows item_links tables. And the benefit is blatantly obvious in execution times, turning what was a several-second query execution time into a few tenths of a second. And Oracle is still happy. I suspect that it was previously optimising the view SQL into effectively this query anyway, under the hood. So I'm swallowing my SQL purist pomposity, and going with it. But if anyone can think of a more "proper" way of doing this, I'm all ears.

2 comments:

FixedXorBroken said...

you're smart. nice query.

Alistair Davidson said...

Hah - thanks, but i'm not that smart - just bloody minded :)