Friday, November 09, 2012

Optimizing Polymorphic Association Joins On STI Models

Lets' say you have a table of assigned_users.

Users can be assigned to several different things, so you make it a polymorphic relationship:
class User < ActiveRecord::Base
  has_many :assigned_users, :as=>:user
end

class AssignedUser < ActiveRecord::Base
  belongs_to :user
  belongs_to :model, :polymorphic=>true
end

class Client < ActiveRecord::Base
  has_many :assigned_users, :as=>:model
end

class Task < ActiveRecord::Base
  has_many :assigned_users, :as=>:model
end


...and so on. At this point you'll have the following fields in your assigned_users table:

id
user_id
model_type
model_id

OK, so this assigned_users table is going to need some indexes. Like any sensible techie, you think about the most common access patterns, and create some indexes to optimize those:

add_index :assigned_users, [:user_id]
add_index :assigned_users, [:model_type, :model_id]

Great, we're good to go. 

Things tick along nicely, your assigned_users table is getting past the 'tiny' stage and into the 'medium' stage of, say, 10,000 records, but we're cool with that,  because the joins have indexes. Right?

Right! Until....

(drum roll)

....one of the parent model tables gets refactored into several sub-classes, with Single Table Inheritance.
 
(badoom.. TSH!)

What happens to your access pattern now?

Well, let's say that you refactor the Task class to have several sub-classes:

class FooTask < Task
end

class BarTask < Task
end

class BazTask < Task
end

When you now try to do a join from the base class Task onto AssignedUsers, you'll get something like this:

(I'm using Ernie Miller's Squeel syntax here, it's nice and clear)

> Task.joins{assigned_users}.to_sql

SELECT `tasks`.* FROM `tasks` INNER JOIN `assigned_users` ON `assigned_users`.`model_id` = `tasks`.`id` AND `assigned_users`.`model_type` IN ('Task', 'FooTask', 'BarTask', 'BazTask')

....and your load times go through the roof, because all of a sudden, that IN(...) clause means that your combined index on model_type and model_id can't be used, so your database resorts to a full table scan. On a 10,000-row table...?

Well, as Egon Spengler would say - "...it would be bad."

Now, we can tackle this several ways:
  1. Hack ActiveRecord to generate multiple ORs on the join rather than an IN(...), or...
  2. Write a scope on the parent model that constructs the join manually, using ORs, or...
  3. Look for a quick and dirty solution
The first 2 sound suspiciously like rabbit holes down which we could quite easily disappear for a while... I like 3 - let's go with 3.

Luckily we can achieve big gains pretty easily, by noting that we could cut down the search space dramatically if we can just get the query optimizer to use an index on the model_id - even if every other class in your model is able to have assigned_users, the maximum number of collisions on model_id you're going to get is the number of classes you have. 

So all we need to do is add an index just on model_id -

add_index :assigned_users, [:model_id]

- and the query optimizer can now use that index in the join. Now it only needs to search through the few records with model_id = N to find those with model_type IN ('Task', 'FooTask', 'BarTask', 'BazTask'), rather than searching through every single record in the table.

Not a perfect solution, admittedly, but a pragmatic one - and one that brought down page load times by 80-90% for us. You're welcome :)