Friday, January 04, 2013

MySQL "Row size too large" when saving many text fields

Using MySQL? InnoDB table type? Got a table with several TEXT or BLOB fields? Getting a "Row size too large" error when saving a row with lots of text in those TEXT fields? Confused, because you thought the whole point of TEXT fields was that they stored the text off-table? Well, read on....

disclaimer: if you're storing many text fields in a relational database table, you might want to look again at whether that's the right place and method for storing that data - if it looks like a document and quacks like a document, then hey, maybe a document store would be more appropriate? But that's a whole other topic...

The detail is in the MySQL docs 14.4.5. How InnoDB Stores Variable-Length Columns , but I'll give a quick summary here. It hinges on the file format of your InnoDB engine.

As the docs say:
Early versions of InnoDB used an unnamed file format (now called Antelope) for database files. With that format, tables were defined with ROW_FORMAT=COMPACT (or ROW_FORMAT=REDUNDANT) and InnoDB stored up to the first 768 bytes of variable-length columns (such as BLOB and VARCHAR) in the index record within the B-tree node, with the remainder stored on the overflow pages.
(emphasis mine)

So with the Antelope file format, it's perfectly possible to store a single TEXT field up to 2GB without encountering the "Row size too large" error, but it's not possible to store 10 x 1k TEXT fields - because InnoDB will store the first 768 bytes of each TEXT field on the record itself, and exceed the row size limit of 8192 bytes.

The solution is straightforward (well, mostly :) - change your innodb_file_format variable to Barracuda, and alter the table to use the DYNAMIC or COMPRESSED row_format. This will store the entire contents of the TEXT fields "off-page" -

SET GLOBAL innodb_file_format=Barracuda; SET GLOBAL innodb_file_per_table=ON; ALTER TABLE (your table) ROW_FORMAT=COMPRESSED;

- and you're good to go.

The one complication is if you're running on Amazon RDS, in which case you'll get an error saying you don't have SUPER privileges. If that's the case, you just need to set the innodb_file_format parameter in your RDS instance parameter group, and allow a few seconds for it to propagate to all your instances.

Friday, November 09, 2012

Optimizing Polymorphic Association Joins On STI Models

Lets' say you have a table of assigned_users.

Users can be assigned to several different things, so you make it a polymorphic relationship:
class User < ActiveRecord::Base
  has_many :assigned_users, :as=>:user
end

class AssignedUser < ActiveRecord::Base
  belongs_to :user
  belongs_to :model, :polymorphic=>true
end

class Client < ActiveRecord::Base
  has_many :assigned_users, :as=>:model
end

class Task < ActiveRecord::Base
  has_many :assigned_users, :as=>:model
end


...and so on. At this point you'll have the following fields in your assigned_users table:

id
user_id
model_type
model_id

OK, so this assigned_users table is going to need some indexes. Like any sensible techie, you think about the most common access patterns, and create some indexes to optimize those:

add_index :assigned_users, [:user_id]
add_index :assigned_users, [:model_type, :model_id]

Great, we're good to go. 

Things tick along nicely, your assigned_users table is getting past the 'tiny' stage and into the 'medium' stage of, say, 10,000 records, but we're cool with that,  because the joins have indexes. Right?

Right! Until....

(drum roll)

....one of the parent model tables gets refactored into several sub-classes, with Single Table Inheritance.
 
(badoom.. TSH!)

What happens to your access pattern now?

Well, let's say that you refactor the Task class to have several sub-classes:

class FooTask < Task
end

class BarTask < Task
end

class BazTask < Task
end

When you now try to do a join from the base class Task onto AssignedUsers, you'll get something like this:

(I'm using Ernie Miller's Squeel syntax here, it's nice and clear)

> Task.joins{assigned_users}.to_sql

SELECT `tasks`.* FROM `tasks` INNER JOIN `assigned_users` ON `assigned_users`.`model_id` = `tasks`.`id` AND `assigned_users`.`model_type` IN ('Task', 'FooTask', 'BarTask', 'BazTask')

....and your load times go through the roof, because all of a sudden, that IN(...) clause means that your combined index on model_type and model_id can't be used, so your database resorts to a full table scan. On a 10,000-row table...?

Well, as Egon Spengler would say - "...it would be bad."

Now, we can tackle this several ways:
  1. Hack ActiveRecord to generate multiple ORs on the join rather than an IN(...), or...
  2. Write a scope on the parent model that constructs the join manually, using ORs, or...
  3. Look for a quick and dirty solution
The first 2 sound suspiciously like rabbit holes down which we could quite easily disappear for a while... I like 3 - let's go with 3.

Luckily we can achieve big gains pretty easily, by noting that we could cut down the search space dramatically if we can just get the query optimizer to use an index on the model_id - even if every other class in your model is able to have assigned_users, the maximum number of collisions on model_id you're going to get is the number of classes you have. 

So all we need to do is add an index just on model_id -

add_index :assigned_users, [:model_id]

- and the query optimizer can now use that index in the join. Now it only needs to search through the few records with model_id = N to find those with model_type IN ('Task', 'FooTask', 'BarTask', 'BazTask'), rather than searching through every single record in the table.

Not a perfect solution, admittedly, but a pragmatic one - and one that brought down page load times by 80-90% for us. You're welcome :)