Showing posts with label activerecord.

Friday, November 09, 2012

Optimizing Polymorphic Association Joins On STI Models

Let's say you have a table of assigned_users.

Users can be assigned to several different things, so you make it a polymorphic relationship:
class User < ActiveRecord::Base
  has_many :assigned_users
end

class AssignedUser < ActiveRecord::Base
  belongs_to :user
  belongs_to :model, :polymorphic=>true
end

class Client < ActiveRecord::Base
  has_many :assigned_users, :as=>:model
end

class Task < ActiveRecord::Base
  has_many :assigned_users, :as=>:model
end


...and so on. At this point you'll have the following fields in your assigned_users table:

id
user_id
model_type
model_id

OK, so this assigned_users table is going to need some indexes. Like any sensible techie, you think about the most common access patterns, and create some indexes to optimize those:

add_index :assigned_users, [:user_id]
add_index :assigned_users, [:model_type, :model_id]

Great, we're good to go. 

Things tick along nicely; your assigned_users table gets past the 'tiny' stage and into the 'medium' stage of, say, 10,000 records, but we're cool with that, because the joins have indexes. Right?

Right! Until....

(drum roll)

....one of the parent model tables gets refactored into several sub-classes, with Single Table Inheritance.
 
(badoom.. TSH!)

What happens to your access pattern now?

Well, let's say that you refactor the Task class to have several sub-classes:

class FooTask < Task
end

class BarTask < Task
end

class BazTask < Task
end

When you now try to do a join from the base class Task onto AssignedUsers, you'll get something like this:

(I'm using Ernie Miller's Squeel syntax here, it's nice and clear)

> Task.joins{assigned_users}.to_sql

SELECT `tasks`.* FROM `tasks` INNER JOIN `assigned_users` ON `assigned_users`.`model_id` = `tasks`.`id` AND `assigned_users`.`model_type` IN ('Task', 'FooTask', 'BarTask', 'BazTask')

....and your load times go through the roof, because all of a sudden, that IN(...) clause means that your combined index on model_type and model_id can't be used, so your database resorts to a full table scan. On a 10,000-row table...?

Well, as Egon Spengler would say - "...it would be bad."

Now, we can tackle this several ways:
  1. Hack ActiveRecord to generate multiple ORs on the join rather than an IN(...), or...
  2. Write a scope on the parent model that constructs the join manually, using ORs, or...
  3. Look for a quick and dirty solution
The first two sound suspiciously like rabbit holes down which we could quite easily disappear for a while... I like 3 - let's go with 3.

Luckily, we can achieve big gains pretty easily. The key observation is that we can cut the search space dramatically by getting the query optimizer to use an index on model_id alone: even if every other class in your app can have assigned_users, the maximum number of collisions you'll get on any given model_id is the number of classes you have.

So all we need to do is add an index just on model_id -

add_index :assigned_users, [:model_id]

- and the query optimizer can now use that index in the join. Now it only needs to search through the few records with model_id = N to find those with model_type IN ('Task', 'FooTask', 'BarTask', 'BazTask'), rather than searching through every single record in the table.
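To make the argument concrete, here's a toy plain-Ruby sketch of the two-step lookup (all the row data here is invented for illustration): the model_id index narrows the scan to at most one row per class, and the model_type IN (...) check then runs over that handful instead of the whole table.

```ruby
# Invented sample rows from an assigned_users-style table.
rows = [
  { model_type: "Task",    model_id: 7 },
  { model_type: "Client",  model_id: 7 },  # same id, different class
  { model_type: "FooTask", model_id: 7 },
  { model_type: "Task",    model_id: 8 },
  # ...imagine ~10,000 more rows here...
]

# Step 1: the model_id index narrows the scan to a handful of rows.
candidates = rows.select { |r| r[:model_id] == 7 }

# Step 2: the model_type IN (...) filter only runs over those few rows.
task_types = %w[Task FooTask BarTask BazTask]
matches = candidates.select { |r| task_types.include?(r[:model_type]) }

puts candidates.size  # at most one row per class, not the whole table
puts matches.size
```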

Not a perfect solution, admittedly, but a pragmatic one - and one that brought down page load times by 80-90% for us. You're welcome :) 

Monday, September 03, 2012

Getting Accurate Count on Grouped Arel Relations with Squeel

Let's say you have a scope on a model that joins onto potentially several related models:


class Post < ActiveRecord::Base
  has_many :comments

  scope :commented_on_by, lambda{ |user|
    joins(:comments).where( :comments=>{:author_id => user.id} )
  }
end

( OK, so this is a contrived example, and there are better ways of doing this - but the real example which triggered this post is way too complex to go into in detail, and we're all familiar with blogs, posts and comments, right? )

So if you call Post.commented_on_by(user) you'll get all the posts which the given user has commented on. All well and good. BUT - what if the same user comments multiple times on the same post? You'll then get the same post repeated as many times in your resultset.

So the logical thing to do here is to introduce a group by clause, to make sure you only get each post once, right?


class Post < ActiveRecord::Base
  has_many :comments

  scope :commented_on_by, lambda{ |user|
    joins(:comments).where( :comments=>{:author_id => user.id} ).group("posts.id")
  }
end

OK, that's that problem solved. Except.... what if you now want to run an aggregate query over that result set? Say, you want to count the number of posts the user has commented on?


> Post.commented_on_by(user).count
=> {1=>2, 2=>1, 3=>2}

Er... whut? We get a Hash?

Well yes - because we've grouped the query, ActiveRecord will group the count results as well. It's saying "post id 1 has two comments by this user, post id 2 has one, and post id 3 has two". All of which is entirely correct, but a bit annoying if you just want to get the number of posts. OK, you could do Post.commented_on_by(user).count.size, but that kind of defeats the intended purpose of a count call, right?
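The shape of that Hash is easy to reproduce in plain Ruby (the post ids below are made up): grouping first and then counting gives you one count per group, whereas what we actually want is the count of distinct groups.

```ruby
# Invented data: the post_id of each comment this user has left.
comment_post_ids = [1, 1, 2, 3, 3]

# What a grouped COUNT gives you: one count per post id.
grouped = comment_post_ids.group_by { |id| id }
                          .transform_values(&:size)
# => {1=>2, 2=>1, 3=>2}

# What we actually wanted: the number of distinct posts.
distinct_count = comment_post_ids.uniq.size
# => 3
```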

The fundamental problem is the underlying structure of your query. The group clause collapses multiple rows into one per post, and the database computes COUNT per group - so you get the number of rows in each group, rather than the number of groups.

So, how do we get it to treat these multiple rows per post as if they were one? IN() is your friend!

The good news is, you can get round this using Ernie Miller's rather excellent gem Squeel. We've been using Squeel a lot at iTrigga, as it helps us keep on top of syntax for some pretty complex queries.

One of the many nice things about Squeel is that it lets you supply ActiveRecord::Relation objects as a value for a predicate - so if you can restructure your query to use an IN() clause, you can supply your already-built Relation object as the value to it.

In this case, you can rewrite:


> Post.commented_on_by(user).count

as


> Post.where{ id.in(Post.commented_on_by(user).select(:id)) }.count

and the result should be the single count you were after.

Wednesday, August 12, 2009

OutOfMemoryError in ActiveRecord-JDBC on INSERT SELECT

During some scale testing the other day, we came across this unusual / mildly amusing error in a database-bound command that just funnels INSERT SELECT statements down the ActiveRecord JDBC driver:


java/util/Arrays.java:2734:in `copyOf': java.lang.OutOfMemoryError: Java heap space (NativeException)
from java/util/ArrayList.java:167:in `ensureCapacity'
from java/util/ArrayList.java:351:in `add'
from com/mysql/jdbc/StatementImpl.java:1863:in `getGeneratedKeysInternal'
from com/mysql/jdbc/StatementImpl.java:1818:in `getGeneratedKeys'
from org/apache/commons/dbcp/DelegatingStatement.java:318:in `getGeneratedKeys'
from jdbc_adapter/JdbcAdapterInternalService.java:668:in `call'
from jdbc_adapter/JdbcAdapterInternalService.java:241:in `withConnectionAndRetry'
from jdbc_adapter/JdbcAdapterInternalService.java:662:in `execute_insert'
... 25 levels...


Now, there are a couple of things here worth pointing out.

  1. I REALLY LOVE the fact that it blew heap space in a method called ensureCapacity. That makes me smile.
  2. Why is it calling getGeneratedKeys() for an INSERT SELECT?


The getGeneratedKeys() method retrieves all the primary keys that are generated when you execute an INSERT statement. Fair enough - BUT the issue here is that we'd specifically structured the process and the SQL statements involved so as to be done with INSERT SELECTS, and hence avoid great chunks of data being transferred backwards and forwards between the app and the database.

It turns out that the ActiveRecord JDBC adapter is doing this :
(lib/active_record/connection_adapters/JdbcAdapterInternalService.java)

@JRubyMethod(name = "execute_insert", required = 1)
public static IRubyObject execute_insert(final IRubyObject recv, final IRubyObject sql) throws SQLException {
    return withConnectionAndRetry(recv, new SQLBlock() {
        public IRubyObject call(Connection c) throws SQLException {
            Statement stmt = null;
            try {
                stmt = c.createStatement();
                stmt.executeUpdate(rubyApi.convertToRubyString(sql).getUnicodeValue(), Statement.RETURN_GENERATED_KEYS);
                return unmarshal_id_result(recv.getRuntime(), stmt.getGeneratedKeys());
            } finally {
                if (null != stmt) {
                    try {
                        stmt.close();
                    } catch (Exception e) {
                    }
                }
            }
        }
    });
}


...in other words, explicitly telling the driver to return all the generated keys.
Hmm, OK, can we get round this by NOT calling the execute_insert method, and instead calling a raw execute method that doesn't return all the keys?

Well, no, unfortunately, because it also turns out that the ruby code is doing this:
(activerecord-jdbc-adapter-0.9/lib/active_record/connection_adapters/jdbc_adapter.rb)

# we need to do it this way, to allow Rails stupid tests to always work
# even if we define a new execute method. Instead of mixing in a new
# execute, an _execute should be mixed in.
def _execute(sql, name = nil)
  if JdbcConnection::select?(sql)
    @connection.execute_query(sql)
  elsif JdbcConnection::insert?(sql)
    @connection.execute_insert(sql)
  else
    @connection.execute_update(sql)
  end
end


...and the JdbcConnection::insert? method is detecting if something's an insert by doing this:
(JdbcAdapterInternalService.java again)

@JRubyMethod(name = "insert?", required = 1, meta = true)
public static IRubyObject insert_p(IRubyObject recv, IRubyObject _sql) {
    ByteList bl = rubyApi.convertToRubyString(_sql).getByteList();

    int p = bl.begin;
    int pend = p + bl.realSize;

    p = whitespace(p, pend, bl);

    if(pend - p >= 6) {
        switch(bl.bytes[p++]) {
        case 'i':
        case 'I':
            switch(bl.bytes[p++]) {
            case 'n':
            case 'N':
                switch(bl.bytes[p++]) {
                case 's':
                case 'S':
                    switch(bl.bytes[p++]) {
                    case 'e':
                    case 'E':
                        switch(bl.bytes[p++]) {
                        case 'r':
                        case 'R':
                            switch(bl.bytes[p++]) {
                            case 't':
                            case 'T':
                                return recv.getRuntime().getTrue();
                            }
                        }
                    }
                }
            }
        }
    }
    return recv.getRuntime().getFalse();
}


...in other words, if the SQL statement begins with the word INSERT (after any leading whitespace), then it's an INSERT, and should be executed with an execute_insert call.
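In Ruby terms, that whole unrolled switch boils down to something like this (my paraphrase for illustration, not code from the adapter):

```ruby
# Rough Ruby equivalent of the adapter's insert? check: true if the
# statement's first non-whitespace characters spell INSERT,
# case-insensitively.
def insert_statement?(sql)
  !!(sql =~ /\A\s*insert/i)
end

insert_statement?("  INSERT INTO foo SELECT * FROM bar")  # => true
insert_statement?("UPDATE foo SET x = 1")                 # => false
```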

So, looks like we're a bit knacked here. There are two possible solutions:

  1. The proper solution - fix the AR JDBC adapter (and, arguably, the MySQL connector/J as well, to stop it blowing heap space), submit a patch, wait for it to be accepted and make it into the next release.

  2. Or, the pragmatic solution - rewrite the SQL-heavy command as a stored procedure and just call it with an EXECUTE and sod the DHH dogma.


We went with option 2 :)

Big kudos to D. R. MacIver for tracking down the source of the fail in double-quick time.

Friday, May 22, 2009

This should not be possible - transactional fail?

We're having a bizarre problem with MySQL 5.0.32 on Debian in a highly-concurrent environment, using the ActiveRecord JDBC adapter. We've also seen it on 5.0.51 on Ubuntu.

There's a particular sequence of 3 queries, all executed from within an ActiveRecord::Base.transaction { } block:


// Statement 1:
INSERT INTO (InnoDB table 1)
(type, ...loads of other fields...)
SELECT 'ProtoMessageItem', ...other fields...
FROM (MyISAM table 1) LEFT OUTER JOIN (InnoDB table 2)
WHERE (anti-duplication clause)
GROUP BY (synthetic identifier on MyISAM table 1, to be a unique key on InnoDB table 1)

// Statement 2:
INSERT INTO (InnoDB table 2)
SELECT (fields)
FROM (MyISAM table 1) LEFT OUTER JOIN (InnoDB table 3) INNER JOIN (InnoDB table 1)
WHERE (anti-duplication clause)

// Statement 3:
UPDATE (InnoDB table 1)
SET type = 'MessageItem'
WHERE type = 'ProtoMessageItem'


Now, all of this takes place within an explicit transaction, and works well - this approach provides huge performance gains over the more simple and obvious methods.

The problem is that just very very occasionally, it breaks. We don't know how or why, but once in a blue moon, statement 2 throws a DUPLICATE KEY exception, and the transaction DOESN'T get rolled back. Statement 1 still gets COMMITTED. We can see ProtoMessageItems in the DB, and this blocks up the pipeline and stops further MessageItems from being created.

ProtoMessageItems should NEVER be visible - they're only ever created and then updated to full MessageItems in the same 3-statement transaction. Seeing them from outside that transaction is like seeing a naked singularity - it just shouldn't be possible.

This is all kinds of wrong in a very wrong way, as it means one of the following - either:

  1. The DB is getting the correct BEGIN/COMMIT/ROLLBACK statements, but transactional atomicity is not holding for some reason

  2. The JDBC driver is not sending the correct ROLLBACK statements, but thinks it's in autocommit mode

  3. There's an obscure concurrency edge case somewhere with the ActiveRecord JDBC adapter, that means once in a while, for some reason, it starts that transaction in autocommit mode.


All of these possibilities are worrying. We've eliminated the possibility of it being some quirk of the particular data that causes this problem. We've turned on MySQL statement logging and watched it for hours, and as far as we can see, the correct ROLLBACK statements seem to be issued every time.

Has anyone encountered a similar problem? Any known issues with INSERT/SELECT on combinations of (transactional) InnoDB and (non-transactional) MyISAM tables?

Tuesday, July 31, 2007

Rails Migration Version Headache - D'oh!

I've been having real headbanging frustrations with Rails ActiveRecord migrations not picking up the correct target version. On the face of it, this should be about as simple as you can get -

- To determine the current version, the migrate task looks in your database for a table called "schema_info", and reads the value of the "version" field from it. If the table is not there, or is empty, then version is nil, and nil.to_i = 0.

- To determine the target version, the task gets the highest-numbered file from the db/migrate directory and parses an integer out of the filename.

OR

- You can override the target version by passing in a command-line parameter like so:
rake db:migrate VERSION=XXX

So, as the famous "outraged of Tunbridge Wells" might say, why oh why oh why was my migration task always reverting back to previous versions, despite having migration files numbered sequentially right the way up to version 19 (and counting) ?

In vain did I bang my head against the desk, curse migrations and all they stood for, implore the heavens to rain down brimstone upon the head of DHH and all his acolytes, and impugn the parentage of my PC, Windows, Oracle and databases in general....

On a trawl through the Railties source code, it turns out there's a third, more surreptitious way to tell the migrate task which version you'd like to migrate to - if there's an environment variable called VERSION, it will take the value from that. And when I looked at my System Variables in XP, what did I find?

VERSION=3.5.0

Ah.


OK.


My bad...
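For completeness, the version-resolution logic amounts to something like this (my paraphrase of the rake task's behaviour, not the actual Railties code) - which also explains exactly why everything kept reverting to version 3:

```ruby
# Paraphrased sketch: an ENV["VERSION"] value silently wins over the
# filename scan, and String#to_i only reads the leading digits.
def target_version(env)
  env["VERSION"] ? env["VERSION"].to_i : nil
end

target_version("VERSION" => "3.5.0")  # => 3 -- hence the mystery rollbacks
target_version({})                    # => nil, i.e. migrate to the latest
```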