Instant Badger: 2012

Friday, November 09, 2012

Optimizing Polymorphic Association Joins On STI Models

Lets' say you have a table of assigned_users.

Users can be assigned to several different things, so you make it a polymorphic relationship:

class User < ActiveRecord::Base
  has_many :assigned_users, :as=>:user
end

class AssignedUser < ActiveRecord::Base
  belongs_to :user
  belongs_to :model, :polymorphic=>true
end

class Client < ActiveRecord::Base
  has_many :assigned_users, :as=>:model
end

class Task < ActiveRecord::Base
  has_many :assigned_users, :as=>:model
end

...and so on. At this point you'll have the following fields in your assigned_users table:

user_id

model_type

model_id

OK, so this assigned_users table is going to need some indexes. Like any sensible techie, you think about the most common access patterns, and create some indexes to optimize those:

add_index :assigned_users, [:user_id]

add_index :assigned_users, [:model_type, :model_id]

Great, we're good to go.

Things tick along nicely, your assigned_users table is getting past the 'tiny' stage and into the 'medium' stage of, say, 10,000 records, but we're cool with that, because the joins have indexes. Right?

Right! Until....

(drum roll)

....one of the parent model tables gets refactored into several sub-classes, with Single Table Inheritance.

(badoom.. TSH!)

What happens to your access pattern now?

Well, let's say that you refactor the Task class to have several sub-classes:

class FooTask < Task

end

class BarTask < Task

end

class BazTask < Task

end

When you now try to do a join from the base class Task onto AssignedUsers, you'll get something like this:

(I'm using Ernie Miller's Squeel syntax here, it's nice and clear)

> Task.joins{assigned_users}.to_sql

SELECT `tasks`.* FROM `tasks` INNER JOIN `assigned_users` ON `assigned_users`.`model_id` = `tasks`.`id` AND `assigned_users`.`model_type` IN ('Task', 'FooTask', 'BarTask', 'BazTask')

....and your load times go through the roof, because all of a sudden, that IN(...) clause means that your combined index on model_type and model_id can't be used, so your database resorts to a full table scan. On a 10,000-row table...?

Well, as Egon Spengler would say - "...it would be bad."

Now, we can tackle this several ways:

Hack ActiveRecord to generate multiple ORs on the join rather than an IN(...), or...
Write a scope on the parent model that constructs the join manually, using ORs, or...
Look for a quick and dirty solution

The first 2 sound suspiciously like rabbit holes down which we could quite easily disappear for a while... I like 3 - let's go with 3.

Luckily we can achieve big gains pretty easily, by noting that we could cut down the search space dramatically if we can just get the query optimizer to use an index on the model_id - even if every other class in your model is able to have assigned_users, the maximum number of collisions on model_id you're going to get is the number of classes you have.

So all we need to do is add an index just on model_id -

add_index :assigned_users, [:model_id]

- and the query optimizer can now use that index in the join. Now it only needs to search through the few records with model_id = N to find those with model_type IN ('Task', 'FooTask', 'BarTask', 'BazTask'), rather than searching through every single record in the table.

Not a perfect solution, admittedly, but a pragmatic one - and one that brought down page load times by 80-90% for us. You're welcome :)

Monday, September 24, 2012

Implementing Agile - When User Stories Meet Ticketing Systems

There's lots of "Introduction to Agile principles"-type articles around the blogosphere, but not many I've seen that actually give practical advice down to the level of the real mundanities like ticket-logging. But these things are important - if you're not careful, your ticketing system can end up like an elephants graveyard, a confusing, mournful desolate place full of orphaned tickets for which no-one can remember what they're doing there, whether they're still needed, or even why they were logged in the first place.

So I thought I'd share the way we've been doing it. I'm going to concentrate here on feature requests only, as bug-reporting is a whole separate issue. The system we've been trying recently at iTrigga has helped us keep on top of our estimates and project mgmt, as well as our daily tasks. So here goes:

1) We agree a formal user story with the requesting person

e.g. don't just record "Ability to get a second opinion on an article", go the extra small step of wording in the form

"As a (type of user) I want to (feature) so that (benefit)"

in this example, it would be "As an editor, I want to get a second opinion on an article so that when I'm editing an article that requires specialist technical knowledge I don't have, I can get it checked by a domain expert"

Getting the level of detail right can take a little bit of getting used to, but is well worth it in the long run. The "...so that (benefit)" may sound trivial or obvious, but it's absolutely vital when you refer back to lists of tickets long after the initial conversation has faded from memory.

We've also found it very useful to agree this with the requesting person. It can be quite trying to get the rest of the business to adopt a standard format for this - the push-back is almost inevitable ("You put it into your system however you like, I don't have time for this!") but you can use it as a double-check ("..so let me just make sure I've got this straight - is it fair to say As a sales manager, you want a report of job progress so that you can estimate your booked revenue for this month?") and if you keep repeating the same format back to them, gradually they'll start using it themselves :)

2) We log a ticket for each user story

The ticket name is the "As a (type of user) I want to (feature)", the first line of the description is the "..so that (benefit)."

After that, we can add however much more context and supporting information we need, but the title and first line always follow that format so that we can always see:

what was requested
who it was for, and
why they wanted it

3) We log sub-tasks for each change required to fulfill the user story

Let's take a quick whizz through the example above - what do we need to add or change in order to satisfy the "ability to get a second opinion on an article" ? Well, we'll need to:

add a button marked "Get a second opinion" on the editing page
add a way of specifying (or automatically choosing) who to get the second opinion from
send an email to the second opinion guy telling them that their input is needed
record their opinion in some way (another form? comments?)
an email sending their opinion back to the editor who originally requested the second opinion

OK, at that level, we can log sub-tasks and probably start estimating time requirements for these tasks.

( Or can we? Well, there's a couple there that need more detail - 2 and 4. So there'll be some discussion around those, and maybe even another level of sub-task below those two, or maybe a separate user story. But let's assume for the purposes of this post that we can move on.)

4) Each sub-task is logged in the form of RSpec examples

For instance, sub-task number 2 above should be logged as something like this:

"when a second opinion is requested, it should send a SecondOpinionRequested email to the chosen domain expert"

- which translates nicely into a functional spec:


describe "when a second opinion is requested" do
  before do
    # set up your situation
  end

  it "generates a message" do
    # e.g. something like this:
    expect{ subject.save }.to change( Message, :count ).by(1)
  end

  describe "the generated message" do
    let(:msg){ Message.last }

    it "is a SecondOpinionRequested mail" do
      # your test here
    end 

    it "goes to the chosen recipient" do
      # your test here
    end
  end
end

So this gives us a top-level task for the user story, and a set of RSpec examples for the changes needed to fulfil the user story. This is really handy when you come back to a half-complete job later on and want to check how much of it is still left to do.

We use Redmine for our ticketing, and there are a couple of extra plusses when you do it this way in Redmine:

estimates for the sub-tasks propagate up to the top-level task (the user story)
you can't complete a parent task until the sub-tasks are complete
a parent task can't be higher-priority than its lowest-priority sub-task
if you complete each sub-task as you get the spec to pass, you can easily see from the parent task how much is left to do (see figure)

5) Each top-level user story gets its own git branch

...and the branch name is (ticket #)-brief-feature-description.

It's tempting to just dive in and start coding, but a little discipline here pays off big dividends in the long run. I've lost count of the number of times I've had to break off even ostensibly five-minute jobs for any of the following:

You're halfway through the ticket when the CEO tells you to drop everything and work on X instead
You realise that you can't do this until ticket Y is done, and that guy's away
You have to break off and do an urgent fix
You have to look at an issue with someone else's unreleased code

etc etc etc. Whatever - remember this:

in Git, branches are cheap - there's no reason NOT to use them!

Putting the ticket number at the start of the branch name helps keep things in a sensible order when you list the branches, and keeping the feature description in there means you don't have to keep referring back to your ticketing system to remember which is which.

It also helps when trying to trace exactly which feature got merged into what, when.

That's probably enough for now - I'm sure there are many alternative systems, this is just one that works for us. All suggestions and comments welcome!

Monday, September 03, 2012

Getting Accurate Count on Grouped Arel Relations with Squeel

Let's say you have a scope on a model that joins onto potentially several related models:


class Post < ActiveRecord::Base
  has_many :comments

  scope :commented_on_by, lambda{ |user|
    .joins(:comments).where( :comments=>{:author_id => user.id} )
  } 
end

( OK, so this is a contrived artificial example, there are better ways of doing this - but the real example which triggered this post is way too complex to go into in detail, and we're all familiar with blogs, posts and comments, right? )

So if you call Post.commented_on_by(user) you'll get all the posts which the given user has commented on. All well and good. BUT - what if the same user comments multiple times on the same post? You'll then get the same post repeated as many times in your resultset.

So the logical thing to do here is to introduce a group by clause, to make sure you only get each post once, right?


class Post < ActiveRecord::Base
  has_many :comments

  scope :commented_on_by, lambda{ |user|
    .joins(:comments).where( :comments=>{:author_id => user.id} ).group(:id)
  } 
end

OK, that's that problem solved. Except.... what if you now want to run an aggregate query over that result set? Say, you want to count the number of posts the user has commented on?


> Post.commented_on_by(user).count
> {1=>2, 2=>1, 3=>2}

Er... whut? We get a Hash?

Well yes - because we've grouped the query, ActiveRecord will group the count results as well. It's saying "post id 1 has two comments by this user, post id 2 has one, and post id 3 has two". All of which is entirely correct, but a bit annoying if you just want to get the number of posts. OK, you could do Post.commented_on_by(user).count.size, but that kind of defeats the intended purpose of a count call, right?

The fundamental problem is the underlying structure of your query. You're essentially collapsing multiple rows into one with the group clause, and your SQL parser interprets it as "get me the count of rows for each row BEFORE it's collapsed" rather than after.

So, how do we get it to treat these multiple rows per-post as if they were one? IN() is your friend!

The good news is, you can get round this using Ernie Miller's rather excellent gem Squeel. We've been using Squeel a lot at iTrigga, as it helps us keep on top of syntax for some pretty complex queries.

One of the many nice things about Squeel is that it lets you supply ActiveRecord::Relation objects as a value for a predicate - so if you can restructure your query to use an IN() clause, you can supply your already-built Relation object as the value to it.

In this case, you can rewrite:


> Post.commented_on_by(user).count


> Post.where{ id.in(Post.commented_on_by(user).select(:id)) }.count

and the result should be correct

Thursday, May 24, 2012

SEO Spammer Trolling

I recently got this SEO Spam mail:

From:  Katie 
Subject:  Web Proposal
Date:  24 May 2012 13:07:08 GMT+01:00
To:  iTrigga Support 

Hi,

I hope you're doing good!

We are suffering from your site: www.itrigga.com saw your website is not ranked on any search engines.

If you are interested we want to increase the number of visitors to your website, it is important that you have a top search engine position.

Our search engine optimization experts will run a ranking report showing you exactly where your website currently stands in all the major search engines. Then we will email you our analysis report along with the recommendations of how we can increase your ranking, and improve your websites traffic dramatically!

We strictly work on performance basis and can assure you of getting quality links with a proper reporting format for your site as well.

We wish you the best of luck and looking forward to a long and healthy business relationship with you and your company.

Please do let me know if you have any questions.

Kind Regards,

Name: -Katie
Post:-Business Development Manager
Reply me:- Katie@onlinewebfurnisher.com

Note: Though this is not an automated email, we are sending these emails to all those people whom we find eligible of using our services. To unsubscribe from future mails (i.e., to ensure that we do not contact you again for this matter), please send a blank email at removeallmailsplease@gmail.com

Being in a slightly warped & mischievous mood, having just this morning received tickets to see Faith No More and then spent the next hour listening to them on Spotify, I thought I'd take a look at onlinewebfurnisher.com:

Oh dear. The search engine optimization experts evidently had a bit of a problem with their own website. No content, for a start, and everyone knows that the ultimate SEO technique is to have good content, right?. So I thought I'd send her a reply:

From:  Al Davidson 
Subject:  Re: Web Proposal
Date:  24 May 2012 13:27:39 GMT+01:00
To:  Katie 

Hi,

I'm good thanks, I hope you're doing good!

We are suffering from your site: www.onlinewebfurnisher.com saw your website does not have any content on it. (see screenshot)

If you are interested we want to increase the number of visitors to your website, it is important for Search Engine Optimization that you have good (or at the very least, ANY) content!

Our content experts will run an analysis showing you exactly what content strategy will work best for your business needs. Then we will email you our propsal along with the recommendations of how we can increase your ranking, and improve your websites traffic dramatically!

We can assure you of getting quality content with a proper format for your site as well.

We wish you the best of luck and looking forward to a long and healthy business relationship with you and your company.

Please do let me know if you have any questions.

Kind Regards,

Name: -Al
Post:- CTO
Note: Though this is not an automated email, we are sending these emails to all those people whom we find eligible of using our services. To unsubscribe from future mails (i.e., to ensure that we do not contact you again for this matter), please send a blank email at removeallmailsplease@itrigga.com.

-- 
Al Davidson
CTO

Hyuk hyuk hyuk.... we shall see what response I get

Monday, February 06, 2012

Errno::EPIPE: Broken pipe when accessing S3 ?

Quick tech tip - if you're trying to access Amazon S3 from Ruby, even with the official aws-sdk gem, and you get errors like this: Errno::EPIPE: Broken pipe - when trying to upload, the issue is probably that you need to explicitly set the endpoint appropriately for the region of the bucket you're trying to access. By default, the endpoint is US-specific (US-EAST, I believe) Yes, I know the docs say you don't need to, but try it - if, like me, you find your problem instantly goes away, then Robert is your mothers' brother, as they say. Well, they do round here, anyway. You might find this comprehensive list of AWS endpoints useful. Oh, and it's not immediately obvious from the docs how to set the endpoint - I found it easiest to pass in a :s3_endpoint param when initialising the S3 object, like so -


 my_s3_object = AWS::S3.new( :s3_endpoint => 's3-eu-west-1.amazonaws.com', :access_key_id=>'my access key', :secret_access_key=>'my secret access key')

Monday, January 23, 2012

How many WTFs per language?

Recently I've had to work with PHP an increasing amount. I've not been enjoying it. A few recent IRC conversations also got me thinking - is there an entirely arbitrary but kind of fun (and hey aren't they all arbitrary, really?) metric of programming language FAIL / coder FAILs per-language ?

So on a whim, I decided to invent one: number of entries on The Daily WTF which mention a given language.

Here we go - "web" languages only, numbers correct at time of writing:

.NET - 13,000
Java - 11,300
PHP - 8,100
Javascript - 6,020
Perl - 3,520
Python - 2,270
Ruby - 1,570
Coldfusion - 459

Or in chart form:

So what does this prove? Well, nothing - to draw any conclusions at all, we'd have to normalise this for all kinds of factors such as

number of coders / lines of code using each language (anyone have a decent set of figures?)
average experience level of coder (I'm pretty sure that more beginners use PHP than Perl, for instance)
you name it

...but it is kind of fun. Let the flamewar ~~commence~~ continue!

Tuesday, January 10, 2012

FeedWordpress duplicate posts with YD Domain Mapping

Using FeedWordPress to pull in aggregated feeds to a Wordpress blog? Also using the Wordpress MU Domain Mapping plugin to map an arbitrary domain to your blog? Seeing posts get duplicated? Then read on ....

I've isolated the duplicate posts issue to some kind of interaction with the Wordpress MU Domain Mapping plugin. We're using this so that we can map arbitrary domains to our network sites.

E.g.

our Wordpress Network is set up as wp-network.foo.com, on a subdomain setup
a network site 'bar' would then be bar.wp-network.foo.com
we setup the feedwordpress plugin to pull in posts from a tag:uri feed
we get unique posts in, with GUIDs of the form bar.wp-network.foo.com?guid=(some guid)

All well and good. BUT, then we use the domain mapping plugin to map blog.bar.com to that network site...

It seems that when we hit a network site on its mapped domain (blog.bar.com), it instantly duplicates the syndicated posts with guids of the form blog.bar.com?guid=(the same guid)

I set up 2 identical blogs, subscribed them both to the same feed, and applied a domain mapping to one but not the other. All was fine until I hit the blog homepage on the new mapped domain- at which point all the posts got replicated with mapped domain GUIDs as above.

So I don't know if this is an issue with FeedWordPress or with the Wordpress MU Domain Mapping plugin, but they definitely don't seem to play nicely with each other.

Hope this helps someone!

Instant Badger