So we fired up a console and ran SHOW INNODB STATUS, and saw lots of transactions with long-held locks -
---TRANSACTION 0 200086649, ACTIVE 3000 sec, process no 29791, OS thread id 140353331377936
12 lock struct(s), heap size 3024, 6 row lock(s), undo log entries 10
MySQL thread id 243, query id 314676 ip-10-94-245-79.ec2.internal 10.94.245.79 tn
- however, when we cross-referenced the MySQL thread IDs with a SHOW PROCESSLIST, we found that lots of the threads weren't actually doing anything:
+------+------------+-----------+-----------------+----------------+------+--------------+------------------+
| Id   | User       | Host      | db              | Command        | Time | State        | Info             |
+------+------------+-----------+-----------------+----------------+------+--------------+------------------+
| 243  | tn         | (..snip..)| tn_production   | Sleep          | 3000 |              | NULL             |
+------+------------+-----------+-----------------+----------------+------+--------------+------------------+
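Incidentally, if your MySQL is new enough to have information_schema.INNODB_TRX (the InnoDB plugin on 5.1, or 5.5 and later), you don't have to do this cross-referencing by eye - a query along these lines will list the idle connections that still have an InnoDB transaction open:

-- sleeping connections that are still holding an open InnoDB transaction
SELECT p.id, p.user, p.host, p.time, t.trx_started
FROM information_schema.INNODB_TRX AS t
JOIN information_schema.PROCESSLIST AS p ON p.id = t.trx_mysql_thread_id
WHERE p.command = 'Sleep';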
This was strange - but after a bit of Googling and a few minutes' thought, we realised the cause.
When one of our worker processes dies - e.g. because monit has killed it for taking up too many resources for too long, or because the EC2 instance it's running on has become unresponsive to ssh and been automatically restarted by our detect-and-rescue script - it may be in the middle of a transaction at the point it's killed.
If the transaction was holding locks, never got to the "COMMIT" stage, and the connection wasn't closed cleanly, then that connection persists - and keeps hold of its locks - until it gets timed out by the MySQL server... and the default wait_timeout is 28800 seconds - 8 hours!
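To see what a server is actually using:

SHOW VARIABLES LIKE 'wait_timeout';

(Bear in mind that interactive clients like the mysql console get their timeout from interactive_timeout instead.)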
So, we killed the idle connections, and the update queries started going through much more quickly again. We've now brought our wait_timeout variable down to 10 mins, and we'll see how we get on.
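For anyone wanting to do the same, a rough sketch: KILL takes the Id column from SHOW PROCESSLIST, and SET GLOBAL wait_timeout only affects connections opened after it runs, so the new value is worth adding to my.cnf too if it should survive a restart (the exact numbers below are just illustrative):

KILL 243;                        -- repeat for each idle thread Id
SET GLOBAL wait_timeout = 600;   -- 10 minutes, picked up by new connections

-- and in my.cnf, under [mysqld], to make it permanent:
--   wait_timeout = 600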