Concurrent Processing And PHP+MySQL, And Challenges Abound

…lets assume I have a table “tasks”, each row containing a simple problem and a solution. Say I want to create a script that would fetch some problems that still lack a solution, process them and store the results back into the table. And for some reasons I want to run several instances of that PHP script in parallel, each instance automatically choosing a group of tasks to work on.

The main problem is how to make sure one instance of the script doesn’t touch rows currently being worked on by another script. At first I thought this would be straightforward to implement since MySQL supports row-level locking when using the InnoDB engine. It seemed intuitively clear – lets just lock the rows in use by one instance so other threads can’t touch them. You can create a row-level lock by using the “SELECT .... FOR UPDATE” query.

That’s where the real trouble began.

I thought I’d use something like this to select the rows to work on –

SELECT * FROM tasks WHERE solved=0 LIMIT 10 FOR UPDATE

Not a good idea. SELECT ... FOR UPDATE doesn’t just lock the rows it selects, it locks all rows it looks at. So if I had more than 10 unsolved problem rows in the table “tasks” they’d all get locked. One could can’t work around this by first finding the matching rows, collecting primary keys from all of them and then locking the rows by “...WHERE primary_key_field IN (X, Y, Z...)“, but that’s beside the point. Even if you only lock the rows you need you might still get in trouble since another “SELECT” (from a different instance of the script) that comes across rows locked “FOR UPDATE” will block until the rows are unlocked.

Maybe there’s a “right” way to solve this, I could not find it even after hours of searching the Web. What I did was add another column to the table and call it “busy_since”. Then I would (pseudo-code follows) –

LOCK TABLE tasks FOR WRITE
SELECT * FROM tasks WHERE solved=0 AND busy_since < '$lock_limit' LIMIT 10
for each selected row {
   UPDATE tasks SET busy_since = NOW() WHERE primary_key = $primary_key
}
UNLOCK TABLES

‘$lock_limit’ is date/time calculated by the PHP script based on a “grace period”; it ensures that “abandoned” tasks don’t remain unprocessed forever.

After the script finished working with the rows it would update them to busy_since = '0'.

In essence what I did was create a kind of home-made “locking mechanism” that marks selected rows as “busy” so they wouldn’t be touched by other scripts. And just in case the script the first script crashes there is a timeout period – rows that have been “busy” for a long time are assumed to be unlocked.

By the way, this is not just a theoretical rant. I had this problem with a new project of mine that… well, let’s just say it has to do with del.icio.us and crawling webpages 😉

Related posts :

Leave a Reply