Thursday, May 18, 2006

java.lang.StackOverflowError with RegEx backreferences

Just got to the bottom of an issue that I thought I'd share: one particular page on a clients' site was bombing out, saying

Stack Trace: java.lang.StackOverflowError


...with no further debug info about line number, template, or anything useful.

It's a CMS / database driven site, so we downloaded a copy of the live DB to try and replicate the issue locally - no luck, it worked fine.

So straight away, we suspected the code - if it's not the data, it's got to be the code, right? And if it works on local dev boxes, but not live, logically you suspect that the code on live is *shudder* .....not the same as the code on dev....

( Cue lightning flashes, howling wolves, dramatically tense organ music, women fainting with the back of their hand to their forehead, etc etc )

But no - the relevant code was exactly the same on live as in our SVN repository. In fact, neither copy had changed since April *last year*.

So when all else failed, we resorted to the good old-fashioned, tried-and-tested, never-fails method of advanced debugging - *ahem* commenting stuff out until it works, and putting it back in line-by-line until it breaks again, at which point, you should have identified the offending line of code.

We narrowed it down to one particular dsp_ file in the offending fuseaction, but there was still bizarreness aplenty - if I cut out a cfloop around a query, it still died, but if I cut out the surrounding pure html <table> and </table> tags, it worked.

Now how weird is that?

To cut a long story short, after much to-ing and fro-ing and "**** me, that's weird"-ing with three of us round the screen, I eventually realised that it wasn't anything in the dsp_ file at all, the problem was actually due to a Regular Expression we use in the lay_default.cfm layout file, for post-processing of content.

It's a site that uses slash-separated parameter lists for Search-Engine-Safe URLs, but it does this by post-filtering the rendered HTML in fusebox.layout with some reg-ex, looking for href="..." and replacing it.

While it's at it, it also takes the opportunity to strip out any unnecessary whitespace from between tags, like this :


sString = REReplaceNoCase(arguments.sString, ">([[:space:]]){1,}<", "> <", "ALL");


Now, hands up who sees the problem with that? Took me a couple of looks...

The problem is with the placing of the brackets. Let's parse that reg-ex into english:

">([[:space:]]){1,}<" == look for a closing tag followed by at least one whitespace character, and capture it.

Capture what?

This was actually creating one separate back-reference for EACH whitespace character. So if you have 1000 whitespace characters, it creates 1000 backreferences. Evidently someone must have put just enough whitespace into that particular content that on a busy server, the stack was just pushed over the edge.

What it should have done, of course, is put the {1,} (which means "at least once") inside the capturing bracket, like this:


">([[:space:]]{1,})<"


...thus creating one backreference for the whole string of whitespace. In fact, as the replace-with string was just a hard-coded single-space character ( "> <" ) there was no need for the backreference to be captured at all.

So are Regular Expressions evil? No, but they are extremely powerful, and as anyone who has seen Spiderman will know, "with great power comes great responsibility". They can also be a real pain to debug :)

3 comments:

Rob Wilkerson said...

Since you're not using the capture, why capture it at all?

">(?:[[:space:]]{1,})<"

Or even

">\s+<"

Either of those should provide the same result (untested, of course) without any capturing...

Alistair Davidson said...

rob : zactly! See second-to-last paragraph.

Anonymous said...

I know this is a couple years old, but thanks for leaving this up. I have a completely different regex expression, but the concept is similar. This should fix it. Thanks!