regex vs parsing

Jeff Atwood wrote a really good blog post a while back about the perils of abusing regular expressions. The basic gist was that you shouldn’t reach for a regex as the solution to every problem (which languages like Perl encourage you to do). I agree with this. However, there are some cases where regular expressions are the most expedient (though perhaps not the most efficient) solution, and you just cannot ignore the simplicity and convenience they offer.

One of the things we do at work is analyze HTML from news sites. Lots of it. We need to be able to look at a page of HTML and extract certain sections for information gathering and processing. Programming purists will tell you that you MUST write a proper HTML parser with a real tokenizer, because there exists some input that will break your regex. That’s fine and dandy for something with a broad input spectrum (blogs, forums, etc.), but we deal mainly with the output of news CMSes. In fact, many news organizations use similar CMSes, which makes this job a lot easier. You just have to look at the CMS output pattern/template and write a regex to extract the info you need.

What are the advantages of this? Fast turnaround time. In an industry that’s fast-paced and constantly changing, you can roll out changes and keep up with any news site that decides to change its article structure every few months. Instead of retooling a DOM-based parser, where you’d probably have to change a complicated document definition for every source, you can simply adjust a client-specific regex and be rolling again in a matter of minutes.
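To make the "client-specific regex" idea concrete, here is a minimal sketch in Python. The client names, the class attributes, and the `extract_body` helper are all made up for illustration; they are not our actual patterns or any real CMS's markup.

```python
import re

# Hypothetical per-client patterns. The client keys and class names are
# invented for this example and do not come from any real news site.
CLIENT_PATTERNS = {
    "example-news": re.compile(
        r'<div class="article-body">(?P<body>.*?)</div>',
        re.DOTALL | re.IGNORECASE,
    ),
    "example-wire": re.compile(
        r'<span class="storyText">(?P<body>.*?)</span>',
        re.DOTALL | re.IGNORECASE,
    ),
}


def extract_body(client, html):
    """Return the article body for a client, or None if the template did not match."""
    match = CLIENT_PATTERNS[client].search(html)
    return match.group("body").strip() if match else None


page = '<html><body><div class="article-body">\n<p>Story text.</p>\n</div></body></html>'
print(extract_body("example-news", page))
```

When a site changes its template, the fix in this sketch is a one-line edit to that client's entry. Note that the non-greedy `.*?` stops at the first closing tag, so a template that nests the same tag inside the body would need a more specific end marker.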

What are the disadvantages? Efficiency. Regexes aren’t known for being the fastest tool in the programmer’s toolbox. The pattern first has to be compiled into an internal representation, and while it is being evaluated the engine typically tries and backtracks through hundreds, if not thousands, of candidate positions to see if there’s a match.
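Whether that cost actually matters is easy to measure. The following is a rough sketch, assuming Python and a made-up page and pattern: it compiles the regex once up front so the compile step is not repeated per page, then times the match with the standard `timeit` module. The numbers will vary by machine and input, so none are quoted here.

```python
import re
import timeit

# Compile once at import time so the pattern is not rebuilt for every page.
# The class name is hypothetical, as is the synthetic page below.
BODY_RE = re.compile(r'<div class="article-body">(.*?)</div>', re.DOTALL)

page = (
    "<html><body>"
    + "<p>filler</p>" * 1000
    + '<div class="article-body">Story</div></body></html>'
)


def extract(html):
    """Return the captured article body, or None when the pattern does not match."""
    match = BODY_RE.search(html)
    return match.group(1) if match else None


# Time 1000 extractions and report the average cost per page.
runs = 1000
seconds = timeit.timeit(lambda: extract(page), number=runs)
print(f"{seconds / runs * 1e6:.1f} microseconds per page")
```

If the per-page figure is small next to the time spent fetching the page over the network, the efficiency argument carries less weight for this kind of workload.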

In the end, researching a non-regex-based solution to this problem isn’t exactly a waste of time. But you simply cannot ignore the simplicity regexes offer in this type of industry-specific situation. Most of the arguments I’ve seen against regex-based parsing of HTML revolve around hypothetical input scenarios, but I’ve rarely seen anything that I couldn’t write a regex for. Even something as bad as this (which was lifted from one of the biggest news sources around):

<span class="focusParagraph"><p>
<span class="articleLocatio</span>n">
