regex vs parsing

Jeff Atwood wrote a really good blog post a while back about the perils of abusing regular expressions, which you can read here. The basic gist was that you shouldn’t rely on regexes as the solution to every problem (which languages like Perl encourage you to do). I agree with this. However, there are some cases where regular expressions are simply the most expedient (but perhaps not the most efficient) solution, and you just cannot ignore the simplicity and convenience they offer.

One of the things we do at work is analyze HTML from news sites. Lots of it. We need to be able to look at a page of HTML and extract certain sections for information gathering and processing. Programming purists will tell you that you MUST write a proper HTML parser with a lexer and tokenizer, because there exists some input that will break your regex. That’s fine and dandy for something with a broad input spectrum (blogs, forums, etc.), but we deal mainly with the output of news CMSes. In fact, many news organizations use similar CMSes, which makes this job a lot easier. You just have to see the CMS output pattern/template and write a regex to extract the info you need.
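As a rough illustration of the kind of extraction described above, here is a minimal Python sketch. The `article-body` class name, the sample markup, and the `extract_body` helper are all made up for the example and are not taken from any real CMS.

```python
import re

# Hypothetical template: the article text sits inside
# <div class="article-body"> ... </div> with no nested <div> inside it.
ARTICLE_BODY = re.compile(
    r'<div class="article-body">(.*?)</div>',
    re.DOTALL | re.IGNORECASE,
)


def extract_body(html):
    """Return the inner HTML of the first article-body div, or None if absent."""
    match = ARTICLE_BODY.search(html)
    return match.group(1).strip() if match else None


sample = """
<html><body>
<div class="article-body">
<p>City council approves new budget.</p>
</div>
</body></html>
"""

print(extract_body(sample))
```

The non-greedy `(.*?)` stops at the first closing `</div>`, so this only holds up as long as the template never nests another `<div>` inside the article body. That is exactly the sort of assumption you can get away with when you know the CMS output.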

What are the advantages to this? Fast turnaround time. In an industry that’s fast-paced and constantly changing, you can roll out changes and keep up with any news site that decides to change its article structure every few months. Instead of retooling a DOM-based parser, where you’d probably have to change a complicated document definition for every source, you can simply adjust a client-specific regex and be rolling in a matter of minutes.
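One way to keep those client-specific regexes easy to adjust is a simple lookup table keyed by source. This is only a sketch: the source names, the markup in the patterns, and the `extract` helper are hypothetical.

```python
import re

# Hypothetical per-source patterns. Each one exposes a single named group, "body".
SOURCE_PATTERNS = {
    "example-times": re.compile(r'<div id="story">(?P<body>.*?)</div>', re.DOTALL),
    "example-post": re.compile(
        r'<section class="content">(?P<body>.*?)</section>', re.DOTALL
    ),
}


def extract(source, html):
    """Apply the pattern configured for `source` and return the body, or None."""
    pattern = SOURCE_PATTERNS.get(source)
    if pattern is None:
        raise KeyError(f"no pattern configured for source {source!r}")
    match = pattern.search(html)
    return match.group("body").strip() if match else None


# When a site changes its template, only its own entry needs to change.
SOURCE_PATTERNS["example-times"] = re.compile(
    r'<article class="story">(?P<body>.*?)</article>', re.DOTALL
)
```

Because every pattern uses the same group name, the calling code stays the same no matter which source's template changed.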

What are the disadvantages? Efficiency. Regexes aren’t known for being the fastest tool in the programmer’s toolbox. While your regex is being evaluated, there are hundreds, if not thousands, of computations going on, and regex trees are being built, just to see if there’s a match.

In the end, researching a non-regex-based solution to this problem isn’t exactly a waste of time. But you simply cannot ignore the simplicity this solution offers in this type of industry-specific situation. Most of the arguments I’ve seen against regex-based parsing of HTML revolve around hypothetical input scenarios, but I’ve rarely seen anything that I couldn’t write a regex for. Even something as bad as this (which was lifted from one of the biggest news sources around):

<span class="focusParagraph"><p>
<span class="articleLocatio</span>n">

Keep It Simple Stupid

Keep It Simple Stupid (KISS): the old adage impressed on all computer science students in their freshman year. I was reminded of it just now when I finally figured out a problem that had been hounding me for a while on one of my test development sites. I’m trying to delve into the world of Python web development, but was having a really weird problem where every file I uploaded was zeroing out on the server. At first I thought maybe it was a temp directory issue, or a permissions issue. Finally, I tried logging into WHM to see if I had gone over my file quota. Well whaddaya know. I forgot to configure the domain with a development package. Oops. On the plus side, Python development is now a GO!

I think I’ll try writing a simple CMS from scratch to get a basic feel for the language and then graduate to one of the available web frameworks like web2py or Django. I want to always have something in the works so that I’m constantly learning something new. Working at a Microsoft shop isn’t a bad thing, but I feel like I’m being pigeonholed into a specific area of the programming industry that is dependent on the direction given by a handful of guys in Redmond. I’d rather keep my options open.

In other news, I’m still finishing up my Star Trek episode review/discussion site. So far, it’s got Google OpenID and Facebook Graph API authentication working. I’ve got adding, editing, and deleting comments for episodes working. I have a rudimentary wiki scrape script that I ran a while ago on, the ultimate Star Trek reference site. I think all I really have left to do for this first iteration is public-facing user profile pages, maybe some kind of unique username system, perhaps a more clearly defined “home” or front page for the user, and anchoring for comments (to facilitate direct linking).

Review: Red Dead Redemption



GTA4 in the wild wild west. That is essentially the entire game in a nutshell. If you didn’t like GTA4, I seriously doubt you would like RDR, but who knows. It’s an open-world third-person action game with a main quest line and optional side quests.


RDR handles just like GTA4, so depending on what you thought of that, you may or may not like it. For me, the controls were straightforward and the gameplay was smooth. I’m not the best at handling myself in Rockstar’s games (i.e., bumping into the sides of door entrances, etc.), but I think that’s a fault of my own and not the game’s. The aiming system made much of the combat effortless, but you still had to duck and find cover if you didn’t want to get pumped full of holes. One thing that annoyed me was the lassoing. It could be my fault too, but it just seemed like it wasn’t as smooth as it should have been. Perhaps I was hitting “Y” too many times, though. I’d catch up to someone, lasso them and dismount, but then the lasso would not stay. If I wasn’t fast enough at lassoing them again while dismounted, I’d have to get back on my horse and try again until I got them on the ground and hogtied them. Again, it’s most likely my fault, but it sure was annoying, since you’re enticed into bringing in bounties alive to get more money.


I thought RDR looked and sounded fantastic. There were times when it almost seemed like a movie to me. The cutscenes weren’t as crisp as they could have been, but the in-game graphics were great. I loved the little audio snippets John would throw out while in combat as well. Trash talking in the wild wild west? Yes please! Also, I know I’m not the only one who would occasionally ride between locations (instead of fast traveling) just so I could watch myself ride into the sunset.


I’ll admit that it took me a few hours to get into RDR initially, but after that I was hooked. I don’t want to spoil anything, but the game’s ending was just extremely poetic to me and opens up the possibility of a sequel (which I hope Rockstar is working on right now). If you’re not a fan of the GTA series or of open world sandbox games, RDR probably isn’t for you. However, if you are a fan of those things, pick this up. I’ve heard the multiplayer is also really fun, but have yet to play that.

Happy New Year 2011

Happy New Year! Everyone has lame resolutions that all revolve around health and food. Since I don’t really need to lose any weight and I’ve already quit smoking, my 2011 resolutions are as follows:

  • Play MORE video games.  Try and beat them too since I have a nasty habit of just dropping them midway through.
  • Learn Python and Java.
  • Try and complete more of my programming side projects instead of leaving them half-done (this seems to be a recurring theme in my life).
  • Learn to drive stick.
  • Try and do some more grad school classes.  I had put them off to pay off some debt first, but I should really try and get this done.
  • Pay off debt.
  • Read more books.
  • Be less political.

I think I had one resolution in 2010 and that was to quit smoking.  Well, I got that done!  Let’s see how many I get done for 2011.