Thursday, September 09, 2004

Regular Expressions

Pete has some reasonable complaints about the cryptic nature of regular expressions. They're powerful but difficult to read. The example he provides for matching an email address is horrible. There ought to be a better way to describe this. Pete provides one approach, which is similar to using BNF notation to break the expression up into named components. An alternative that doesn't require inventing a new language would be to build up a regex from basic string chunks and document how the pieces come together. In the O'Reilly book Mastering Regular Expressions (aka the Hip Owls book) the author includes Perl code that does this for a complex email address regex. The code is quite readable, much better than the resulting 6,598-character regex. If you want to learn more about regular expressions, this is a good book. It describes how regex matchers work and covers the variants of regular expressions in Perl, Tcl, Awk, GNU Emacs, etc.
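To make the "build it up from chunks" idea concrete, here is a minimal sketch in Python rather than Perl. It is not the code from the book and not Pete's regex. The pattern is a deliberately simplified email matcher, and the names ATOM, LOCAL_PART, LABEL, DOMAIN, EMAIL and is_email are made up for illustration. The point is only that each piece gets a name and a comment before being assembled into the final expression.

```python
import re

# Each chunk is a small regex fragment with a name and a purpose.
# NOTE: simplified on purpose; this is an assumption-laden sketch,
# not a full RFC-compliant address matcher.

# A run of characters allowed in an unquoted word of the local part.
ATOM = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"

# The local part: one or more atoms separated by single dots.
LOCAL_PART = rf"{ATOM}(?:\.{ATOM})*"

# One domain label: letters, digits, and inner hyphens only.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"

# The domain: at least two labels separated by dots.
DOMAIN = rf"{LABEL}(?:\.{LABEL})+"

# The assembled expression: local part, "@", domain.
EMAIL = re.compile(rf"{LOCAL_PART}@{DOMAIN}")


def is_email(text):
    """Return True if the whole string matches the simplified pattern."""
    return EMAIL.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["pete@example.com", "a..b@example.com", "no-at-sign"]:
        print(candidate, is_email(candidate))
```

The assembled pattern is still unreadable if you print EMAIL.pattern, but nobody has to read it. The definitions above it are the documentation.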

I've also found a couple of applications that can be useful for explaining and debugging regular expressions:

The Regex Coach is "donationware". It was written by Dr. Edmund Weitz in Common Lisp and runs as a standalone application on Windows and Linux. It can show a tree representation of a regular expression and single-step through the matching.

RegexBuddy is a commercial product and is somewhat slicker than The Regex Coach. Here's RegexBuddy's explanation of Pete's email address regular expression. Notice how it explains each portion while highlighting the matching text. Also, if you click through the links, it gives more details on the rules of regular expressions. Cool.