The HTML’ers Guide to Regular Expressions Part 1: Cleaning Up Content
Over the years, I’ve written thousands of lines of original code in many languages. Somehow, I missed the chapter on Regular Expressions (RegEx), which are pretty scary even for a programmer type. Recently, though, I was forced to learn RegEx to deal with faulty HTML content being returned by a syndication service. An example of a Regular Expression for HTML:
This is the most useful Regular Expression I have in my arsenal for tackling content that has been given to me in an unknown state. Simply put, it removes all HTML tags that start with “<” and end with “>”, regardless of styles, attributes, or other text inside the tag. If I am using this Expression on an HTML page, I then use my favorite editor to go through the stripped text and restyle and restructure the document. This is often easier than trying to interpret or correct the existing HTML structure.
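As a minimal sketch of the behavior described above, here is one way to strip tags in Python. The pattern `<[^>]*>` is an assumption chosen to match the description (a “<”, anything that is not a “>”, then a “>”); it is not necessarily the exact expression used in the original article, and the names `TAG_PATTERN` and `strip_tags` are made up for illustration.

```python
import re

# Assumed pattern: "<", then any run of characters other than ">", then ">".
# This matches opening tags, closing tags, and tags with attributes alike.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(html: str) -> str:
    """Remove everything that looks like an HTML tag, keeping the text between tags."""
    return TAG_PATTERN.sub("", html)


sample = '<p class="intro">Hello, <a href="/x">world</a> &raquo;</p>'
print(strip_tags(sample))  # Hello, world &raquo;
```

Note that the entity `&raquo;` survives, since it contains neither “<” nor “>”. A pattern this simple can misfire on a literal “>” inside an attribute value or on script content, so it is best suited to the kind of rough cleanup described here rather than to full HTML parsing.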
If I am programmatically manipulating content (e.g. from a database, RSS, or syndication service), I might use something less global:
This is more atomic in nature than the first Regular Expression, as it only finds and replaces emphasis tags. How many times have you had to do a global search and replace on a document where you first searched for <em> (and replaced it with nothing) and then for </em>? This Regular Expression catches both.
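A sketch of that narrower replacement in Python might look like the following. The pattern `</?em>` is an assumption based on the description (an optional “/” lets one expression match both the opening and the closing tag); the names `EM_PATTERN` and `strip_em` are hypothetical.

```python
import re

# Assumed pattern: "<", an optional "/", then "em", then ">".
# IGNORECASE also catches <EM> and </EM>, which show up in older markup.
EM_PATTERN = re.compile(r"</?em>", re.IGNORECASE)


def strip_em(html: str) -> str:
    """Remove <em> and </em> tags in a single pass, leaving all other markup alone."""
    return EM_PATTERN.sub("", html)


sample = "<p>This is <em>very</em> important.</p>"
print(strip_em(sample))  # <p>This is very important.</p>
```

Because the pattern requires “>” immediately after “em”, it leaves tags such as <embed> untouched, but it would also skip an <em> tag that carries attributes.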
Side Note: I tend to preserve HTML entities (like »), because once one is removed it is not obvious that anything is missing from the document. HTML tags are different: they are not part of the content. They describe the document structure and, through the use of CSS, the presentation.
Here are some other useful expressions for dealing with HTML formatting:
Regular Expressions for manipulating content and HTML in content:
Sources for Helpful Regular Expressions:
These tools are helpful in learning or debugging Regular Expressions:
RegEx Coach (PC) http://weitz.de/regex-coach/
RegEx Buddy (PC) http://www.regexbuddy.com/library.html
RegExWidget (OS-X) http://robrohan.com/projects/widgets/
Text Editors w/RegExp support
Having used many different editors over the years, I have definitely built up a bias for certain solutions. I find that the built-in widget of EditPadPro offers the easiest and most efficient use of RegEx in day-to-day HTML editing. Most text editors and applications with RegEx support implement it in the Search-and-Replace window, which is useful, but from a usability perspective you find yourself typing and clicking more with a dialog type of interface. Nowadays, I use RegEx so much in my work that the extra typing and clicking is a big deal. You may not care about this at all.
Lots of applications support RegEx:
Macromedia Dreamweaver (Find/SnR Dialog)
Microsoft FrontPage (Find/SnR Dialog)
Microsoft Visual Studio
SlickEdit (Find/SnR Dialog)
e/TextMate (Find/SnR Dialog)
EditPadPro (has a great SnR w/RegEx widget in the main edit window)
VIM (Find/SnR Dialog)
Eclipse (Find/SnR Dialog)
Aptana (widget in the main edit window)
Programming & Scripting Languages that support Regular Expressions
In fact, most programming languages support RegEx through various libraries or objects. We call out this list so that if you happen to use one (or many) of these, you know you have a simple way to try out RegEx in a familiar environment.
VB, VBScript, VB.NET, C#, VBA
* Some would say PERL and Regular Expressions are too closely knit to consider RegEx a “part” of PERL. The truth is that PERL can be thought of as an extension of RegEx.