The HTML'ers Guide to Regular Expressions Part 1: Cleaning Up Content



Over the years, I've written thousands of lines of original code in many languages. Somehow, I missed the chapter on Regular Expressions (RegEx), which are pretty scary even for a programmer type. Recently, though, I was forced to learn RegEx to deal with faulty HTML content being returned by a syndication service. An example of a Regular Expression for HTML:
<.*?>

This is the most useful Regular Expression I have in my arsenal for tackling content that has been given to me in an unknown state. Simply put, it matches every HTML tag, meaning anything that starts with "<" and ends with ">", regardless of styles, attributes, or other text inside the tag; replacing each match with nothing removes the tags. After I use this Expression on an HTML page, I go through the document in my favorite editor and restyle and restructure it. This is often easier than trying to interpret or correct the existing HTML structure.
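
As a minimal sketch of the idea above, assuming Python's standard `re` module (the article itself does not name a language) and an invented sample string, replacing every match of `<.*?>` with an empty string strips the tags. The name `strip_tags` is hypothetical.

```python
import re

# Non-greedy: "<", then as few characters as possible, then ">".
TAG_PATTERN = re.compile(r"<.*?>")


def strip_tags(html: str) -> str:
    """Replace every match of <.*?> with nothing, leaving only the text."""
    return TAG_PATTERN.sub("", html)


sample = '<p class="intro">Hello, <b>world</b>!</p>'
print(strip_tags(sample))  # Hello, world!
```

Two caveats worth knowing: by default `.` does not match a newline, so a tag that spans several lines is skipped unless the pattern is compiled with `re.DOTALL`, and a ">" inside an attribute value will end the match early.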



If I am programmatically manipulating content (e.g. from a database, RSS, or syndication service), I might use something less global:
</?em>

This is more atomic in nature than the first Regular Expression, as it only finds and replaces emphasis tags. How many times have you had to do a global search and replace on a document where you first searched for <em> (and replaced it with nothing) and then for </em>? This Regular Expression catches both.
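
The one-pass replacement described above might look like this in Python. The pattern in this sketch, `</?em\b[^>]*>`, is one explicit way to spell "an opening or closing em tag, with or without attributes"; treat it as an assumption rather than the only correct form. The name `strip_em` and the sample string are invented for illustration.

```python
import re

# "</?em"  matches "<em" or "</em"
# "\b"     requires a word boundary, so "<embed>" is left alone
# "[^>]*"  allows attributes such as class="x"
# ">"      closes the tag
EM_PATTERN = re.compile(r"</?em\b[^>]*>", re.IGNORECASE)


def strip_em(html: str) -> str:
    """Remove opening and closing em tags in a single pass."""
    return EM_PATTERN.sub("", html)


sample = '<p>This is <em class="x">very</em> important.</p>'
print(strip_em(sample))  # <p>This is very important.</p>
```

Only the emphasis tags disappear; the surrounding paragraph tags and the text between the em tags stay put.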



Side Note: I tend to preserve HTML entities (like &raquo;, which renders as »), because once an entity has been stripped it is not obvious that anything is missing from the document. HTML tags are different: they are not part of the content; they describe the document structure and, through the use of CSS, the presentation.
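
To illustrate the side note, here is a small Python sketch (the sample string is invented) showing that a tag-stripping replace leaves an entity such as `&raquo;` in place, since the entity contains no angle brackets. If the entity needs to become a real character later, the standard library's `html.unescape` can do that as a separate, deliberate step.

```python
import html
import re

sample = '<a href="/next">Next page &raquo;</a>'

# Strip the tags only; the entity has no "<" or ">" so it is untouched.
text_only = re.sub(r"<.*?>", "", sample)
print(text_only)                 # Next page &raquo;

# Optional, separate step: decode entities into characters.
print(html.unescape(text_only))  # Next page »
```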



Here are some other useful expressions for dealing with HTML formatting:



Regular Expressions for manipulating content and HTML in content:





Sources for Helpful Regular Expressions:

http://regexlib.com/Search.aspx?k=HTML

http://www.regular-expressions.info/examples.html



RegEx Tools

These tools are helpful in learning or debugging Regular Expressions:

RegEx Coach (PC) http://weitz.de/regex-coach/

RegEx Buddy (PC) http://www.regexbuddy.com/library.html

RegExWidget (OS X) http://robrohan.com/projects/widgets/



Text Editors with RegEx Support

Having used many different editors over the years, I have definitely built up a bias for certain solutions. I find that the built-in widget of EditPadPro offers the easiest and most efficient use of RegEx in day-to-day HTML editing. Most text editors and applications with RegEx support implement it in the Search-and-Replace window, which is useful, but from a usability perspective you find yourself typing and clicking more with a dialog type of interface. Nowadays I use RegEx so much in my work that the extra typing and clicking is a big deal. You may not care about this at all.



Lots of applications support RegEx (SnR below is short for Search and Replace):

Macromedia Dreamweaver (Find/SnR Dialog)

Microsoft FrontPage (Find/SnR Dialog)

Microsoft Visual Studio

SlickEdit (Find/SnR Dialog)

e/TextMate (Find/SnR Dialog)

EditPadPro (has a great SnR w/RegEx widget in the main edit window)

VIM (Find/SnR Dialog)

Eclipse (Find/SnR Dialog)

Aptana (widget in the main edit window)



Programming & Scripting Languages that support Regular Expressions

Most programming languages support RegEx through various libraries or objects. I call out the list below so that, if you happen to use one (or many) of these languages, you know you have a simple way to try out RegEx in a familiar environment.

PERL*

JavaScript

PHP

Python

Ruby

C, C++

VB, VBScript, VB.NET, C#, VBA



* Some would say PERL and Regular Expressions are too closely knit to consider RegEx a "part" of PERL. The truth is that PERL can be thought of as an extension of RegEx.
