The HTML’ers Guide to Regular Expressions

The HTML’ers Guide to Regular Expressions Part 1: Cleaning Up Content


Over the years, I’ve written thousands of lines of original code, in many languages. Somehow, I missed the chapter on Regular Expressions (RegEx), which are pretty scary even for a programmer type, but recently I was forced to learn RegEx to deal with faulty HTML content being returned by a syndication service. An example of a Regular Expression for HTML:

<.*?>


This is the most useful Regular Expression I have in my arsenal for tackling content that has been given to me in an unknown state. Simply put, it matches every HTML tag that starts with “<” and ends with “>”, regardless of styles, attributes, or other text inside the tag, so replacing each match with nothing strips the markup. If I am using this Expression on an HTML page, I then use my favorite editor to go through and restyle/restructure the document. This is often easier than trying to interpret or correct the existing HTML structure.


If I am programmatically manipulating content (e.g. from a database, RSS, or syndication service) I might use something less global:

</?em\b[^>]*>


This is more atomic in nature than the first Regular Expression, as it only finds/replaces emphasis tags. How many times have you had to do a global search and replace on a document where you first searched for <em> (and replaced it with nothing) and then for </em>? This Regular Expression catches both.
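As a rough illustration of the two approaches, here is a minimal Python sketch. The first pattern is the one shown above; the emphasis-only pattern is my own assumption of a strict form (optional slash, the whole word "em", then any attributes), and the function names and sample string are made up for the example.

```python
import re

# Pattern from the article: any tag, matched lazily so each tag is matched on its own.
ANY_TAG = re.compile(r"<.*?>")

# Assumed stricter pattern for emphasis tags only: optional slash, the name
# "em" as a whole word, then anything up to the closing bracket.
EM_TAG = re.compile(r"</?em\b[^>]*>", re.IGNORECASE)


def strip_all_tags(html: str) -> str:
    """Remove every tag, keeping the text and entities such as &raquo;."""
    return ANY_TAG.sub("", html)


def strip_em_tags(html: str) -> str:
    """Remove only <em ...> and </em>, leaving all other markup alone."""
    return EM_TAG.sub("", html)


# Hypothetical sample content, including an <embed> tag that must NOT be
# mistaken for an emphasis tag.
sample = '<p class="x">Read <em>this</em> &raquo; <embed src="a.swf"></p>'

print(strip_all_tags(sample))
print(strip_em_tags(sample))
```

Note that "." does not match a line break by default, so a tag that is split across two lines will slip past the first pattern unless the expression is compiled with re.DOTALL.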


Side Note: I tend to preserve HTML entities (like &raquo;), because once they are stripped it is not obvious that anything is missing from the document. HTML tags are different: they are not part of the content. They describe the document structure and, through the use of CSS, the presentation.


Here are some other useful expressions for dealing with HTML formatting:


Regular Expressions for manipulating content and HTML in content:



Sources for Helpful Regular Expressions:

http://regexlib.com/Search.aspx?k=HTML

http://www.regular-expressions.info/examples.html


RegEx Tools

These tools are helpful in learning or debugging Regular Expressions:

RegEx Coach (PC) http://weitz.de/regex-coach/

RegEx Buddy (PC) http://www.regexbuddy.com/library.html

RegExWidget (OS-X) http://robrohan.com/projects/widgets/


Text Editors w/RegExp support

Having used many different editors over the years, I have definitely built up a bias for certain solutions. I find that the built-in widget of EditPadPro offers the easiest and most efficient use of RegEx in day-to-day HTML editing. Most of the text editors or applications with RegEx support implement it in the Search-and-Replace window, which is useful, but from a usability perspective, you find yourself typing and clicking more to use the dialog type of interface. Nowadays, I use RegEx so much in my work that the extra typing and clicking is a big deal. You may not care about this at all.


Lots of applications support RegEx:

Macromedia Dreamweaver (Find/SnR Dialog)

Microsoft Frontpage (Find/SnR Dialog)

Microsoft Visual Studio

SlickEdit (Find/SnR Dialog)

e/TextMate (Find/SnR Dialog)

EditPadPro (has a great SnR w/RegEx widget in the main edit window)

VIM (Find/SnR Dialog)

Eclipse (Find/SnR Dialog)

Aptana (widget in the main edit window)


Programming & Scripting Languages that support Regular Expressions

In fact, most programming languages support RegEx through various libraries or objects. We call out this list so that if you happen to use one (or many) of these, you know you have a simple way to try out RegEx in a familiar environment.

PERL*

JavaScript

PHP

Python

Ruby

C, C++

VB, VBScript, VB.NET, C#, VBA


* Some would say PERL and Regular Expressions are too closely knit to consider RegEx a “part” of PERL. The truth is that PERL can be thought of as an extension of RegEx.

Conversion Rates & Web Advertising

Introduction


I saw this article in the news, which is about a company suing Google, Overture, and others over click-fraud. The article reports: “Lane’s said ads are often clicked only to generate a bigger bill for advertisers, not by someone truly seeking more information.” One could come to that conclusion when running a web advertising campaign and seeing ZERO results. Another conclusion that could be reached is that the site has such poor conversion capabilities that it may never get a real customer. If I were Google et al., I would seriously consider getting usability and e-Commerce experts to look at the deficiencies in the site and see if the second conclusion is more likely (statistically speaking).

Wide and Deep


Conversions on the web are never cut-and-dried, any more than they are for direct mailer promotions. Even the best turn-key systems need refinement and some experimentation to yield the best results. Often, defining what really constitutes a conversion can lead to interesting avenues of exploration for increasing conversions, because there is more to go after. Consider that not all visitors to your site are in a buying mood. Would you rather have them just leave the site, or perhaps leave their email address or other contact information? An “add me to your news list” is one way to squeeze more blood from the conversion turnip. There are many other examples of broadening the horizon. A good place to start is looking at the competition, and then scanning the various books about e-commerce design to build a list. Prioritize this list against what you understand your visitors’ needs to be. When in doubt, set up a survey!

Usability & Functionality


I was recently drilling down into a product sales web site that was developed by another company, and discovered that the usability left much to be desired. To add insult to injury for the random customer, the shopping cart was broken, and the checkout process was really basic and did not recover well when I went back and forth. We were already working with the owners of the site to give it a total make-over, and I asked for all of the web logs. After running some extensive analysis, the raw unique-visitor-to-conversion rate was 0.02% averaged over six months. Once the site is up and running on an e-commerce engine that actually works, it will be telling to see the difference in conversion rates. More: In MagicLamp Networks Newsletter Volume 4, we looked at how poorly designed sites can make web advertising a bust.

Suspect SEO Partners (was SEO Scam)

In this day and age, everyone is trying to figure out how to get their website more visibility. Of course, obtaining position #1 for all of your key words/phrases seems like the Holy Grail. Someone calls you up, asks you if you have good standings for all of your phrases, and when you answer “No”, they launch into a pitch about how they (and only they) can give you rankings.

So how does it play out in real life?
1) SEO company calls website owner: “Hey, how would you like to be ranked #1 for any search term you want?”
2) Customer agrees to pay $5k up-front for “work” on their site, and additional $500/month maintenance
3) SEO Company gets customer to top by “spamming” and/or “cloaking”
4) Customer stays at top for 6 months (additional $3k)
5) Customer gets banned by search engines for spamming and/or cloaking
6) SEO company walks away with $8k for 6 months of traffic = expensive
7) Customer has to pay someone else to come in and remove spam/cloak code from website AND get off the banned lists with the search engines

Spamming refers to two techniques. The first involves fabricating huge numbers (1,000’s) of websites that have your key words all over them and point back to your site. It is far more effective than the second, which is to fabricate huge numbers of pages on your own site with high but varying degrees of density for your key words. Google, Yahoo!, and MSN Search all expressly forbid this type of “gaming” of their search engines and spiders.

Cloaking is a far grayer area, as the technique can have genuine applications. Cloaking is basically a piece of code that sits on every page and detects what type of browser or crawler is asking for a given page. Based on the result (IE6, Firefox, googlebot, Yahoo! Slurp), the code decides what content will be served up. Why would this be nefarious, you ask? Well, if it’s googlebot, and the code feeds googlebot a page of keywords and nothing that is human readable, while showing the latest sales pitch to an IE6 browser, you start to see how this can go wrong.

A combination of both spamming and cloaking seems to be the M.O. for many of these firms at the moment, but there will surely be other ways to “game” the search engines.

There are, of course, other ranking scams that are not as damaging, but seem to be even more pervasive, and just as costly. We call them “who searches by that?”. The sales person calls you and offers to help find key words that their company has “exclusive rights” to on Yahoo! or Google’s promotion placement (top or right side). They then look up how much that search term costs, and come back with an amazingly low price. “Oh look, that’s only $85 a month. How about a couple of other search terms then?” By the time you get off the phone, you are committed to about $500 a month for key phrases that end up yielding few click-throughs.

How do we know this is a scam?

Most customers who buy into this have NO WAY OF MEASURING the effectiveness, and often go months with 1-10 click-throughs per month. At these prices, this is often 100X the going rate for advertising under more potent key phrases on Overture or Google’s AdWords. In other terms, you could be generating 1000 click-throughs for every 10 you get with the “top promotion placements”.

You will notice that we have not named any particular organizations in this article. There are several reasons for this; most importantly, they change their names often. This article is not meant to be alarmist. Our only intent is to arm you with the knowledge to make good business investments.

 

Don’t use email addresses for logins!

Why?

From a discussion on Slashdot:

pros for using email as login:

1. guaranteed unique, though you’d be a fool to not have check.
2. users forget it slightly less
3. you have to send verification/password anyway

cons for using email as login:

1. What if a user has more than one email address?
2. Email addresses make reasonable unique keys, but slow indexes, especially since many are very similar
3. users may use disposable [spamgourmet.com] email addresses and suddenly you cannot contact them

However, if you read what prompted the discussion in the first place:

“CNet is running a story about how spammers and phishers can learn about our surfing habits to better target their attacks. According to the article, web sites that use e-mail addresses as IDs are vulnerable to attacks that could leak their users’ email addresses. These attacks are performed by requesting a password reminder for an address or trying to register with it.”

You begin to see other problems more related to security and privacy, rather than just design/implementation issues.

The best quote though:

“Here’s another one, and it ties into the original posting: it’s the same problem as using biometrics for identification: using an ID or password that’s hard to change. You don’t want to use that kind of ID casually, because you want to make sure that people who have your ID have an incentive to be at least as careful with it as you would be.

If you use your thumbprint to pay for a drink at a bar, how good a job do you think the bar is going to do about making sure someone else doesn’t game their sensor with a bit of latex on their fingertip? If someone steals your credit card, you can cancel it and get a new credit card. If someone steals your thumbprint you’re hosed.

This is the same kind of thing. If someone finds out that there’s someone with the handle “fishdan” on slashdot, they don’t have anything useful. If they have your email address, they have something useful that’s hard to change (look at me, I’m using year-tagged email addresses and I’m thinking of going to month tags). Plus, if you DO change your email address you have to change it EVERYWHERE (which is why I’ve got spam filters that reject entire countries for my main email address… because I’ve had it for about as long as personal domains have been available and I’m really loath to dump it).

And because of all this, what this means is that all email addresses have to be treated as disposable, even the supposedly private ones you use for account registration only. Which means that now your email address has the same problem as any other name: you have to remember a bunch of them, you have to remember where you used them, and if you only keep ’em long enough for the verification you can’t relogin with the old address.”

Ultimately, you can’t treat email addresses as a no-collision domain, and worse, you have to treat them as disposable.

 

Why Hit Counters Are Misleading

A client designer recently requested that we put a site counter on one of their customer’s websites. I asked why the customer didn’t just use the web server logs with a web analytics tool to see what the traffic for the site was. The answer was that no one had access to the raw log files, and the hosting provider used for this particular site didn’t generate any type of web reports. It turns out the customer is paying $500 a month for web advertising, and has no idea if it is even helping.

Having a site counter can be misleading, especially if you are using it to decide if your web advertising campaign is worth it or not. Why?

1) Hit counters are incremented for anything that visits your site, including web crawlers like Googlebot, Yahoo! Slurp, and MSNBot. MSNBot has been hitting many sites like crazy, and so a hit counter could reflect 1000 hits in one day, but they are all from nobody, effectively.

2) Hit counters only show a small part of the picture when it comes to traffic. There are many other metrics that need to be considered. Unique IP addresses, page views (impressions), and the source of click-throughs are just the tip of the iceberg. Your traffic could be coming from many sources, and assuming that a hit counter showing a high level of traffic means your web advertising campaign must be working is flawed thinking and could lead to many wasted dollars.

3) Hit counters can be manipulated. If someone goes to your site and hits the refresh button in their browser 10 times, guess what? The hit counter goes up by 10. Not very useful.
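To make the three points above concrete, here is a small Python sketch that compares a raw hit count with a count of unique, non-crawler IP addresses. The request records, IP addresses, and the list of crawler markers are all made up for illustration; a real log analyzer uses much more thorough crawler detection.

```python
# Hypothetical, simplified log records: (client IP, user-agent string).
requests = [
    # One person hitting refresh three times.
    ("203.0.113.5", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"),
    ("203.0.113.5", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"),
    ("203.0.113.5", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"),
    # Two search engine crawlers.
    ("198.51.100.7", "Mozilla/5.0 (compatible; Googlebot/2.1)"),
    ("198.51.100.8", "msnbot/1.0"),
    # One more real visitor.
    ("192.0.2.44", "Mozilla/5.0 (Windows; U) Gecko/20050915 Firefox/1.0.7"),
]

# Assumed substrings that identify the crawlers named in the article.
CRAWLER_MARKERS = ("googlebot", "slurp", "msnbot")


def is_crawler(user_agent: str) -> bool:
    """Return True if the user agent looks like one of the known crawlers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def hit_count(records) -> int:
    """What a naive hit counter reports: every request counts."""
    return len(records)


def unique_human_visitors(records) -> int:
    """Distinct IP addresses, ignoring requests from known crawlers."""
    return len({ip for ip, ua in records if not is_crawler(ua)})


print("Hit counter:", hit_count(requests))
print("Unique non-crawler IPs:", unique_human_visitors(requests))
```

In this made-up sample the hit counter reports six hits, while only two distinct people actually visited.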

So what is the solution?

All customers hosted by MagicLamp Networks have web reporting available to them as a small upgrade to their web hosting package. These web reports are generated by a tool called Wusage, and it is by far one of the most extensive reporting tools available.

If you have access to your web logs, there are several free tools available:

WebLog Expert Lite: I use this tool from time to time. Its numbers are a bit suspect, but if you stick with this as your primary tool, you can get reliable over-time statistics.

Analog: Very basic, but provides the most crucial numbers about your traffic. Don’t expect to figure out things like where people are stopping in your shopping or lead generation process, but again, it will get you started.

 

301 Redirects on IIS using ASP

If you have been reading posts about “Google’s Duplicate Content Penalty” as it applies to sites that can be reached at both www.mydomain.com and mydomain.com, where the common remedy is a 301-redirect, but you run your site on IIS, this article is for you. The 301 code cited in those posts was undoubtedly Apache-targeted, in the form of a sample .htaccess file, which isn’t really applicable to you.

In terms of options for 301-redirects on IIS, what I found on the web was interesting:
1) Pure IIS based redirects (via the IIS MMC Snap-in and the metabase)
2) ISAPI Redirect Filters – seamless integration into the webserver itself
3) ASP (woohoo!)

For IIS based redirects, the problem (if you don’t want to pay for an ISAPI filter) is that not only do you need admin-access-with-GUI-shell to the server, but you also have to create a separate instantiation of each site: one with the 301-redirect, and another with the real site. Doubling up the instantiations could press the limits of the metabase on 200+ domain servers (and this is for 2GHz 1GB RAM machines). We could solve the admin-access-with-GUI-shell problem with some swift ADSI code, but why bother?

In walked the solution:
http://evolvedcode.net/content/code_relocated/

This website is truly a resource for people doing ASP code, and I highly recommend reading through his other articles. Evolved Code’s samples will need some extensions for proper 301 redirects if you are handling more than one page that needs this functionality, but they will definitely get you started.

Another site with good example code for .NET is:
http://www.kevinleitch.co.uk/wp/?p=218
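The decision logic behind this kind of redirect is small enough to sketch. The example below is written in Python purely to illustrate the idea and is not taken from either of the sites above: the function name and the choice of www.mydomain.com as the preferred host are assumptions. In classic ASP, the same result is typically sent with Response.Status and Response.AddHeader "Location".

```python
from typing import Optional, Tuple

# Assumed preferred host name, using the article's example domain.
CANONICAL_HOST = "www.mydomain.com"


def canonical_redirect(host: str, path: str, query: str = "") -> Optional[Tuple[int, str]]:
    """Return (301, location) if the request used a non-preferred host.

    Returns None when the host is already the preferred one and the page
    should be served normally. The path and query string are carried over,
    so every page redirects to its own counterpart, not just the home page.
    """
    if host.lower() == CANONICAL_HOST:
        return None
    location = "http://" + CANONICAL_HOST + path
    if query:
        location += "?" + query
    return 301, location


print(canonical_redirect("mydomain.com", "/products.asp", "id=7"))
print(canonical_redirect("www.mydomain.com", "/products.asp", "id=7"))
```

Carrying the path and query string into the Location header is what lets one piece of code cover every page on the site.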

Happy Redirecting!

 

Why Usability Affects The Bottom Line

It is important to understand that the core strategy for most websites can be summarized as a formula, where

Visitors * Conversion Percentage * Order Price = Revenue

This could be translated to sites that are intended to produce leads.

Usability dramatically affects the conversion percentage of a site. Therefore, proper planning and design of the navigation and usage models are important to the bottom line of a website.

Consider the following example:
A designer decides to be “bleeding edge” and put all of the menus for navigation of the web site under a single, unlabeled dot. Visitors come to the site, can’t find anything, and leave. Never make your visitor have to learn how to navigate your website.

Vol 4: Turning Traffic into Dollars

In the previous articles addressing Search Engine Optimization and Web Advertising Campaigns (CPC Campaigns), we discussed overall strategies for bringing traffic to your website and a step-by-step list on how to get your site ready for search engines. In this article, we look at something just as important: understanding if your site is ready for paid advertising and CPC campaigns.

A common tactic to address slow web sales is to pay for advertising on various web networks (Google or Overture being the most pervasive). This, however, may not be the right “first thing” to actually do. To understand why, let’s look at a simple formula that describes web sales:

Traffic to the Site * Visitor Conversion To Customer * Average Order = Web Revenue

What does each part of the formula mean?

“Traffic to the Site” is just what it sounds like: if you have 1000 unique visitors (in web reporting this is known as Unique IP Addresses) coming to your site each month, then you can plug “1000” into that part of the formula. This number is not equivalent to “Hits” or “Page Views”, as these are not representative of actual people coming to your site, only of how many files or pages are being requested.

“Visitor Conversion To Customer” means the percentage at which your site converts a new, random, unique visitor into an actual, buying customer. If you have 100 people coming to your site in a month, and only one places an order, your conversion percentage is 1%. Conversion percentages for web sites vary dramatically, with most sites shooting for 2-5%.

“Average Order” is simply the averaged total (or subtotal) of web orders placed. If a site is selling widgets for $19.95, and the average order is for 2 widgets, then the average order subtotal would be $39.90.

So, what does all of this mean? We all learned in math that anything that is multiplied by zero is still zero. Applying this to the above formula shows that increasing web traffic to your site could still lead to a ZERO. If your conversion percentage is 0% (or something lower than 1%), paying for traffic into your site is a bad investment.

When does paying for traffic become worthwhile? This is dependent on your profit margins (or cost of goods), conversion percentage, and average web orders. To illustrate this, let’s start with the above example of selling widgets at $19.95, with two widgets being sold for each order, and assume the profit margin is 50%. That means for every order, $19.95 is profit. We also need a conversion percentage; let’s say 5%. This would mean for every 1000 visitors, 50 people actually buy something from you, for a total profit of $997.50. If the key words and phrases you want to advertise with cost you more than $997.50 per 1000 visitors, then you need to get the order average OR the conversion percentage up, in order to justify the advertising cost. If the cost of those 1000 visitors is lower than $997.50, then the potential exists for this to be a lucrative opportunity. Clearly, this is a simple analysis, and each business will have to modify the formula to match their business.
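The break-even arithmetic can be checked with a few lines of Python. This is only a sketch of the simple formula above, using the widget example’s figures; the function and variable names are made up for illustration.

```python
def profit_per_1000_visitors(conversion_rate: float,
                             average_order: float,
                             profit_margin: float) -> float:
    """Profit from 1000 visitors: orders placed * profit per order."""
    orders = 1000 * conversion_rate
    return orders * average_order * profit_margin


# Figures from the widget example: $19.95 widgets, two per order,
# 50% profit margin, 5% conversion percentage.
widget_price = 19.95
average_order = 2 * widget_price
break_even = profit_per_1000_visitors(0.05, average_order, 0.50)

print(f"Average order: ${average_order:.2f}")
print(f"Break-even ad spend per 1000 visitors: ${break_even:.2f}")
print(f"Break-even cost per visitor: ${break_even / 1000:.4f}")

# With a 0% conversion percentage, no amount of traffic produces profit.
print(f"Profit at 0% conversion: ${profit_per_1000_visitors(0.0, average_order, 0.50):.2f}")
```

Dividing the break-even figure by 1000 gives the most you can pay for each visitor, which makes it easy to compare against per-click prices.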

What do you do if your conversion percentage is lower than 1%, or you need to get your conversion percentage higher? The short answer is: find out why visitors are not buying. We understand, this is easier said than done. Your best bet in getting started is with customers you already have. Ask them what they liked and disliked about your site. Do not fall into the trap of jumping on the obvious answers or the first answer to come to you. Often times, there are multiple reasons why your conversion percentage is lower than it could be. The web reports your site generates automatically can also give you insight into why visitors are not buying, but analysis of these reports usually requires an experienced person who knows how to interpret the numbers.

Here is a short list of common issues we see that, once fixed, lead to higher conversion percentages:

1. Hard-to-use or unintuitive web site navigation
2. Low quality product pictures
3. Product descriptions meaningless or too short
4. Splash page for the front page
5. Heavy use of graphics makes pages download too slow
6. Shopping experience confusing or too many distractions

All of these can be summarized as: your website does not meet visitors’ expectations. Figure out what their expectations are, change the site to meet their expectations, and watch your website grow!