Using Regular Expressions

This section describes regular expressions and provides information and examples for using them in HomeSite. The rules listed in this section are for creating regular expressions in HomeSite; the rules used by other RegExp parsers might differ.

An excellent reference on regular expressions is Mastering Regular Expressions by Jeffrey E.F. Friedl, published by O'Reilly & Associates, Inc.

About regular expressions

A regular expression is a pattern that defines a set of character strings. The RegExp parser in HomeSite evaluates the indicated text and returns each string that matches the pattern.

As in arithmetic expressions, you can use various operators to combine smaller expressions; simple regular expressions can be concatenated into more complex ones. For more information, see "Anchoring a regular expression to a string".

In HomeSite, you can use regular expressions for extended searches and for validating code.

For more information, also see "Using extended search commands" and "Validating Code".

Special characters

Because special characters are the operators in regular expressions, you must precede a special character with a backslash to represent it as an ordinary character. For example, to represent a backslash, use a double backslash (\\).
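The escaping rule can be tried outside HomeSite. The following is a minimal sketch in Python, whose re module follows the same backslash convention for these two cases; the sample strings and variable names are hypothetical and are not part of HomeSite.

```python
import re

# Hypothetical sample text containing two literal backslashes.
path = r"C:\temp\notes.txt"

# A literal backslash is written as a double backslash in the pattern.
backslash_count = len(re.findall(r"\\", path))

# An unescaped period is an operator (any character);
# an escaped period matches only a literal period.
any_char_hits = re.findall(r"a.c", "abc a.c")
literal_dot_hits = re.findall(r"a\.c", "abc a.c")

print(backslash_count)    # 2
print(any_char_hits)      # ['abc', 'a.c']
print(literal_dot_hits)   # ['a.c']
```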

Single-character regular expressions

This section describes the rules for creating regular expressions. You can use regular expressions in the Search > Extended Find and Replace command to match complex string patterns.

The following rules govern one-character RegExp that match a single character:

Character classes

You can specify a character by using a POSIX character class. Enclose the character class name in two pairs of square brackets, as in this Replace example:

("Macromedia's Web Site","[[:space:]]","*","ALL")

This code replaces all the spaces with *, producing this string:

Macromedia's*Web*Site
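The same replacement can be sketched in Python for comparison. Python's re module does not support the [[:space:]] syntax, so this sketch assumes an explicit bracket expression that lists the characters of the space class; the names POSIX_SPACE and replace_all are hypothetical.

```python
import re

# Explicit stand-in for [[:space:]]: space, tab, new line,
# vertical tab, form feed, and carriage return.
POSIX_SPACE = r"[ \t\n\v\f\r]"

def replace_all(text, pattern, replacement):
    """Replace every match of pattern in text (the "ALL" scope)."""
    return re.sub(pattern, replacement, text)

result = replace_all("Macromedia's Web Site", POSIX_SPACE, "*")
print(result)  # Macromedia's*Web*Site
```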

The following table shows the supported POSIX character classes:
Character Class   Matches
alpha             Any letter, [A-Za-z]
upper             Any uppercase letter, [A-Z]
lower             Any lowercase letter, [a-z]
digit             Any digit, [0-9]
alnum             Any alphanumeric character, [A-Za-z0-9]
xdigit            Any hexadecimal digit, [0-9A-Fa-f]
space             A tab, new line, vertical tab, form feed, carriage return, or space
print             Any printable character
punct             Any punctuation character: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ { | } ~
graph             Any character defined as a printable character except those defined as part of the space character class
cntrl             Any character not part of the character classes [:upper:], [:lower:], [:alpha:], [:digit:], [:punct:], [:graph:], [:print:], [:xdigit:]
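For parsers that lack POSIX class names, the bracketed equivalents in the table can be substituted mechanically. The following Python sketch assumes only the classes for which the table gives an explicit range or character list; the names POSIX_CLASSES and expand_posix are hypothetical.

```python
import re

# Explicit equivalents taken from the table above.
POSIX_CLASSES = {
    "alpha": "A-Za-z",
    "upper": "A-Z",
    "lower": "a-z",
    "digit": "0-9",
    "alnum": "A-Za-z0-9",
    "xdigit": "0-9A-Fa-f",
    "space": r" \t\n\v\f\r",
}

def expand_posix(pattern):
    """Rewrite each [:name:] inside a bracket expression as explicit ranges."""
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_CLASSES[m.group(1)],
                  pattern)

print(expand_posix("[[:alpha:][:digit:]_]"))                # [A-Za-z0-9_]
print(re.findall(expand_posix("[[:xdigit:]]+"), "ff 12 zz"))  # ['ff', '12']
```

A class name that is not in the dictionary raises KeyError, which keeps unsupported classes such as print or cntrl from being silently mistranslated.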

Multicharacter regular expressions

You can use the following rules to build multicharacter regular expressions:

Using back references

HomeSite supports back referencing, which allows you to match text in previously matched sets of parentheses. You can use a backslash followed by a digit n (\n) to refer to the nth parenthesized subexpression.

One example of how you can use back references is searching for doubled words, for example, to find instances of "is is" or "the the" in text. The following example shows the syntax you use for back referencing in regular expressions:

("There is is coffee in the the kitchen",
"([A-Za-z]+)[ ]+\1","*","ALL")

This code searches for words that are all letters ([A-Za-z]+), followed by one or more spaces ([ ]+), followed by the first matched subexpression in parentheses (\1). The parser detects the two occurrences of "is" as well as the two occurrences of "the" and replaces each pair with an asterisk, resulting in the following text:

There * coffee in * kitchen
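The doubled-word search behaves the same way in Python's re module, which uses the same \1 syntax. This is a minimal sketch, not HomeSite code; the variable names are hypothetical, and the \b word-boundary operator in the second pattern is a Python feature that this section does not list for HomeSite.

```python
import re

text = "There is is coffee in the the kitchen"

# Same pattern as above: a word, one or more spaces, then the same word again.
doubled = r"([A-Za-z]+)[ ]+\1"
result = re.sub(doubled, "*", text)
print(result)  # There * coffee in * kitchen

# Without word boundaries the pattern also matches inside words:
# in "this is fine", the "is" that ends "this" pairs with the next "is".
partial = re.sub(doubled, "*", "this is fine")
print(partial)  # th* fine

# Assumption: \b (word boundary) is available, as it is in Python.
bounded = r"\b([A-Za-z]+)[ ]+\1\b"
strict = re.sub(bounded, "*", "this is fine")
print(strict)  # this is fine
```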

Anchoring a regular expression to a string

You can anchor all or part of a regular expression to either the beginning or end of the string being searched:
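As an illustration of anchoring, the following Python sketch assumes the caret (^) and dollar sign ($) anchors used by most regular expression dialects; this section does not confirm the exact anchor characters for HomeSite, and the sample text is hypothetical.

```python
import re

text = "index.html links to main.html"

# ^ anchors the match to the beginning of the string.
at_start = re.findall(r"^[a-z]+", text)

# $ anchors the match to the end of the string.
at_end = re.findall(r"html$", text)

# Without an anchor, every occurrence is found.
anywhere = re.findall(r"html", text)

print(at_start)       # ['index']
print(at_end)         # ['html']
print(len(anywhere))  # 2
```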

Expression examples

The following table shows some regular expressions and describes what they match:
Expression                          Description
[\?&]value=                         A URL parameter named "value" in a URL
[A-Z]:(\\[A-Z0-9_]+)+               An uppercase DOS/Windows full path that is not the root of a drive, and that has only letters, numbers, and underscores in its text
(\+|-)?[1-9][0-9]*                  An integer that does not begin with a zero and has an optional sign
(\+|-)?[1-9][0-9]*(\.[0-9]*)?       A real number
(\+|-)?[1-9]\.[0-9]*E(\+|-)?[0-9]+  A real number in engineering notation
a{2,4}                              Two to four occurrences of "a": aa, aaa, aaaa
(ba){3,}                            At least three "ba" pairs: bababa, babababa, ...
("[A-Za-z]"){2,}                    At least two consecutive double-quoted single letters, such as "a""b" (the letters need not be the same, because the expression contains no back reference)
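Several of these expressions can be checked with a short script. The following sketch uses Python's re module, which accepts these particular expressions unchanged; the dictionary keys, helper name, and sample strings are hypothetical.

```python
import re

# Expressions copied from the table above.
EXAMPLES = {
    "url_param": r"[\?&]value=",
    "dos_path": r"[A-Z]:(\\[A-Z0-9_]+)+",
    "integer": r"(\+|-)?[1-9][0-9]*",
    "real": r"(\+|-)?[1-9][0-9]*(\.[0-9]*)?",
    "engineering": r"(\+|-)?[1-9]\.[0-9]*E(\+|-)?[0-9]+",
    "two_to_four_a": r"a{2,4}",
    "three_ba": r"(ba){3,}",
}

def matches_whole(name, text):
    """Return True if the named expression matches the entire text."""
    return re.fullmatch(EXAMPLES[name], text) is not None

print(matches_whole("dos_path", r"C:\WINDOWS\SYSTEM_32"))  # True
print(matches_whole("dos_path", "C:\\"))                   # False (root of a drive)
print(matches_whole("integer", "-42"))                     # True
print(matches_whole("integer", "007"))                     # False (leading zero)
print(matches_whole("engineering", "6.02E23"))             # True
print(matches_whole("three_ba", "baba"))                   # False (only two pairs)
```

The sketch uses fullmatch so that each sample must match in its entirety; a search for the URL parameter expression inside a longer URL would use re.search instead.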
