Q: What the !^.*$! is a "regular expression"?
- Baldemar
A: Regular expressions are programming constructs that look like #!?#@!!# comic-book
expletives and can be wondrously powerful tools.
Regular expressions are used to recognize patterns within textual
data. Their use has become so widespread that they appear in
configuration files, mail filters, text editors, and any number of
programming languages. Any application that acts on text may very well
harness their power.
Regular expressions evaluate text data and return an answer of true or
false. That is, either the expression correctly describes the data, or it
doesn't. What data the expression evaluates and what transpires after a
successful match depend entirely on the application. We might substitute
new text in the place of the text matched by a regular expression. We
might save the matched text in a variable for later use. We might execute
a new program when we see a correct match. And so on.
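To make those outcomes concrete, here is a minimal shell sketch, assuming the standard grep and sed utilities are available. The sample line and the variable names are made up for illustration; a real application would decide for itself what to do with a match.

```shell
# Sample input line (hypothetical data for illustration).
line='Subject: worms in the garden'

# 1. True or false: grep -q prints nothing and reports the answer
#    through its exit status (0 = match, 1 = no match).
if printf '%s\n' "$line" | grep -q 'worms'; then
    matched=yes
else
    matched=no
fi

# 2. Substitute new text in place of the text that matched.
replaced=$(printf '%s\n' "$line" | sed 's/worms/snails/')

# 3. Save matched text in a variable for later use.
#    \(...\) remembers what it matched; \1 refers back to it.
subject=$(printf '%s\n' "$line" | sed -n 's/^Subject: \(.*\)$/\1/p')

echo "$matched"
echo "$replaced"
echo "$subject"
```

The same expression language drives all three steps; only the action taken after the match changes.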
There are several variants, but all regular expressions consist of
characters to be matched as well as a series of special characters that
can be said to further describe the data. In Unix, the grep
utility is a simple starting point for understanding the work of regular
expressions. The expression can be a simple string, and the input data
can be a named list of files. Let's look at some examples, using grep
to get a sense of how regular expressions work.
Let's say we want to find all the <title> tags in a directory
of HTML files. The code would look like this:
grep -i '<title>' *.html
grep evaluates whether or not each line in each *.html
file matches the description <title>. If the line is a match,
then grep's standard behavior is to print out the file name and
the matching line.
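If you'd like to try this without a real site on hand, the following sketch builds a throwaway directory first. The file names and contents are invented for the example.

```shell
# Build a temporary directory with two small HTML files (made-up content).
dir=$(mktemp -d)
printf '%s\n' '<html>' '<TITLE>Worm Farming</TITLE>' '</html>' > "$dir/a.html"
printf '%s\n' '<html>' '<title>Garden Tools</title>' '</html>' > "$dir/b.html"

# -i ignores case, so <TITLE> and <title> both match.
# When given more than one file, grep prefixes each matching line
# with the name of the file it came from.
titles=$(cd "$dir" && grep -i '<title>' *.html)
echo "$titles"

# Clean up the temporary files.
rm -rf "$dir"
```

Each output line has the form file name, colon, matching line, one for each of the two title tags.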
Pretty soon, we'll want to ask more sophisticated questions of our
text data. We may want to add further restrictions and qualifications, or
we may want to make our expression more general. In short, we'll need to
start using regular expressions' set of descriptive "metacharacters."
Let's look at a few cases.
Placeholders and repetition:
Let's say our directory of HTML files has 100 files and 100
<title> tags, and we want to narrow our search a little to see only
the titles that make reference to "worms."
grep -i '<title>.*worms' *.html
We've introduced two new metacharacters. The "." means "any
character." The "*" means 0 or more instances of the previous character.
What we've said here is "match any line that contains a 'begin title' tag
followed by any number of characters, as long as the word 'worms' appears
before the end of the line." The "." is very important. If we'd said:
grep -i '<title>*worms' *.html
then we'd be looking for lines that looked like this:
<title>>>>>>>>>>>>>>>>>worms.
(The * character would be looking for 0 or more instances of
>, which is not very useful.)
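The difference is easy to see side by side. This is a small sketch over four invented lines, assuming an ordinary grep; the variable names are only for the demonstration.

```shell
# Four made-up lines to search (sample data, not from a real site).
sample=$(mktemp)
printf '%s\n' \
    '<title>All About Worms</title>' \
    '<title>Garden Tools</title>' \
    '<title>>>worms' \
    '<titleworms' > "$sample"

# ".*" means any run of characters, so other text may sit
# between the tag and the word "worms".
with_dot=$(grep -i '<title>.*worms' "$sample")

# "*" on its own applies to the ">" just before it:
# "<title", then zero or more ">", then "worms" immediately.
without_dot=$(grep -i '<title>*worms' "$sample")

echo "with the dot:";    echo "$with_dot"
echo "without the dot:"; echo "$without_dot"
rm -f "$sample"
```

With the dot, grep reports the real "All About Worms" title (plus the odd `<title>>>worms` line). Without it, the real title is missed, and only the two malformed lines match, including `<titleworms`, where "zero or more" means no `>` at all.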
Range:
We frequently find that we want to make our expressions much more
general. It would be quite inconvenient to enter 10 regular expressions
if we're only interested in matching any of the characters from 0 to 9.
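The range symbol [] allows us to conveniently group characters
together. For example, [0-9] matches any single digit. We can also use
[.*] to match either of those punctuation characters. (NOTE: Outside of
brackets, we put backslashes before dots and stars in order to turn off
their behavior as special characters. This is called "escaping" the
characters. Inside brackets, grep already treats dots and stars
literally, and a backslash there matches a literal backslash.)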
One especially powerful feature of the range function is the ability
to negate it. We can match "anything but" the list of characters. In
[^1234], the caret at the start of the range means "match any single
character other than 1, 2, 3, or 4."
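Here is a short sketch of ranges in action, again on invented sample lines. One assumption worth stating: in POSIX tools such as grep, a dot or star inside brackets is already literal, so the sketch writes [.*] with no backslashes.

```shell
# Made-up sample lines.
sample=$(mktemp)
printf '%s\n' 'room 7' 'room B' 'a.b' 'a*b' > "$sample"

# [0-9] matches any one digit, standing in for ten separate expressions.
digits=$(grep '[0-9]' "$sample")

# Inside brackets, grep treats "." and "*" as ordinary characters.
punct=$(grep '[.*]' "$sample")

# [^1234] matches ONE character that is not 1, 2, 3, or 4, so a line
# matches only if it contains at least one such character.
not_1_to_4=$(printf '%s\n' '1234' '1254' '4321' | grep '[^1234]')

echo "$digits"
echo "$punct"
echo "$not_1_to_4"
rm -f "$sample"
```

Note the last case: 1234 and 4321 are skipped because every character in them is on the excluded list, while 1254 gets through on the strength of its 5.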
Here's a useful example: Find all the hrefs that point to
URLs that mistakenly have a space in them. This example uses the extended
regular expressions of egrep (equivalent to grep -E).
egrep -i 'href="[^"]* [^"]*"' *.html
In other words, find the href lines that have a space between
the begin quote and end quote. We use the range operator here to signify
any character other than a quote.
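To watch this search work, we can run it against a couple of invented files. The sketch below uses grep -E, which is the same thing as egrep; the file names and links are hypothetical.

```shell
# Two made-up HTML files: two links with spaces in the URL, one without.
dir=$(mktemp -d)
printf '%s\n' \
    '<a href="my page.html">broken link</a>' \
    '<a href="good.html">fine link</a>' > "$dir/links.html"
printf '%s\n' '<A HREF="two words.html">also broken</A>' > "$dir/more.html"

# href=" then non-quote characters, a space, more non-quote characters,
# and the closing quote. The space must fall INSIDE the quotes.
bad_links=$(cd "$dir" && grep -E -i 'href="[^"]* [^"]*"' *.html)
echo "$bad_links"

rm -rf "$dir"
```

The "fine link" line is left alone even though it contains a space, because that space sits outside the quoted URL.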
Position:
There are two main characters that enable us to restrict our match to
a location within the string. We can match either the beginning (^)
or the end ($) of our input data; grep applies both to each line. This
is more useful than it might immediately seem.
For example, let's say we want to find the HTML tags that are not
closed before the line break.
egrep '<[^>]*$' *.html
In other words, we're looking for a "less than" followed by a
continuous chain of characters other than "greater thans" all the way to
the end of the line.
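A quick sketch shows both anchors at work, using three invented lines in which one tag is split across a line break.

```shell
# Made-up sample: the <a ...> tag is broken across two lines.
sample=$(mktemp)
printf '%s\n' \
    '<p>closed tag</p>' \
    '<a href="x.html"' \
    'class="y">text</a>' > "$sample"

# "$" anchors to the end of the line: a "<" followed only by
# non-">" characters until the line runs out.
unclosed=$(grep -E '<[^>]*$' "$sample")

# "^" anchors to the start of the line: lines that begin with "<".
starts_with_tag=$(grep '^<' "$sample")

echo "$unclosed"
echo "$starts_with_tag"
rm -f "$sample"
```

Only the second line has a tag still open when the line ends, while the first two lines both begin with a "less than."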
That should suffice as an introduction. The set of special descriptive
characters will differ across regular-expression implementations, but if
you keep in mind that their uses fall into a few basic categories, you'll
have no trouble learning them. Position, range, repetition, and
placeholders are the foundations of regular expressions.