Lesson 3 | Regular expression reference |
Objective | Write a regular expression that will catch most common misspellings of your name. |
Regular Expression Reference
Perl regular expressions are based on the standard egrep-style (so-called version 8) regexps.
These regexes perform pattern matching based on a set of rules. The basic set of rules is explained in this lesson.
For the purpose of the examples in this discussion, we will use the simple form of Perl's pattern-matching operator, m//.
For a review of the matching operator, see the match operator lesson from Module 3.
Pattern-Matching Rules
There are a lot of details in this lesson that we will use later in the module. Be sure to read each of the paragraphs below as well as the linked pages from this lesson. In addition, we will apply the regular expressions discussed to the yes/no if structure we examined in the previous lesson.
Any single character matches itself, unless it is one of the recognized metacharacters.
- Perl Metacharacters Example:
Note: By now you have noticed that some characters in regexes have a special meaning. These are called metacharacters. The following are the metacharacters that Perl regular expressions recognize:
{} [] () ^ $ . | * + ? \
If you want to match the literal version of any of those characters, you must precede it with a backslash, \. As you go through the chapter, the meaning of these metacharacters will become clear.
For example,
/$15/
will not match this string, because the unescaped $ is not read as a literal dollar sign:
Can I borrow $15?
If any of the metacharacters are present in your expression, and you are specifically looking for that character, you will need to escape it with a backslash so that it is matched literally. In the above search example, use:
/\$15/
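The escaping rule can be tried out with a minimal sketch like the one below. The variable names are illustrative only. The second half uses Perl's built-in quotemeta function, which is not covered in this lesson, as one possible way to escape text held in a variable.

```perl
use strict;
use warnings;

# Single quotes keep the dollar sign literal inside the string itself.
my $text = 'Can I borrow $15?';

# Escaping the dollar sign in the pattern makes it match a literal "$".
my $escaped = ($text =~ /\$15/) ? 1 : 0;

# quotemeta() backslash-escapes every non-word character, which is handy
# when the text to search for is stored in a variable.
my $needle = quotemeta('$15?');
my $quoted = ($text =~ /$needle/) ? 1 : 0;

print "escaped pattern matched:   $escaped\n";
print "quotemeta pattern matched: $quoted\n";
```

Both checks should report a match, since each pattern looks for the literal characters rather than their special meanings.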
You can also use special metacharacters to match the beginning or end of a line or string.
- Perl Special Metacharacters:
The following special metacharacters have these special meanings:
^ matches the beginning of the line or string.
$ matches the end of the line or string.
When the special metacharacter ^ is used outside of a bracketed character class, it means "the beginning of a line or string." However, when ^ is the first character inside a bracketed character class, it negates the class, so the class matches any character that is not listed in it. Here is an example of how you would apply the special metacharacters to our yes/no if structure:
if($input=~/^[Yy](es)?$/)
{ print "Let's play!\n" }
else
{ print "Okay. Thanks anyway.\n" }
Let us examine the regular expression:
/^[Yy](es)?$/
The special metacharacter ^ matches the beginning of the string.
[Yy] matches 1 character from a set of either Y or y.
(es)? matches a pattern of es either 0 or 1 times.
$ matches the end of the string.
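To see how the two anchors constrain the match, here is a small sketch that runs the same pattern against several inputs. The helper name is_yes and the sample inputs are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the input is exactly y, Y, yes, or Yes.
sub is_yes {
    my ($input) = @_;
    return $input =~ /^[Yy](es)?$/ ? 1 : 0;
}

# Try a mix of inputs that should and should not be accepted.
for my $input ('y', 'Yes', 'yes', 'yesterday', 'oh yes', 'YES') {
    printf "%-10s %s\n", $input, is_yes($input) ? 'match' : 'no match';
}
```

Because ^ and $ pin the pattern to both ends, inputs such as yesterday and oh yes are rejected, and YES fails because (es)? only matches lowercase es.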
Brackets are used to create your own class of characters.
- Matching a Class of Characters:
Here are some examples of matching a class of characters:
[A-Z] will match any uppercase letter
[0-9] will match any digit
[Nn]o will match No or no
A negative class (anything except the class) can be created by using the ^ character as the first character inside the brackets.
[^A-Z] will match anything except an uppercase letter
[^0-9] will match any nondigit
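The classes above can be checked against a sample word with a short sketch. The word Perl5 and the variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $word = 'Perl5';

my $has_upper    = ($word =~ /[A-Z]/)    ? 1 : 0;  # "P" is an uppercase letter
my $has_digit    = ($word =~ /[0-9]/)    ? 1 : 0;  # "5" is a digit
my $has_nondigit = ($word =~ /[^0-9]/)   ? 1 : 0;  # "P", "e", "r", "l" are nondigits
my $all_digits   = ($word =~ /^[0-9]+$/) ? 1 : 0;  # fails: the word is not digits only

print "has uppercase: $has_upper\n";
print "has digit:     $has_digit\n";
print "has nondigit:  $has_nondigit\n";
print "all digits:    $all_digits\n";
```

Note that a class matches one character at a time, so [0-9] only says that some digit occurs; the last test adds anchors and a quantifier to require that every character is a digit.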
Now let us apply this to our yes/no if structure:
if($input=~/^[Yy]es\b/)
{ print "Let us play!\n" }
else
{ print "Okay. Thanks anyway.\n" }
[Yy] matches either Y or y, so the whole search matches Yes or yes.
You could also use /i to ignore case, so /^yes\b/i would match Yes, yes, YEs, yES, yeS, and YES.
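As a quick check of the /i modifier, the following sketch filters a list of sample inputs with the case-insensitive pattern. The input list is made up for this example.

```perl
use strict;
use warnings;

my @inputs = ('Yes', 'yES', 'YES', 'yes please', 'yesterday', 'no');

# Keep only the inputs that start with the word "yes" in any letter case.
my @accepted = grep { /^yes\b/i } @inputs;

print join(', ', @accepted), "\n";
```

Every capitalization of yes is accepted, yes please passes because yes is followed by a word boundary, and yesterday and no are filtered out.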
Perl Escape Characters
The following section contains a list of Perl escape characters. The ones that are the most immediately useful are the position escape characters and the character-range escape characters. You will also find a list of non-alphanumeric escape characters at the end of this page.
Positions
Character | Value |
\b | on a word boundary (i.e. either at the start or end of a word, depending on where the escape character appears in the regular expression) |
\B | not on a word boundary (i.e. only at a position inside a word or between two non-word characters) |
\A | beginning of string |
\Z | end of string (or just before a newline at the end of the string) |
\G | where previous m//g left off |
Example
Now let's apply the word boundary escape character to our yes/no if structure:
if($input=~/^yes\b/i)
{ print "Let us play!\n" }
else
{ print "Okay. Thanks anyway.\n" }
The \b tells the script to look for a word boundary, here the end of a word, after yes. The input must therefore begin with yes as a complete word, which lets us reject anything shorter or longer. In other words, y and yesterday would fail.
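The difference between \b and \B can be illustrated with a sketch that counts matches in one sample string. The string and variable names are assumptions for the example, and the counting idiom (assigning the match list to an empty list) is just one common way to count global matches.

```perl
use strict;
use warnings;

my $text = 'yes yesterday eyes';

# "yes" as a complete word: boundary on both sides.
my $whole_words = () = $text =~ /\byes\b/g;

# "yes" at the start of a longer word: boundary before, none after.
my $word_starts = () = $text =~ /\byes\B/g;

# "yes" at the end of a longer word: no boundary before, boundary after.
my $word_ends = () = $text =~ /\Byes\b/g;

print "whole words: $whole_words\n";
print "word starts: $word_starts\n";
print "word ends:   $word_ends\n";
```

Each pattern finds exactly one occurrence in this string: the standalone yes, the yes in yesterday, and the yes in eyes, respectively.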
Character classes
Character | Value |
\w | any alphanumeric (a..z, A..Z, 0..9, _) |
\W | any nonalphanumeric |
\s | any whitespace (\n \r \f \t \x20) |
\S | any non-whitespace |
\d | any digit |
\D | any nondigit |
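The character class escapes in the table can be combined in a short sketch like this one. The sample line and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $line = "order  42\tshipped";

# \s+ treats any run of spaces or tabs as a single separator.
my @fields = split /\s+/, $line;

# \d+ captures the first run of digits.
my ($number) = $line =~ /(\d+)/;

# \w covers letters, digits, and the underscore.
my $is_identifier = ('user_name1' =~ /^\w+$/) ? 1 : 0;

# \W finds anything outside that set, such as the hyphen here.
my $has_non_word = ('user-name' =~ /\W/) ? 1 : 0;

print "fields: ", join(' | ', @fields), "\n";
print "number: $number\n";
print "identifier: $is_identifier, non-word found: $has_non_word\n";
```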
Non-alphanumerics
Character | Value |
\nnn | an ASCII value in octal |
\xnn | an ASCII value in hexadecimal |
\cx | ASCII control-x |
\n | newline (\x0a) |
\r | carriage-return (\x0d) |
\f | form-feed (\x0c) |
\t | tab |
\a | alarm (beep, \x07) |
\e | escape (\x1b) |
\\ | backslash (\x5c) |
Regular Expressions
- The backslash (\) character is used to create special escape characters for matching some nonalphanumerics and classes of characters.
- The period (.) matches any character (except \n). To match a period itself, use \. or [.].
- Alternate matches can be specified using | to separate them.
- Within a pattern, you can specify subpatterns for later reference by enclosing them in parentheses. You can refer to those subpatterns later by using \1, \2, and so on, where the number n refers back to the nth subpattern. These are called back-references.
- You can repeat a pattern several times by following a character, class, or parenthesized expression with one of the repeat quantifiers.
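The rules for the period and for alternation can be sketched with a few file names. The names in the list are invented for the example.

```perl
use strict;
use warnings;

my @files = ('notes.txt', 'notes_txt', 'photo.jpg', 'readme.md');

# An unescaped period matches any character, so the underscore also passes.
my @loose = grep { /^notes.txt$/ } @files;

# An escaped period matches only a literal period.
my @strict = grep { /^notes\.txt$/ } @files;

# Alternation inside parentheses: either extension is accepted.
my @docs = grep { /\.(txt|md)$/ } @files;

print "loose:  @loose\n";
print "strict: @strict\n";
print "docs:   @docs\n";
```

The loose pattern accepts both notes.txt and notes_txt, the strict one accepts only notes.txt, and the alternation picks out notes.txt and readme.md.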
Perl Back References
In the following example, I created a subpattern of the W from my first name to use in matching the W in my last name:
/([Ww])illiam \1einman/
William Weinman will match
william weinman will match
William weinman will not match
If the expression in parentheses matched a capital W, then \1 will only match another capital W, and vice versa.
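A minimal sketch of the back-reference behavior follows. The helper name same_initials is hypothetical, and anchors were added to the pattern so that the whole input has to be the name.

```perl
use strict;
use warnings;

# Hypothetical helper: both initials must use the same letter case,
# because \1 repeats exactly what the first group captured.
sub same_initials {
    my ($name) = @_;
    return $name =~ /^([Ww])illiam \1einman$/ ? 1 : 0;
}

for my $name ('William Weinman', 'william weinman', 'William weinman') {
    printf "%-16s %s\n", $name, same_initials($name) ? 'match' : 'no match';
}
```

The first two names match because both initials agree in case; the third fails because \1 holds a capital W while the last name starts with a lowercase w.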
- Relative backreferences:
Counting the opening parentheses to get the correct number for a backreference is error-prone as soon as there is more than one capturing group.
A more convenient technique became available with Perl 5.10: relative backreferences. To refer to the immediately preceding capture group, you can now write \g{-1}; the one before that is available via \g{-2}, and so on. Besides readability and maintainability, another good reason for using relative backreferences is illustrated by the following example, where a simple pattern for matching peculiar strings is used:
$a99a = '([a-z])(\d)\g2\g1'; # matches a11a, g22g, x33x, etc.
Now that we have this pattern stored as a handy string, we might feel tempted to use it as a part of some other pattern:
$line = "code=e99e";
if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior!
print "$1 is valid\n";
} else {
print "bad line: '$line'\n";
}
But this doesn't match, at least not the way one might expect. Only after inserting the interpolated $a99a and looking at the resulting full text of the regexp is it obvious that the backreferences have backfired. The subexpression (\w+) has snatched number 1 and demoted the groups in $a99a by one rank. This can be avoided by using relative backreferences:
$a99a = '([a-z])(\d)\g{-1}\g{-2}'; # safe for being interpolated
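The two versions of the pattern can be compared side by side in a sketch like the following, which requires Perl 5.10 or later. The variable names $absolute and $relative are introduced here only for the comparison.

```perl
use strict;
use warnings;

# Same sub-pattern written with absolute and with relative backreferences.
my $absolute = '([a-z])(\d)\g2\g1';
my $relative = '([a-z])(\d)\g{-1}\g{-2}';

my $line = "code=e99e";

# With (\w+) in front, \g1 now refers to "code", so this fails.
my $abs_ok = ($line =~ /^(\w+)=$absolute$/) ? 1 : 0;

# Relative references still point at the two groups just before them.
my $rel_ok = ($line =~ /^(\w+)=$relative$/) ? 1 : 0;

print "absolute backreferences matched: $abs_ok\n";
print "relative backreferences matched: $rel_ok\n";
```

Only the relative version matches, because \g{-1} and \g{-2} are counted backward from their own position rather than from the start of the combined pattern.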
Perl Repeat Quantifiers
Here is a list of the repeat quantifiers:
Character | Value |
? | zero times or one time |
* | zero or more times |
+ | one or more times |
{x} | exactly x times |
{x,y} | x to y times |
{x,} | x or more times |
For example,
/(foo )?fi fum/   # matches foo fi fum or fi fum
/^\s+/            # matches a line with leading whitespace
/(\bif\b\s*){2,}/ # matches a line with repeated ifs
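The quantifiers from the table can be exercised with a sketch such as this one. The sample strings are invented. Note one assumption: for the repeated-if check, this sketch adds \s* inside the group so that the whitespace between the words can be consumed, since a group containing only \bif\b cannot repeat back to back.

```perl
use strict;
use warnings;

my $optional = ('fi fum'      =~ /^(foo )?fi fum$/) ? 1 : 0;  # ? allows zero times
my $leading  = ('   indented' =~ /^\s+/)            ? 1 : 0;  # + needs at least one
my $exact    = ('2024'        =~ /^\d{4}$/)         ? 1 : 0;  # exactly four digits
my $range    = ('12345'       =~ /^\d{2,4}$/)       ? 1 : 0;  # five digits is too many
my $at_least = ('12345'       =~ /^\d{2,}$/)        ? 1 : 0;  # two or more digits

# Repeated "if" words, with \s* added so the separating spaces are matched.
my $repeated = ('if if then' =~ /(\bif\b\s*){2,}/) ? 1 : 0;

# Without anything to match the space, the group cannot repeat.
my $no_space = ('if if then' =~ /(\bif\b){2,}/) ? 1 : 0;

print "optional=$optional leading=$leading exact=$exact\n";
print "range=$range at_least=$at_least\n";
print "repeated=$repeated no_space=$no_space\n";
```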
All of these quantifiers are greedy by default. That is, they will match the maximum number of characters that still allows the overall expression to match.
You can change any of them to become nongreedy by using a ? immediately after the quantifier. For example,
/^(.*)\s.*/
This will put all of the characters up to the last whitespace in the subexpression.
/^(.*?)\s.*/
This will put all of the characters up to the first whitespace in the subexpression.
The concept of greediness will become more important when you learn about the substitution operator later in this module.
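Greedy and nongreedy matching can be compared directly with a short sketch. The sample line is made up for the example.

```perl
use strict;
use warnings;

my $line = 'one two three';

# Greedy: .* grabs as much as possible, so the capture ends at the last space.
my ($greedy) = $line =~ /^(.*)\s.*/;

# Nongreedy: .*? grabs as little as possible, so it stops at the first space.
my ($lazy) = $line =~ /^(.*?)\s.*/;

print "greedy:    '$greedy'\n";
print "nongreedy: '$lazy'\n";
```

The greedy capture holds one two, while the nongreedy capture holds only one.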
- Using repeat quantifiers:
Now that we have more tools, let us apply repeat quantifiers to our yes/no if structure:
if($input=~/^[Yy](es)?\b/)
{ print "Let us play!\n" }
else
{ print "Okay. Thanks anyway.\n" }
In this example, (es)? matches "es" in an expression 0 or 1 times.
So the whole expression would match Y, y, Yes, or yes.
Notice how much cleaner this is than the "or" structure you looked at in the alternative matching example:
if($input=~/^[Yy](es)?\b/)
versus
if($input=~/^([Yy]|[Yy]es)\b/)
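One detail worth checking when writing the "or" form is that | separates everything on its left from everything on its right, so anchors such as ^ and \b apply to a single alternative unless the alternatives are grouped. The sketch below compares an ungrouped alternation, a grouped one, and the quantifier form. The pattern names and inputs are assumptions for illustration.

```perl
use strict;
use warnings;

my $ungrouped  = qr/^[Yy]|[Yy]es\b/;     # read as: (^[Yy]) or ([Yy]es\b)
my $grouped    = qr/^([Yy]|[Yy]es)\b/;   # anchors apply to both alternatives
my $quantified = qr/^[Yy](es)?\b/;       # the shorter quantifier form

# Collect the 0/1 results per pattern so they can be compared.
my %result;
for my $input ('y', 'Yes', 'yesterday', 'oh yes') {
    $result{$input} = [ map { $input =~ $_ ? 1 : 0 } $ungrouped, $grouped, $quantified ];
    printf "%-10s ungrouped=%d grouped=%d quantified=%d\n", $input, @{ $result{$input} };
}
```

The grouped and quantified patterns agree on every input, while the ungrouped pattern also accepts yesterday and oh yes.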
Perl Spell Check Name - Exercise