Lesson 3	Regular expression reference
Objective	Write a regular expression that will catch most common misspellings of your name.

Regular Expression Reference

Perl regular expressions are based on the standard egrep-style (so-called version 8) regexps. These regexes perform pattern matching based on a set of rules, and this lesson explains the basic set. For the examples in this discussion, we will use the simple form of Perl's pattern-matching operator (m//).

For a review of the matching operator, see The match operator lesson from Module 3.

The basic set of rules for egrep-style (extended) regular expressions, which Perl's regular expressions are based on, includes the following elements:

Literal Characters: Most characters match themselves. For example, the regex a matches the character 'a'.
Metacharacters: Certain characters have special meanings and are used to define the structure of the regex. These include:
- . (dot): Matches any single character except a newline.
- ^ (caret): Anchors the match to the start of the string.
- $ (dollar): Anchors the match to the end of the string.
- | (pipe): Acts as an OR operator. For example, a|b matches 'a' or 'b'.
- () (parentheses): Groups expressions together and captures the text matched by the enclosed pattern.
- [] (square brackets): Defines a character class, matching any one of the characters inside. For example, [abc] matches 'a', 'b', or 'c'.
- {} (curly braces): Specifies a specific number of occurrences. For example, a{3} matches 'aaa'.
- * (asterisk): Matches zero or more occurrences of the preceding element.
- + (plus): Matches one or more occurrences of the preceding element.
- ? (question mark): Matches zero or one occurrence of the preceding element.
Character Classes:
- [abc]: Matches any one of the characters 'a', 'b', or 'c'.
- [^abc]: Matches any character except 'a', 'b', or 'c'.
- [a-z]: Matches any character in the range from 'a' to 'z'.
- \d: Matches any digit (equivalent to [0-9]).
- \D: Matches any non-digit character.
- \w: Matches any word character (equivalent to [a-zA-Z0-9_]).
- \W: Matches any non-word character.
- \s: Matches any whitespace character (space, tab, newline, etc.).
- \S: Matches any non-whitespace character.
Quantifiers:
- *: Matches zero or more occurrences of the preceding element.
- +: Matches one or more occurrences of the preceding element.
- ?: Matches zero or one occurrence of the preceding element.
- {n}: Matches exactly n occurrences of the preceding element.
- {n,}: Matches n or more occurrences of the preceding element.
- {n,m}: Matches between n and m occurrences of the preceding element.
Anchors:
- ^: Asserts the position at the start of the string.
- $: Asserts the position at the end of the string.
- \b: Asserts a word boundary.
- \B: Asserts a position that is not a word boundary.
Escaping: To match a metacharacter literally, you need to escape it with a backslash (\). For example, to match a literal dot, you would write \. in the pattern.
Grouping and Capturing:
- (abc): Groups the characters 'abc' and captures the matched text.
- (?:abc): Groups the characters 'abc' but does not capture the matched text (non-capturing group).
Alternation:
- a|b: Matches either 'a' or 'b'.
Backreferences:
- \1, \2, etc.: Refers to the text captured by the first, second, etc., capturing group.

These rules form the foundation of egrep-style regular expressions, which are extended and enhanced in Perl to provide even more powerful pattern-matching capabilities.
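As a rough illustration of how these elements combine, the sketch below checks a few made-up strings against a pattern for a date-like token. The pattern, the variable names, and the sample strings are assumptions chosen for this example only.

```perl
use strict;
use warnings;

# Hypothetical pattern: a four-digit year, a dash, and a two-digit month,
# optionally followed by a dash and a two-digit day.
# Uses anchors (^ $), \d, {n} quantifiers, capturing groups, a
# non-capturing group (?:...), and the ? quantifier.
my $date_re = qr/^(\d{4})-(\d{2})(?:-(\d{2}))?$/;

my @results;
for my $text ('2024-05', '2024-05-17', '24-05', '2024-5-17x') {
    if ($text =~ $date_re) {
        # $1 and $2 always hold text here; $3 is undef when the day is absent
        push @results, "$text: year=$1 month=$2 day=" . ($3 // 'none');
    } else {
        push @results, "$text: no match";
    }
}
print "$_\n" for @results;
```

The first two strings match and the last two do not, because \d{4} and \d{2} demand exact digit counts and the anchors forbid extra characters.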

Pattern-Matching Rules

There are a lot of details in this lesson that we will be using later on in the module. Be sure to read each of the paragraphs below as well as the linked pages from this lesson. In addition, we will apply the regular expressions discussed to the yes/no if structure we examined in the previous lesson.

Any single character matches itself, unless it is one of the recognized metacharacters.

Perl Metacharacters Example:
Note: By now you have noticed that some characters in regexes have a special meaning. These are called metacharacters. The following are the metacharacters that Perl regular expressions recognize:
```
{} [] () ^ $ . | * + ? \
```
If you want to match the literal version of any of those characters, you must precede it with a backslash, \. As you go through the chapter, the meaning of these metacharacters will become clear.
For example,
```
/$15/
```
will not match the string:
Can I borrow $15?

The unescaped $ is treated as a metacharacter rather than as a literal dollar sign. If any of the metacharacters are present in your expression, and you are specifically looking for that character, you need to escape it so that it is matched literally. In the above search example, use:
/\$15/
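The following sketch shows the escaped form at work. It also uses Perl's built-in quotemeta function and the \Q...\E escape, which are not covered in this lesson; they are included here only as one possible way to escape metacharacters held in a variable. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = 'Can I borrow $15?';   # single quotes: no variable interpolation

# Escaped dollar sign: matches the literal characters "$15".
my $escaped = ($text =~ /\$15/) ? 1 : 0;

# quotemeta (or \Q...\E inside a pattern) puts a backslash in front of
# every non-word character, which helps when the text to find is stored
# in a variable.
my $needle = '$15?';
my $quoted = quotemeta($needle);
my $via_q  = ($text =~ /\Q$needle\E/) ? 1 : 0;

print "escaped match: $escaped\n";
print "quotemeta gives: $quoted\n";
print "\\Q...\\E match: $via_q\n";
```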
You can also use special metacharacters to match the beginning or end of a line or string.
Perl Special Metacharacters:
The following special metacharacters have these special meanings:
1. ^ matches the beginning of the line or string.
2. $ matches the end of the line or string.
When the special metacharacter ^ is used outside of a bracketed character class, it means "the beginning of a line or string." However, when ^ is the first character inside a bracketed character class, it negates the class, so the class matches any character that is not listed. Here is an example of how you would apply the special metacharacters to our yes/no if structure:
```
if($input=~/^[Yy](es)?$/)
   { print "Let's play!\n" }
else
   { print "Okay. Thanks anyway.\n" }
```
Let us examine the regular expression:
```
/^[Yy](es)?$/
```
1. ^ matches the beginning of the string.
2. [Yy] matches one character from the set Y or y.
3. (es)? matches the pattern es either 0 or 1 times.
4. $ matches the end of the string.
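To see the effect of the two anchors, the sketch below runs the same pattern over a handful of sample answers. The helper name accepts_yes and the sample inputs are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Returns 1 when the answer is exactly Y, y, Yes, or yes; 0 otherwise.
# (accepts_yes is a hypothetical helper name used only in this sketch.)
sub accepts_yes {
    my ($input) = @_;
    return $input =~ /^[Yy](es)?$/ ? 1 : 0;
}

for my $input ('y', 'Yes', 'yes!', 'yesterday', 'oh yes') {
    printf "%-10s => %s\n", $input, accepts_yes($input) ? 'match' : 'no match';
}
```

Only the first two inputs match: the $ anchor rejects trailing text such as "!" or "terday", and the ^ anchor rejects anything before the y.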
Brackets are used to create your own class of characters.
Matching a Class of Characters:
Here are some examples of matching a class of characters:
1. [A-Z] will match any uppercase character
2. [0-9] will match any digit
3. [Nn]o will match No or no
A negative class (anything except the class) can be created by using the ^ character.
1. [^A-Z] will match anything except an uppercase character
2. [^0-9] will match any nondigit
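A small sketch of these classes in action follows. The sample string and variable names are invented for the example, and the /g modifier in list context (used here to collect every match) is an addition that this lesson does not cover.

```perl
use strict;
use warnings;

my $text = 'Room 42B, No smoking';

# In list context, m//g returns every match of the pattern.
my @upper  = $text =~ /[A-Z]/g;          # each uppercase letter
my @digits = $text =~ /[0-9]/g;          # each digit
my @others = $text =~ /[^A-Za-z0-9]/g;   # negated class: not a letter or digit
my $has_no = ($text =~ /[Nn]o/) ? 1 : 0; # No or no anywhere in the string

print "uppercase: @upper\n";
print "digits: @digits\n";
print "other characters: ", scalar(@others), "\n";
print "contains No/no: $has_no\n";
```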
Now let us apply this to our yes/no if structure:
```
if($input=~/^[Yy]es\b/)
   { print "Let us play!\n" }
else
   { print "Okay. Thanks anyway.\n" }
```
[Yy] matches either Y or y, so the whole search matches Yes or yes.
You could also use /i to ignore case, so
```
/^yes\b/i
```
would match Yes, yes, YEs, yES, yeS, YES, and every other capitalization of yes.

Perl Escape Characters

The following section contains a list of Perl escape characters. The most immediately useful are the position escape characters and the character-class escape characters. You will also find a list of non-alphanumeric escape characters at the end of this page.
Positions

Character	Value
`\b`	on a word boundary (the start or end of a word, depending on where the escape character appears in the regular expression)
`\B`	not on a word boundary (for example, between two letters inside a word)
`\A`	beginning of string
`\Z`	end of string (or just before a final newline)
`\G`	where previous m//g left off
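The position escapes \A, \Z, and \G are easiest to understand by watching them work. The sketch below is an illustration only: the sample strings and variable names are assumptions, and it uses the /g and /c modifiers and the pos function, which this lesson does not cover.

```perl
use strict;
use warnings;

# \A and \Z: the whole string must be "yes", allowing one trailing newline.
my $whole_string = ("yes\n" =~ /\Ayes\Z/) ? 1 : 0;

# \G: each m//g match starts exactly where the previous one stopped.
my $text = "10 20 abc";
my @numbers;
while ($text =~ /\G(\d+)\s*/gc) {   # /c keeps the position after a failed match
    push @numbers, $1;
}
my $stopped_at = pos($text);        # offset where the run of numbers ended

print "whole string: $whole_string\n";
print "numbers: @numbers; stopped at offset $stopped_at\n";
```

The loop stops at "abc" because \G does not let the pattern skip ahead to look for more digits later in the string.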

Example
Now let's apply the word boundary escape character to our yes/no if structure:

if($input=~/^yes\b/i)
{ print "Let us play!\n" }
else
{ print "Okay. Thanks anyway.\n" }

The \b tells the script to look for a word boundary immediately after yes, so the match succeeds only when yes is a complete word at the start of the input. Shorter and longer words are rejected: y and yesterday both fail.
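The following sketch runs the pattern against several sample inputs to show where the word boundary matters. The inputs and the %result hash are made up for this example.

```perl
use strict;
use warnings;

my %result;
for my $input ('yes', 'Yes, please', 'yesterday', 'y', 'eyes') {
    # ^ pins the match to the start; \b requires "yes" to end at a word boundary
    $result{$input} = ($input =~ /^yes\b/i) ? 'match' : 'no match';
}

printf "%-12s => %s\n", $_, $result{$_} for sort keys %result;
```

Note that "Yes, please" matches, since the pattern has no $ anchor and the comma after Yes creates a word boundary.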

Character classes

Character	Value
`\w`	any alphanumeric (a..z, A..Z, 0..9, _)
`\W`	any nonalphanumeric
`\s`	any whitespace (\n \r \f \t \x20)
`\S`	any non-whitespace
`\d`	any digit
`\D`	any nondigit
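As one possible use of these class escapes, the sketch below picks a name and a number out of a "name = value" line. The helper name parse_setting and the sample lines are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# parse_setting is a hypothetical helper: it returns (name, value) when the
# line looks like "word = digits", and an empty list otherwise.
sub parse_setting {
    my ($line) = @_;
    if ($line =~ /^(\w+)\s*=\s*(\d+)\s*$/) {
        return ($1, $2);
    }
    return;
}

my @ok  = parse_setting("timeout = 30\n");
my @bad = parse_setting("time-out = thirty");

print "ok: @ok\n";
print "bad: ", scalar(@bad), " fields\n";
```

The second line fails twice over: the dash is not a \w character, and "thirty" contains no \d characters.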

Non-alphanumerics

Character	Value
`\nnn`	an ASCII value in octal
`\xnn`	an ASCII value in hexadecimal
`\cx`	ASCII control-x
`\n`	newline (`\x0a`)
`\r`	carriage-return (`\x0d`)
`\f`	form-feed (`\x0c`)
`\t`	tab
`\a`	alarm (beep, `\x07`)
`\e`	escape (`\x1b`)
`\\`	backslash (`\x5c`)
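Several of these escapes are different spellings of the same character. The sketch below, using made-up sample strings, checks that the named, hexadecimal, octal, and control forms of the tab character all match the same text, and that a doubled backslash matches a single literal backslash.

```perl
use strict;
use warnings;

my $record = "name\tWilliam\n";   # double quotes: \t and \n become real characters

my $named   = ($record =~ /\t/)   ? 1 : 0;   # tab by name
my $hex     = ($record =~ /\x09/) ? 1 : 0;   # tab as hexadecimal 09
my $octal   = ($record =~ /\011/) ? 1 : 0;   # tab as octal 011
my $control = ($record =~ /\cI/)  ? 1 : 0;   # tab as control-I

my $path      = 'C:\temp';                   # single quotes keep the backslash
my $backslash = ($path =~ /\\/) ? 1 : 0;     # \\ matches one literal backslash

print "tab forms: $named $hex $octal $control\n";
print "backslash: $backslash\n";
```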

Regular Expressions

The backslash (\) character is used to create special escape characters for matching some nonalphanumerics and classes of characters.
The period (.) matches any character (except \n). To match a period itself, use \. or [.].
Alternate matches can be specified using | to separate them.
Within a pattern, you can specify subpatterns for later reference by enclosing them in parentheses. You can refer to those subpatterns later by using \n, where n refers back to the nth subpattern. These are called back-references.
You can repeat a pattern several times by following a character, class, or parenthesized expression with one of the repeat quantifiers listed later in this lesson.

Perl Back References

In the following example, I created a subpattern of the W from my first name to use in matching the W in my last name:

/([Ww])illiam \1einman/

William Weinman will match
william weinman will match
William weinman will not match
If the expression in parentheses matched a capital W, then \1 will only match another capital W, and vice versa.
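The three cases can be checked with a short script. This is only a sketch; the variable names are chosen for the example.

```perl
use strict;
use warnings;

# \1 must repeat exactly what ([Ww]) captured, including its case.
my $name_re = qr/([Ww])illiam \1einman/;

my %outcome;
for my $name ('William Weinman', 'william weinman', 'William weinman') {
    $outcome{$name} = ($name =~ $name_re) ? 'match' : 'no match';
    print "$name: $outcome{$name}\n";
}
```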

Relative backreferences:
Counting the opening parentheses to get the correct number for a backreference is error-prone as soon as there is more than one capturing group. A more convenient technique became available with Perl 5.10: relative backreferences. To refer to the immediately preceding capture group one may now write \g{-1}, the group before that is available via \g{-2}, and so on. Another good reason for using relative backreferences, in addition to readability and maintainability, is illustrated by the following example, where a simple pattern for matching peculiar strings is used:
```
$a99a = '([a-z])(\d)\g2\g1';   # matches a11a, g22g, x33x, etc.
```
Now that we have this pattern stored as a handy string, we might feel tempted to use it as a part of some other pattern:
```
$line = "code=e99e";
if ($line =~ /^(\w+)=$a99a$/){   # unexpected behavior!
 print "$1 is valid\n";
} else {
 print "bad line: '$line'\n";
}
```
But this doesn't match, at least not the way one might expect. Only after inserting the interpolated $a99a and looking at the resulting full text of the regexp is it obvious that the backreferences have backfired. The subexpression (\w+) has snatched number 1 and demoted the groups in $a99a by one rank. This can be avoided by using relative backreferences:
```
$a99a = '([a-z])(\d)\g{-1}\g{-2}';  # safe for being interpolated
```
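To compare the two versions side by side, the sketch below interpolates each into the same larger pattern. The variable names $absolute and $relative are assumptions introduced for this example; the patterns themselves are the ones shown above.

```perl
use strict;
use warnings;

my $absolute = '([a-z])(\d)\g2\g1';        # numbering breaks when interpolated
my $relative = '([a-z])(\d)\g{-1}\g{-2}';  # counts back from its own position

my $line = "code=e99e";

# Inside the larger pattern, (\w+) becomes group 1, so \g2 and \g1 in
# $absolute now point at ([a-z]) and (\w+) instead of (\d) and ([a-z]).
my $abs_ok = ($line =~ /^(\w+)=$absolute$/) ? 1 : 0;
my $rel_ok = ($line =~ /^(\w+)=$relative$/) ? 1 : 0;
my $key    = $rel_ok ? $1 : '';   # $1 comes from the last successful match

print "absolute backreferences: ", ($abs_ok ? 'match' : 'no match'), "\n";
print "relative backreferences: ", ($rel_ok ? "match, key=$key" : 'no match'), "\n";
```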

Perl Repeat Quantifiers

Here is a list of the repeat quantifiers:

Character	Value
`?`	zero times or one time
`*`	zero or more times
`+`	one or more times
`{x}`	exactly x times
`{x,y}`	x to y times
`{x,}`	x or more times

For example,

/(foo )?fi fum/
 # matches foo fi fum or fi fum

/^\s+/
 # matches a line with leading whitespace

/(\bif\b\s*){2,}/
 # matches a line with two or more consecutive ifs
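The sketch below exercises an optional group, a + quantifier, and a counted range. The sample strings and the \d{3,5} pattern are assumptions added for illustration.

```perl
use strict;
use warnings;

# (foo )? makes the leading "foo " optional.
my $without_foo = ('fi fum'     =~ /^(foo )?fi fum$/) ? 1 : 0;
my $with_foo    = ('foo fi fum' =~ /^(foo )?fi fum$/) ? 1 : 0;

# \s+ needs at least one whitespace character at the start.
my $indented = ('  indented' =~ /^\s+/) ? 1 : 0;
my $flush    = ('flush left' =~ /^\s+/) ? 1 : 0;

# {3,5} accepts three, four, or five digits and nothing else.
my @lengths = map { /^\d{3,5}$/ ? 1 : 0 } ('12', '1234', '123456');

print "optional group: $without_foo $with_foo\n";
print "leading whitespace: $indented $flush\n";
print "three to five digits: @lengths\n";
```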

All of these quantifiers are greedy by default. That is, they will match the maximum number of characters that will not break the expression. You can change any of them to become nongreedy by using a ? immediately after the quantifier. For example,

/^(.*)\s.*/

This will put all of the characters up to the last whitespace in the subexpression.

/^(.*?)\s.*/

This will put all of the characters up to the first whitespace in the subexpression. The concept of greediness will become more important when you learn about the substitution operator later in this module.
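The difference between the two patterns shows up clearly when both are applied to the same string. The sample text and variable names below are made up for this sketch.

```perl
use strict;
use warnings;

my $text = 'the quick brown fox';

# In list context a match returns its captured groups.
my ($greedy) = $text =~ /^(.*)\s.*/;    # .* grabs as much as it can
my ($lazy)   = $text =~ /^(.*?)\s.*/;   # .*? grabs as little as it can

print "greedy: '$greedy'\n";
print "nongreedy: '$lazy'\n";
```

The greedy capture runs to the last space ('the quick brown'), while the nongreedy capture stops at the first one ('the').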

Using repeat quantifiers: Now that we have more tools, let us apply repeat quantifiers to our Yes/no if structure:
```
if($input=~/^[Yy](es)?\b/)
   { print "Let us play!\n" }
else
   { print "Okay. Thanks anyway.\n" }
```
In this example, (es)? matches "es" in an expression 0 or 1 times. So the whole expression would match Y, y, Yes, or yes. Notice how much cleaner this is than the "or" structure you looked at in the alternative matching example:
```
if($input=~/^[Yy](es)?\b/)
```
versus
```
if($input=~/^([Yy]|[Yy]es)\b/)
```
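The sketch below compares the quantifier form with an alternation. It assumes one grouped spelling of the "or" form, /^([Yy]|[Yy]es)\b/, and adds an ungrouped variant to show why the parentheses matter: | has very low precedence, so without grouping the ^ and \b each apply to only one branch. The sample inputs and hash names are invented for the example.

```perl
use strict;
use warnings;

my @inputs = ('y', 'Yes', 'yesterday', 'yoyo');
my (%quantified, %grouped, %ungrouped);

for my $input (@inputs) {
    $quantified{$input} = $input =~ /^[Yy](es)?\b/     ? 1 : 0;
    $grouped{$input}    = $input =~ /^([Yy]|[Yy]es)\b/ ? 1 : 0;
    # Ungrouped: reads as "^[Yy]" OR "[Yy]es\b", so any leading y matches.
    $ungrouped{$input}  = $input =~ /^[Yy]|[Yy]es\b/   ? 1 : 0;
}

printf "%-10s quantified=%d grouped=%d ungrouped=%d\n",
    $_, $quantified{$_}, $grouped{$_}, $ungrouped{$_} for @inputs;
```

The quantified and grouped patterns agree on every input, while the ungrouped alternation also accepts yesterday and yoyo.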


Perl Spell Check Name - Exercise

Click the exercise link below to write a regular expression that will catch misspellings of your name.
Perl Spell Check Name - Exercise