Regex Matching
Basic Outline
- Convert regex into NFA or DFA representation
- simulate DFA or NFA repeatedly starting at each character in string
starting from the left
- guarantees a "leftmost" match
- still may be more than one "matching substring"
- POSIX standard: find "longest leftmost match"
- For very simple regex matching, there is very little difference
between NFA and DFA.
- example: searching for "Fred"
- There are multiple search related processes
- boolean: is there a match? (grep)
- location: where is the match? (moving cursor in editors)
- substitution: change matching string to something else (perl)
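The outline above can be sketched in Python, whose `re` module is a backtracking engine much like Perl's. This is only an illustration: the helper name `leftmost_match` and the sample text are assumptions made for the sketch, and a real engine would not restart the regex this naively.

```python
import re

def leftmost_match(pattern, text):
    """Try the regex at each start position, left to right; return the first hit."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        m = compiled.match(text, start)   # anchored attempt at this position
        if m:
            return m                      # first success is the leftmost match
    return None

text = "Alfred met Fred"                  # made-up sample text
m = leftmost_match(r"Fred", text)

found = m is not None                     # boolean: is there a match?
where = m.span() if m else None           # location: where is the match?
changed = re.sub(r"Fred", "Wilma", text)  # substitution: replace the match
```

The three variables at the end correspond to the three search-related processes listed above.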
DFA-based regex engines
- searching is fast (each pass is linear in the length of the text; restarting at every start position gives O(n^2) in the worst case)
- search time depends on length of string, not on regex.
- in this sense it is "text driven" string matching
- takes more memory (state explosion in NFA to DFA construction) than NFA
- takes longer to compile the regex
- might be done when program is compiled
- might be done at run time (just before string matching is needed)
- some implementations even compile the regex in the midst
of matching (if a match is found before the entire DFA is constructed,
they can just stop)
- POSIX ("longest leftmost") matching is easy
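To make "text driven" concrete, here is a minimal sketch of a DFA for the literal search "Fred". The table is built with a KMP-style construction, which is one possible choice and not necessarily what any particular engine does. The function names `build_dfa` and `dfa_search` are hypothetical, and the pattern is assumed to be a non-empty literal string.

```python
def build_dfa(pattern):
    """Transition table for a literal pattern (KMP-style construction)."""
    alphabet = set(pattern)
    dfa = [dict.fromkeys(alphabet, 0) for _ in pattern]
    dfa[0][pattern[0]] = 1
    fallback = 0                       # state to imitate after a mismatch
    for state in range(1, len(pattern)):
        dfa[state] = dict(dfa[fallback])          # copy mismatch transitions
        dfa[state][pattern[state]] = state + 1    # the one advancing transition
        fallback = dfa[fallback][pattern[state]]
    return dfa

def dfa_search(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    dfa = build_dfa(pattern)
    state = 0
    for i, ch in enumerate(text):
        # Exactly one table lookup per text character; characters that do
        # not occur in the pattern send the machine back to state 0.
        state = dfa[state].get(ch, 0)
        if state == len(pattern):
            return i - len(pattern) + 1
    return -1

position = dfa_search("Fred", "Alfred met Fred")
```

All of the work that depends on the pattern happens in `build_dfa`; the loop in `dfa_search` never looks back in the text and never consults the pattern again.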
NFA-based regex engines
- searching can be slow (more on this later)
- search time depends on regex
- in this sense it is "regex driven" string matching
- takes less memory than DFA (how much depends on regex)
- compiles more quickly than DFA (how much depends on regex)
- need to simulate the NFA (i.e., explore paths)
- use backtracking algorithm to "try out the guesses"
- different flavors of backtracking give different performance
- when there are options, what order do we try them in?
- can add in "extras" with no serious costs
- sub-expression trapping
- back-references (non-regular!)
- POSIX takes more thought and requires a (potentially large)
performance hit
- non-POSIX implementations
- allow for more control of matches
- require more care to use
- need to know about implementation to know what will be matched
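The "try out the guesses" idea can be sketched with a toy backtracking matcher. It supports only literal characters, `.`, and a single-character `x*`, which is far less than a real engine; the function names are hypothetical. The point is the shape of the search: the star guesses greedily, and a failed guess is undone one character at a time.

```python
def match_here(regex, text, pos=0):
    """Return the end index of a match of regex starting at text[pos], or None."""
    if not regex:
        return pos                                  # nothing left to match
    if len(regex) >= 2 and regex[1] == "*":
        return match_star(regex[0], regex[2:], text, pos)
    if pos < len(text) and regex[0] in (".", text[pos]):
        return match_here(regex[1:], text, pos + 1)
    return None

def match_star(c, rest, text, pos):
    """Match c* greedily, then back off until the rest of the regex fits."""
    end = pos
    while end < len(text) and c in (".", text[end]):
        end += 1                                    # guess: take everything
    while end >= pos:
        result = match_here(rest, text, end)
        if result is not None:
            return result                           # this guess worked
        end -= 1                                    # backtrack: give one back
    return None

end_index = match_here("a*ab", "aaab")
```

Here `a*` first swallows all three a's, the following `a` then fails against "b", and the matcher gives one character back before succeeding. How long this takes depends on the regex, which is the sense in which the search is "regex driven".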
Simulating NFAs for string matching
Greedy Metacharacters
The greedy principle:
- items that are allowed to match multiple times always attempt
to match as much as possible.
Examples (perl)
$line =~ m/^Subject: (.*)/
- Note: $1 now contains the "subject" as a side effect.
$line =~ m/^(Re: )*(.*)/
$line =~ m/^.*[0-9]+/
- fix:
$line =~ m/^[^0-9]*[0-9]+/
- what would happen without the beginning-of-line anchor (^)?
$line =~ m/^.*[0-9][0-9]/
- How does this match proceed?
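The examples above can be tried out in Python, whose `re` module is greedy in the same way as Perl. The sample lines are made up, and the capture groups around the digits are an addition for this sketch so that the effect of greediness is visible.

```python
import re

subject_line = "Subject: Re: Re: lunch"    # made-up sample lines
number_line = "Copyright 2002"

# (.*) greedily takes the rest of the line; group(1) plays the role of $1.
subject = re.match(r"^Subject: (.*)", subject_line).group(1)

# (Re: )* greedily eats every leading "Re: " before (.*) gets anything.
rest = re.match(r"^(Re: )*(.*)", "Re: Re: lunch").group(2)

# .* grabs the whole line, then gives back just enough for [0-9]+ to succeed.
greedy_digits = re.match(r"^.*([0-9]+)", number_line).group(1)

# The fix: [^0-9]* cannot run past the first digit.
fixed_digits = re.match(r"^[^0-9]*([0-9]+)", number_line).group(1)

# Requiring two digits forces .* to give back two characters.
two_digits = re.match(r"^.*([0-9][0-9])", number_line).group(1)
```

Printing `greedy_digits` and `fixed_digits` side by side shows why the first pattern needs the fix.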
Alternation
The issue here is the order in which the potential matches are checked.
For the metacharacters, one checks greedily (matches before non-matches)
and backtracks only when forced to do so. For alternation, there
are some choices to be made. The following example illustrates the
issue:
$line =~ m/(tour|to|tournament)/
- what if $line is "three tournaments won"?
- POSIX: tournament, because it yields the longest global match
- greedy alternation: tournament, because it is longer
than the other local choices
- traditional NFA: tour, because it comes first in alternation
list and is therefore tried first.
So in a traditional NFA, the user has some extra control over how
matching strings are selected (at least among the leftmost candidates).
This can also lead to surprises if one isn't careful.
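Python's `re` module behaves like a traditional NFA here, so the example is easy to check. The helper `longest_alternative` is a hypothetical stand-in for POSIX or greedy alternation and works only for a plain list of literal alternatives.

```python
import re

line = "three tournaments won"

# Alternatives are tried left to right, so "tour" wins.
traditional = re.search(r"(tour|to|tournament)", line).group()

# Listing the longest alternative first changes what is matched.
reordered = re.search(r"(tournament|tour|to)", line).group()

def longest_alternative(alternatives, text):
    """At the leftmost position where any alternative matches, keep the longest."""
    for start in range(len(text)):
        hits = [a for a in alternatives if text.startswith(a, start)]
        if hits:
            return max(hits, key=len)
    return None

posix_like = longest_alternative(["tour", "to", "tournament"], line)
```

The `reordered` line shows the extra control a traditional NFA gives the user: the order of the alternatives is part of the meaning of the regex.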
The effects of these choices are compounded in regexes like
"(\\.|[^"\\])*" and
"([^"\\]|\\.)*"
which match the same strings (quoted strings with internal
quotes allowed if escaped), but can have very different
backtracking patterns if the alternation is chosen in left to right
order. (Try them on the string "2\"x3\" likeness".)
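One way to try this is sketched below. Python's `re` confirms that both regexes match the whole string, and a small hand-rolled model counts how often the first-listed alternative fails while the star walks across the text between the outer quotes. The counter `failed_tries` is a simplified illustration written for this sketch, not a trace of what any real engine does.

```python
import re

def failed_tries(body, order):
    """Count alternatives that fail while (A|B)* walks across `body`."""
    i = fails = 0
    while i < len(body):
        for alt in order:
            if alt == "escape" and body[i] == "\\" and i + 1 < len(body):
                i += 2          # \\. consumes a backslash plus one character
                break
            if alt == "plain" and body[i] not in '"\\':
                i += 1          # [^"\\] consumes one ordinary character
                break
            fails += 1          # this alternative did not match here
        else:
            break               # neither alternative matched; the star stops
    return fails

body = r'2\"x3\" likeness'      # the text between the outer quotes
quoted = '"' + body + '"'

same = (re.fullmatch(r'"(\\.|[^"\\])*"', quoted) is not None
        and re.fullmatch(r'"([^"\\]|\\.)*"', quoted) is not None)

escape_first = failed_tries(body, ["escape", "plain"])
plain_first = failed_tries(body, ["plain", "escape"])
```

Comparing `escape_first` with `plain_first` shows how much the left-to-right order matters when most characters are ordinary ones.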
POSIX NFAs
A locally greedy algorithm (for matching Kleene star and
alternation) does not necessarily yield a globally longest string,
as the following (silly) example illustrates: Let $line
contain "abracadabra", then
$line =~ m/(ab|abra)(cad|racadabra)/
can match a longer substring if "ab" is selected at the first
alternation.
As a result, a POSIX NFA must pretty much search all possible paths
in the NFA. If there are many paths, this will take a long time.
A traditional NFA may stop when it locates its first match ("abracad"
in the example above if the alternation is greedy).
Thus a POSIX NFA will typically be slower than a traditional NFA when there
are many matches. In this case,
the traditional NFA finds one early and stops, and the POSIX NFA keeps
on checking.
Of course, both kinds of NFAs search exhaustively if
there is no match. For this reason, some implementations use
DFAs first to detect a match or when there is no reason to use
an NFA, then subsequently run an NFA when there is a match and they
need the extra features (like subexpression trapping). NFAs must
be used to check for backreferences.
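The abracadabra example can be checked directly. Python's `re` tries alternatives in the order written, so reordering the first alternation imitates what a locally greedy choice would do. The function `posix_style` is a hypothetical brute-force stand-in for a POSIX NFA, hard-wired to this one regex: it examines every path and keeps the longest leftmost match.

```python
import re
from itertools import product

line = "abracadabra"
first, second = ["ab", "abra"], ["cad", "racadabra"]

# Ordered alternation: "ab" is tried first and the overall match is long.
ordered = re.search(r"(ab|abra)(cad|racadabra)", line).group()

# Preferring the longer local choice "abra" stops at a shorter match.
locally_greedy = re.search(r"(abra|ab)(cad|racadabra)", line).group()

def posix_style(text):
    """Explore all paths at the leftmost matching position; keep the longest."""
    for start in range(len(text)):
        found = [a + b for a, b in product(first, second)
                 if text.startswith(a + b, start)]
        if found:
            return max(found, key=len)
    return None

longest = posix_style(line)
```

With two alternations of two choices each there are only four paths to examine; the number of paths grows quickly as a regex gains more choice points, which is where the POSIX performance hit comes from.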
Exercises
- Let $line contain
"val = foo(bar(this), that) + (the * other);"
What does each of the following regexes match? (Does it matter
what flavor they are?)
- \(.*\)
- \([^)]*\)
- \([^()]*\)
Note that \( is being used to get literal parentheses.
- You should have noticed that none of the regexes above correctly finds
the arguments to foo. In fact, a regex is not capable
of handling arbitrarily deep nesting. Show this by proving that the
language consisting of all ASCII strings that begin with an
opening parenthesis, end with a closing parenthesis, and have all
internal parentheses properly nested is not a regular language.
An aside:
If you set a maximum depth of nesting, it is possible to
do a regex search, but the regex is pretty ugly and if
you are not careful, the running time (NFA) or compile time
(DFA) could be very bad. We will learn other better methods
for parentheses matching later in the course.
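As a rough illustration of the aside, the following sketch builds a regex for parentheses nested up to a fixed depth by wrapping the pattern around itself. The function name `nested_parens` is hypothetical, and the pattern uses Python's non-capturing group syntax `(?:...)`.

```python
import re

def nested_parens(depth):
    """Regex for a parenthesized group nested at most `depth` levels deep."""
    pattern = r"\([^()]*\)"                 # depth 1: no inner parentheses
    for _ in range(depth - 1):
        # Each extra level allows ordinary characters or one shallower group.
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern

line = "val = foo(bar(this), that) + (the * other);"
args = re.search(nested_parens(2), line).group()
```

Printing `nested_parens(4)` gives a sense of how quickly the regex becomes ugly as the allowed depth grows.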
This page maintained by:
Randall Pruim
Department of Mathematics and Statistics
Calvin College
rpruim@calvin.edu
Last Modified:
Wednesday, 20-Feb-2002 14:39:50 EST