Regex Matching

Basic Outline

DFA-based regex engines

NFA-based regex engines

Simulating NFAs for string matching

Greedy Metacharacters

The greedy principle:
Items that are allowed to match multiple times always attempt to match as much as possible.
Examples (Perl)
  1. $line =~ m/^Subject: (.*)/;
  2. $line =~ m/^(Re: )*(.*)/;
  3. $line =~ m/^.*[0-9]+/;
  4. $line =~ m/^.*[0-9][0-9]/;
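
The four patterns above can be tried out with a short sketch. It uses Python's re module, which (like Perl) is a traditional backtracking engine, so the behavior should carry over. The input strings are hypothetical, and the capturing parentheses in patterns 3 and 4 are added here only to show which digits are left for the final item.

```python
import re

# Hypothetical input lines, not taken from the notes.
header = "Subject: Re: Re: greedy matching"
rest = "Re: Re: greedy matching"
text = "Copyright 2003"

# 1. (.*) grabs everything up to the end of the line.
g1 = re.match(r"^Subject: (.*)", header).group(1)

# 2. (Re: )* consumes every leading "Re: " before (.*) gets anything.
g2 = re.match(r"^(Re: )*(.*)", rest).group(2)

# 3. .* takes all it can, then gives back one character so [0-9]+ can match.
g3 = re.match(r"^.*([0-9]+)", text).group(1)

# 4. Here .* has to give back two characters.
g4 = re.match(r"^.*([0-9][0-9])", text).group(1)

print(g1)  # Re: Re: greedy matching
print(g2)  # greedy matching
print(g3)  # 3
print(g4)  # 03
```

Note that in pattern 3 the greedy .* leaves only a single digit for [0-9]+, even though [0-9]+ is itself greedy: the earlier item gets first pick.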

Alternation

The issue here is the order in which potential matches are checked. For the greedy metacharacters, the engine tries matching before not matching and backtracks only when forced to do so. For alternation, there are some choices to be made. A traditional NFA typically tries the alternatives in the order they are written and keeps the first one that leads to an overall match. So in a traditional NFA, the user has some extra control over how matching strings are selected (at least among the leftmost candidates). This can also lead to surprises if one isn't careful.
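
A small sketch of how the order of alternatives can change the result in a traditional NFA. The input line and both patterns are hypothetical illustrations. Python's re module is used because it tries alternatives left to right, as Perl does.

```python
import re

line = "three tournaments won"   # hypothetical input

# Alternatives are tried in the order written; the first that succeeds wins.
short_first = re.search(r"tour|to|tournament", line).group(0)
long_first = re.search(r"tournament|tour|to", line).group(0)

print(short_first)  # tour
print(long_first)   # tournament
```

Both patterns describe the same set of strings, and both matches start at the same (leftmost) position. Only the order of the alternatives differs.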

The effects of these choices are compounded in regexes like "(\\.|[^"\\])*" and "([^"\\]|\\.)*", which match the same strings (quoted strings with internal quotes allowed if escaped) but can have very different backtracking patterns if the alternatives are tried in left-to-right order. (Try them on the string "2\"x3\" likeness".)
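
To get a feel for the difference, here is a toy backtracking matcher for patterns of the form "(alt1|alt2)*". It is a hand-written simplification, not Perl's actual engine: the match is anchored to the whole string, and the helper names (escaped, plain, count_failed_tries) are made up for this sketch. It counts how often an alternative is tried and fails.

```python
# Each alternative takes (text, pos) and returns the position just past its
# match, or None if it does not match at pos.
def escaped(text, pos):      # models  \\.
    if text.startswith("\\", pos) and pos + 1 < len(text):
        return pos + 2
    return None

def plain(text, pos):        # models  [^"\\]
    if pos < len(text) and text[pos] not in '"\\':
        return pos + 1
    return None

def count_failed_tries(alternatives, text):
    """Match "(alt1|alt2|...)*" against all of text, alternatives left to right.

    Returns (matched, failed), where failed counts alternative attempts
    that did not match at the position where they were tried.
    """
    failed = 0

    def star(pos):
        nonlocal failed
        for alt in alternatives:
            end = alt(text, pos)
            if end is None:
                failed += 1
            elif star(end):          # greedy: try another iteration first
                return True
        # No further iteration works: leave the star, require the closing quote.
        return text[pos:] == '"'

    matched = text.startswith('"') and star(1)
    return matched, failed

text = '"2\\"x3\\" likeness"'        # the string "2\"x3\" likeness"
escaped_first = count_failed_tries([escaped, plain], text)
plain_first = count_failed_tries([plain, escaped], text)
print(escaped_first)  # (True, 14)
print(plain_first)    # (True, 4)
```

In this toy model, putting the common case (an ordinary character) first means the engine rarely has to fall through to the second alternative.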

POSIX NFAs

A locally greedy algorithm (for matching Kleene star and alternation) does not necessarily yield a globally longest match, as the following (silly) example illustrates. Let $line contain "abracadabra". A regex with more than one alternation can match a longer substring if "ab" is selected at the first alternation than if the locally greedy choice is made there.

As a result, a POSIX NFA must search essentially all possible paths through the NFA before it can be sure it has found the longest match. If there are many paths, this will take a long time. A traditional NFA may stop as soon as it locates its first match ("abracad" in the example above if the alternation is greedy). Thus a POSIX NFA will typically be slower than a traditional NFA when there are many ways to match: the traditional NFA finds one early and stops, while the POSIX NFA keeps on checking.
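
The contrast can be sketched in Python. The regex below is a hypothetical stand-in, chosen to be consistent with the "abracadabra" discussion above; it is not necessarily the regex originally intended. Python's re plays the traditional NFA, and the leftmost-longest rule is emulated by brute force over every combination of alternatives, which is feasible only because this pattern has no stars.

```python
import re
from itertools import product

line = "abracadabra"
first = ["abra", "ab"]           # hypothetical first alternation
second = ["cad", "racadabra"]    # hypothetical second alternation

# Traditional NFA: take the first path that succeeds.
pattern = "(" + "|".join(first) + ")(" + "|".join(second) + ")"
traditional = re.search(pattern, line).group(0)

def leftmost_longest(candidates, text):
    """Emulate the POSIX rule: earliest start, then the longest match there."""
    for start in range(len(text)):
        hits = [c for c in candidates if text.startswith(c, start)]
        if hits:
            return max(hits, key=len)
    return None

# Every string the pattern can generate (all paths through the NFA).
candidates = [a + b for a, b in product(first, second)]
posix_style = leftmost_longest(candidates, line)

print(traditional)  # abracad
print(posix_style)  # abracadabra
```

The brute-force search has to look at every candidate before answering, which is the extra work the paragraph above attributes to a POSIX NFA.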

Of course, both kinds of NFAs search exhaustively if there is no match. For this reason, some implementations first use a DFA to detect whether there is a match at all (or use it exclusively when there is no reason to use an NFA), and subsequently run an NFA only when there is a match and the extra features (like subexpression trapping) are needed. NFAs must be used to check for backreferences.
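
As a reminder of what these extra features look like, here is a minimal sketch of subexpression trapping and a backreference, again in Python's backtracking re module. The input line is hypothetical.

```python
import re

line = "it was was a typo"   # hypothetical input

# (\w+) traps a word; \1 then requires the very same text to appear again.
m = re.search(r"\b(\w+) \1\b", line)
doubled = m.group(0)
word = m.group(1)

print(doubled)  # was was
print(word)     # was
```

What \1 must match depends on what the group matched earlier in the same attempt, which is information a plain DFA does not keep track of.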

Exercises

  1. Let $line contain
            "val = foo(bar(this), that) + (the * other);"
    		
    What does each of the following regexes match? (Does it matter what flavor they are?)
    1. \(.*\)
    2. \([^)]*\)
    3. \([^()]*\)
    Note that \( is being used to get literal parentheses.
  2. You should have noticed that none of the regexes above correctly finds the arguments to foo. In fact, a regex is not capable of handling arbitrarily deep nesting. Show this by proving that the language consisting of all ASCII strings which begin with an opening parenthesis, end with a closing parenthesis, and have all internal parentheses properly nested is not a regular language.

    An aside: If you set a maximum depth of nesting, it is possible to do a regex search, but the regex is pretty ugly, and if you are not careful, the running time (NFA) or compile time (DFA) could be very bad. We will learn other, better methods for parentheses matching later in the course.
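
    To see how ugly such a regex gets, here is one possible way to build it mechanically. The function name nested_paren_regex and its max_depth parameter are made up for this sketch, and this is only one of several ways to write such a pattern.

```python
import re

def nested_paren_regex(max_depth):
    """Regex for a parenthesized group with at most max_depth levels nested inside."""
    inner = r"[^()]*"                       # depth 0: no parentheses inside
    for _ in range(max_depth):
        # One more level: ordinary characters or a complete group one level shallower.
        inner = r"(?:[^()]|\(" + inner + r"\))*"
    return r"\(" + inner + r"\)"

line = "val = foo(bar(this), that) + (the * other);"

depth1 = nested_paren_regex(1)
found = re.search(depth1, line).group(0)

print(depth1)  # \((?:[^()]|\([^()]*\))*\)
print(found)   # (bar(this), that)
```

    The pattern grows with every extra level, and a string nested more deeply than max_depth, such as "((()))" with max_depth 1, is not matched as a whole.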


This page maintained by:
Randall Pruim
Department of Mathematics and Statistics
Calvin College
rpruim@calvin.edu

Last Modified: Wednesday, 20-Feb-2002 14:39:50 EST