Regular Expression Matching
Using
Finite Automata (in C++)



   

C++ Template Classes RE, NFA & DFA

Template classes RE, NFA and DFA provide a pattern matching and searching facility for character (or other type) strings. Patterns are described by Regular Expressions (RE) written using the notation (syntax) described in Regular Expressions. A Non-deterministic Finite Automata (or just Finite Automata) (NFA) can be constructed from the regular expression, and, a Deterministic Finite Automata (DFA) can be constructed from the finite automata.


        #include "automata.h"
        #include "re.h"            // template<class BASE> class RE;
        #include "nfa.h"           // template<class BASE> class NFA;
        #include "dfa.h"           // template<class BASE> class DFA;
 
        typedef char lbase;
        lbase *test = "abbbc";
        RE<lbase> re("ab*(bc)*");
        NFA<lbase> nfa(re);        // see Figure 1
        DFA<lbase> dfa(nfa);       // see Figure 2
	

All template classes are parameterized to support other built in types: unsigned char, char, wchar_t, int, etc. Only char based regular expressions have been tested (see main.c; over 1000 test cases).

 

The C++ Programming Language

   


Figure 1: NFA

Note: Unlabeled transitions, called lambda or epsilon transitions match an empty or null string and thus can be used (taken) at any time.

 

DFA


Figure 2: DFA

 


Both the DFA and NFA can be used (run) to match a string. For example:

        lbase *nfa_result = nfa.Match(test);
        lbase *dfa_result = dfa.Match(test);

The ability to run the NFA falls out of this type of algorithm; theoretically interesting, useful for debugging, but not practical as it can be slow.

Running Match using a DFA can be significantly faster then back-tracking algorithms; a DFA Match should perform at least as well. Simple regular expressions (for example "abc") will require approximately the same amount of processing. More complicated regular expressions will require considerably less processing; a DFA Match can be orders of magnitude faster than back-tracking algorithms.

Match returns a pointer to the first element (character) of it's string argument (test) that was not examined. If the entire string is used then it will point to the NULL at the end of the string. Otherwise it will point 1 character past the character for which no transition was defined; 1 passed the character that resulted in an Invalid state.

Thus, if the pointer returned by Match points to the NULL at the end of the test string then the entire string was processed and may have been accepted by the automata. However, the final state of the automata must also be checked using member Valid (is the finite automata in a valid state) and Final (is the finite automata in a final or accept state).

        boolean nfa_matched = (*dfa_result == NULL) && nfa.Valid() && nfa.Final();
        boolean dfa_matched = (*dfa_result == NULL) && dfa.Valid() && dfa.Final();

Once a DFA (or NFA) is constructed it can be used repeatedly to match different strings. For example:


        lbase *testcases[] =
        {
                "a",
                "ab",
                "abc",
                "abcd",
        };
 
        DFA<lbase> dfa(NFA<lbase>(RE<lbase>("ab*(bc)*")));

        for(lbase **t = testcases; **t; t++)         {                 if((*dfa.Match(*t) == NULL) && dfa.Valid() && dfa.Final())                         ; // passed                 else                         ; // failed         }

Match is a small function (method) that uses members Restart (sets the current state to the start state, Next (advances the current state given an input), and the previously mentioned Valid member. Using these members Match functions can easily be written to process other input sources.

        virtual const lbase *Match(const lbase *str)
        {
                for(Restart(); *str;)
                {
                        Next(*str++);
                        if(!Valid())
                                break;
                }
                return(str);
        }


     


Last updated
Copyright © Donald R. Biggar.
dbiggar@sympatico.ca