Regular Expression Syntax



   


Regular Expressions (supported by class RE - see Regular Expression Matching Using Finite Automata) have a simple syntax described here in Backus-Naur Form (BNF):

            <re> ::= <expression> { <expression> }
                   | <re> '|' <re>
 
    <expression> ::= <term>
                   | <term> '?'
                   | <term> '+'
                   | <term> '*'
 
          <term> ::= <label>
                   | '(' <re> ')'
 
         <label> ::= <symbol>
                   | '[' <range> { <range> } ']'
                   | '[' ']' { <range> } ']'
                   | '[' '^' <range> { <range> } ']'
                   | '[' '^' ']' { <range> } ']'
 
         <range> ::= <symbol>
                   | <symbol> '-' <symbol>
 
        <symbol> ::= '.'
                   | 0 .. n (any element of alphabet)
                   | '\' <symbol>
          

 


      

<re>


A Regular Expression consists of:

  • 1 or more <expression>s
    or
  • an <re>, the OR operator ('|'), and the alternate <re>.

Any two regular expressions A and B can be concatenated. The result is a regular expression which matches a string if A matches some prefix of that string and B matches the rest of the string. Note that there is no explicit AND (concatenation) operator; adjacent regular expressions are concatenated. For example:

    "ab" ::= "a" "b"

The regular expression "ab" is constructed by concatenating the regular expression "a" with the regular expression "b" and matches an 'a' followed by a 'b'.

 

The '|' operator provides an alternative or branch function within the regular expression. For any two regular expressions A and B, the regular expression A|B matches any string matched by A or any string matched by B. For example:

    "a|b"

matches 'a' or 'b'.
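
The two rules above can be tried out with a short sketch. It uses Python's standard re module purely as a stand-in, on the assumption that it agrees with class RE for these basic constructs. The helper name matches is hypothetical and is not part of class RE.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# Concatenation: "ab" is "a" followed by "b".
print(matches("ab", "ab"))    # True
print(matches("ab", "a"))     # False

# Alternation: "a|b" matches either branch, but not both in sequence.
print(matches("a|b", "a"))    # True
print(matches("a|b", "b"))    # True
print(matches("a|b", "ab"))   # False

# Per the grammar, '|' separates whole concatenations: "ab|cd" is "ab" or "cd".
print(matches("ab|cd", "cd"))   # True
print(matches("ab|cd", "abd"))  # False
```

Whole-string matching (re.fullmatch) is used here so that the results line up with the "matches" statements in this section.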

 

<expression>


An <expression> consists of a <term> or a <term> followed by one of the suffix operators:

  • '?' - optional,
  • '+' - repeatable,
    or
  • '*' - optional and repeatable.

 

* (optional and repeatable)


The '*' operator applies to the preceding <term>, which can then be repeated 0 or more times. For example:

    "ab*"

matches 'a' or 'ab' or 'abb' or 'abbb' etc.

Note that if a '*' does not have a preceding regular expression it will be treated as an ordinary character. For example:

    "*a"

matches '*a'.

 

+ (repeatable)


The '+' operator applies to the preceding <term>, which can then be repeated 1 or more times. For example:

    "ab+"

matches 'ab' or 'abb' or 'abbb' etc.

Unlike "ab*" in the previous section, "ab+" does not match 'a'.

Note that if a '+' does not have a preceding regular expression it will be treated as an ordinary character. For example:

    "+a"

matches '+a'.

 

? (optional)


The '?' operator applies to the preceding <term>, which is then optional (used 0 or 1 times). For example:

    "ab?"

matches 'a' or 'ab'.

Note that if a '?' does not have a preceding regular expression it will be treated as an ordinary character. For example:

    "?a"

matches "?a".
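
The three suffix operators can be compared side by side in a small sketch. It again uses Python's re module as a stand-in, not class RE itself, and the helper names matches and compiles are hypothetical. One difference is worth noting: Python refuses a suffix operator that has nothing before it, whereas the text above says class RE treats it as an ordinary character.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(matches("ab*", "a"), matches("ab*", "abbb"))  # True True
print(matches("ab+", "a"), matches("ab+", "abb"))   # False True
print(matches("ab?", "ab"), matches("ab?", "abb"))  # True False

# Python rejects a leading suffix operator rather than reading it literally.
print(compiles("*a"), compiles("+a"), compiles("?a"))  # False False False

# Escaping the operator gives the documented literal meaning in Python.
print(matches(r"\*a", "*a"))  # True
```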

 

<term>


A <term> consists of a <label> or a "group" of regular expressions.

Parentheses, '(' and ')', can be used to group regular expressions. For example:

    "a(ab)?"

matches 'a' or 'aab'.
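
The effect of grouping is easiest to see by comparing the grouped pattern with its ungrouped form. This is a minimal sketch using Python's re module as a stand-in, assuming it treats parentheses and '?' the same way as class RE. The helper name matches is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# With the group, '?' makes the whole of "ab" optional.
print(matches("a(ab)?", "a"))    # True
print(matches("a(ab)?", "aab"))  # True
print(matches("a(ab)?", "aa"))   # False

# Without the group, '?' applies only to the final 'b'.
print(matches("aab?", "aa"))     # True
print(matches("aab?", "a"))      # False
```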

 

<label>


A <label> consists of a <symbol> or a non-empty character set.

 

[ ... ]


A '[' begins a character (or label) set that is terminated by a ']'. All characters between the '[' and ']' are added to the set. For example:

    "[aeiou]"

matches 'a' or 'e' or 'i' or 'o' or 'u'.

A "]" can be added to the set by making it the first character in the set. For example:

    "[]aeiou]"

matches ']' or 'a' or 'e' or 'i' or 'o' or 'u'.

versus

    "[aeio]u]"

which is the character set containing 'a', 'e', 'i' and 'o' followed by (concatenated with) 'u' and ']'. It matches "au]" or "eu]" or "iu]" or "ou]".

 

[^ ...]


If a '^' is the first character of a character set the set is complemented. That is, the set matches all characters except those specified in the set. For example:

    "[^aeiou]"

matches any single character except: 'a', 'e', 'i', 'o', or 'u'.

To add a ']' to the character set it must be the first character following the '^'. For example:

    "[^]aeiou]"

matches any single character except: ']', 'a', 'e', 'i', 'o', or 'u'.

A '^' is special only if it is the first character of the character set and will be included in the set if it is not the first character. For example:

    "[^^]"

matches all characters except '^'.
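
The plain and complemented character sets described in this section and the previous one can be checked with a short sketch. Python's re module is used as a stand-in, on the assumption that it handles a leading ']' and a leading '^' in the same way as class RE. The helper name matches is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# Plain sets; a ']' placed first is a member of the set.
print(matches("[aeiou]", "e"))      # True
print(matches("[]aeiou]", "]"))     # True

# A ']' that is not first closes the set, so "u]" follows the set.
print(matches("[aeio]u]", "au]"))   # True
print(matches("[aeio]u]", "u"))     # False

# Complemented sets.
print(matches("[^aeiou]", "x"))     # True
print(matches("[^aeiou]", "a"))     # False
print(matches("[^]aeiou]", "]"))    # False
print(matches("[^^]", "^"))         # False
print(matches("[^^]", "a"))         # True
```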

 

<range>


A character range can be specified in a character set by two characters with a '-' between them. For example:

    "[a-z]"

matches any character in the range 'a' to 'z', including 'a' and 'z'; that is, any lower case letter.

The first character must be less than or equal to the second character. Otherwise the characters will be treated as 3 separate characters. For example:

    "[z-a]"

specifies the character set containing the characters 'a', '-', and 'z'.
Warning: this syntax is not recommended.

Character ranges may be intermixed with individual characters and other character ranges. For example:

    "[a-zA-Z0]"

or

    "[A-Z0a-z]"

matches all upper and lower case letters and the digit 0.

Note that the special character '.' and the operators '+', '*', '?' and '|' are not special inside a character set. For example:

    "[.+*?|]"

matches '.' or '+' or '*' or '?' or '|'.

A '-' will be included in the character set if it is used in a context where it does not specify a range. For example:

    "[a-]"

    "[-a]"

are character sets that include 'a' and '-'.

A '-' can also be included by using "---" (the character range containing only a '-'). For example:

    "[a-z---A-Z]"

matches upper and lower case letters plus the character '-'.
Warning: this syntax is not recommended.

However:

    "[\0---Z]"

is the character set consisting of the character range from '\0' to '-' plus the characters '-' and 'Z'.
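
The range rules in this section differ from some other regular expression libraries, so the sketch below implements them directly instead of relying on Python's re module. It is an illustration written from the rules stated above, not the code of class RE. The function name parse_set is hypothetical, and '\' escapes inside a set are assumed away: every character is taken literally.

```python
def parse_set(pattern: str):
    """Parse one '[...]' set at the start of pattern.

    Returns (negated, members, rest): whether the set is complemented,
    the frozenset of characters listed, and the unparsed remainder.
    """
    if not pattern.startswith("["):
        raise ValueError("pattern must start with '['")
    i = 1
    negated = i < len(pattern) and pattern[i] == "^"
    if negated:
        i += 1
    members = set()
    first = True  # a ']' in first position is a member, not the terminator
    while i < len(pattern):
        c = pattern[i]
        if c == "]" and not first:
            return negated, frozenset(members), pattern[i + 1:]
        first = False
        # "x-y" forms a range only if a second endpoint exists before ']'.
        if i + 2 < len(pattern) and pattern[i + 1] == "-" and pattern[i + 2] != "]":
            lo, hi = c, pattern[i + 2]
            if lo <= hi:
                members.update(chr(k) for k in range(ord(lo), ord(hi) + 1))
            else:
                # Reversed endpoints: three separate characters.
                members.update((lo, "-", hi))
            i += 3
        else:
            members.add(c)
            i += 1
    raise ValueError("unterminated character set")

print(sorted(parse_set("[z-a]")[1]))   # ['-', 'a', 'z']
print(sorted(parse_set("[a-]")[1]))    # ['-', 'a']
print("-" in parse_set("[a-z---A-Z]")[1])  # True
print(parse_set("[aeio]u]")[2])        # u]
```

In the last call the set ends at the first ']' that is not in first position, which is why "u]" is returned as the remainder to be concatenated.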

 

<symbol>


A <symbol> is the special character '.', a character escaped with a preceding '\', or any element (character) of the alphabet: 0 to max (127 for char).

 

.


A '.' matches any single character. For example: the regular expression

    "a.c"

matches any three-character string which begins with 'a' and ends with 'c'.

 

\


Any character can be escaped with a preceding '\', meaning that the character following the '\' will not be treated specially. For example:

    "a\?"

matches 'a?'. Since the '?' is escaped, it is not treated as the suffix operator '?' (QUESTION).

Otherwise, a <symbol> is a single character that matches itself. For example:

    "a"

matches 'a'.
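
The '.' and '\' rules can be sketched as follows, again with Python's re module standing in for class RE. The re.DOTALL flag is an assumption made here so that '.' matches every character, newline included, as the text describes. The helper name matches is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if pattern matches the whole of text."""
    # DOTALL lets '.' match any character, including a newline.
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

# '.' stands for exactly one character.
print(matches("a.c", "abc"))    # True
print(matches("a.c", "a\nc"))   # True
print(matches("a.c", "ac"))     # False

# An escaped character loses its special meaning.
print(matches(r"a\?", "a?"))    # True
print(matches(r"a\?", "a"))     # False
print(matches(r"a\.c", "abc"))  # False

# Any other character matches itself.
print(matches("a", "a"))        # True
```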

 



Copyright © Donald R. Biggar.
dbiggar@sympatico.ca