Simple Primitives and Expressions
The simplest regular expression is a single character. It matches exactly that character. A sequence of characters matches a string with exactly the same sequence of characters:
'a' matches: 'a' -- true
'foobar' matches: 'foobar' -- true
'blorple' matches: 'foobar' -- false
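As a rough cross-check, the same whole-string tests can be sketched in Python. This is an assumption for illustration only: Python's standard `re` module is a different engine from the one described here, and the helper name `matches` is hypothetical. `re.fullmatch` succeeds only when the entire string matches, which mirrors the examples above.

```python
import re

def matches(text: str, pattern: str) -> bool:
    """Return True if the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches("a", "a"))             # True
print(matches("foobar", "foobar"))   # True
print(matches("blorple", "foobar"))  # False
```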
The paragraph above introduced a primitive regular expression (a single character) and an operator (sequencing). Operators are applied to regular expressions to produce more complex regular expressions. Sequencing, which means placing expressions one after another, is in a certain sense an 'invisible' operator, yet it is arguably the most common one.
A more 'visible' operator is Kleene closure, more often simply referred to as 'a star'. A regular expression followed by an asterisk matches zero or more consecutive occurrences of whatever the original expression matches. For example:
'ab' matches: 'a*b' -- true
'aaaaab' matches: 'a*b' -- true
'b' matches: 'a*b' -- true
'aac' matches: 'a*b' -- false: b does not match
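The star examples above can be reproduced with a small Python sketch. It assumes Python's `re` module as a stand-in for the engine described here; the variable names are illustrative.

```python
import re

# 'a*b': zero or more a's, followed by exactly one b.
star = re.compile(r"a*b")

# Map each sample string to whether the whole string matches.
results = {s: star.fullmatch(s) is not None
           for s in ["ab", "aaaaab", "b", "aac"]}
print(results)
# {'ab': True, 'aaaaab': True, 'b': True, 'aac': False}
```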
A star's precedence is higher than that of sequencing, so a star applies to the shortest possible subexpression that precedes it. For example, 'ab*' means 'a followed by zero or more occurrences of b', not 'zero or more occurrences of ab':
'abbb' matches: 'ab*' -- true
'abab' matches: 'ab*' -- false
To actually make a regular expression matching 'zero or more occurrences of ab', 'ab' is enclosed in parentheses:
'abab' matches: '(ab)*' -- true
'abcab' matches: '(ab)*' -- false: c does not match
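The effect of parentheses on what the star applies to can be checked with a short Python sketch. It assumes Python's `re` module behaves like the engine described here for these constructs, and `whole` is a hypothetical helper name.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """True if the entire text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Without parentheses the star binds only to 'b'.
print(whole(r"ab*", "abbb"))     # True
print(whole(r"ab*", "abab"))     # False

# With parentheses the star binds to the whole group 'ab'.
print(whole(r"(ab)*", "abab"))   # True
print(whole(r"(ab)*", "abcab"))  # False
print(whole(r"(ab)*", ""))       # True: zero occurrences of 'ab'
```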
Two other operators similar to '*' are '+' and '?'. '+' (positive closure, or simply 'plus') matches one or more occurrences of the original expression. '?' ('optional') matches zero or one, but never more, occurrences.
'ac' matches: 'ab*c' -- true
'ac' matches: 'ab+c' -- false: need at least one b
'abbc' matches: 'ab+c' -- true
'abbc' matches: 'ab?c' -- false: too many b's
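To compare the three repetition operators side by side, the following Python sketch runs the same sample strings through 'ab*c', 'ab+c', and 'ab?c'. It assumes Python's `re` module as a stand-in; the names `samples` and `accepted` are illustrative.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]

def accepted(pattern: str) -> list:
    """Return the sample strings that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star_matches = accepted(r"ab*c")      # zero or more b's
plus_matches = accepted(r"ab+c")      # one or more b's
optional_matches = accepted(r"ab?c")  # zero or one b

print(star_matches)      # ['ac', 'abc', 'abbc', 'abbbc']
print(plus_matches)      # ['abc', 'abbc', 'abbbc']
print(optional_matches)  # ['ac', 'abc']
```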
As we have seen, the characters '*', '+', '?', '(', and ')' have special meaning in regular expressions. To use one of them literally, quote it by preceding it with a backslash. (The backslash is therefore also a special character and must itself be quoted for a literal match, as must the other special characters described later.)
'ab*' matches: 'ab*' -- false: star in the right string is special
'ab*' matches: 'ab\*' -- true
'a\c' matches: 'a\\c' -- true
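The quoting rule can be sketched in Python, again assuming its `re` module as a stand-in. Raw string literals (`r"..."`) keep the backslashes readable, and `re.escape` is Python's own helper for quoting special characters; it is not part of the engine described here.

```python
import re

# An unquoted star is an operator, so it cannot match a literal '*'.
unquoted = re.fullmatch(r"ab*", "ab*") is not None
# A quoted star matches the literal character.
quoted = re.fullmatch(r"ab\*", "ab*") is not None
# A quoted backslash matches one literal backslash ("a\\c" is the text a\c).
backslash = re.fullmatch(r"a\\c", "a\\c") is not None

print(unquoted, quoted, backslash)  # False True True

# re.escape quotes the special characters in a literal string.
escaped = re.escape("ab*")
print(escaped)  # ab\*
```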
The last operator is '|' meaning 'or'. It is placed between two regular expressions, and the resulting expression matches if one of the expressions matches. It has the lowest possible precedence (lower than sequencing). For example, 'ab*|ba*' means 'a followed by any number of b's, or b followed by any number of a's':
'abb' matches: 'ab*|ba*' -- true
'baa' matches: 'ab*|ba*' -- true
'baab' matches: 'ab*|ba*' -- false
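Because '|' has the lowest precedence, 'ab*|ba*' groups as '(ab*)|(ba*)'. The Python sketch below checks that both spellings accept the same strings. It assumes Python's `re` module follows the same precedence rules, and `whole` is a hypothetical helper.

```python
import re

implicit = r"ab*|ba*"      # as written in the text
explicit = r"(ab*)|(ba*)"  # the same grouping spelled out with parentheses

def whole(pattern: str, text: str) -> bool:
    """True if the entire text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

for s in ["abb", "baa", "baab", "a", "b"]:
    # Both columns should agree for every sample.
    print(s, whole(implicit, s), whole(explicit, s))
```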
A slightly more complex example is the following expression, matching the name of any of the Lisp-style 'car', 'cdr', 'caar', 'cadr', ... functions:
c(a|d)+r
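A quick Python sketch of this expression, assuming the `re` module as a stand-in engine, shows which names it accepts. The sample lists are illustrative.

```python
import re

# 'c', then one or more of 'a' or 'd', then 'r'.
cxr = re.compile(r"c(a|d)+r")

valid = ["car", "cdr", "caar", "cadr", "cddar"]
invalid = ["cr", "cbr", "card", "ca"]

valid_results = [cxr.fullmatch(name) is not None for name in valid]
invalid_results = [cxr.fullmatch(name) is not None for name in invalid]

print(valid_results)    # [True, True, True, True, True]
print(invalid_results)  # [False, False, False, False]
```

Note that 'cr' is rejected because '+' requires at least one 'a' or 'd' between 'c' and 'r'.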
It is possible to write an expression that matches the empty string, for example 'a|'. However, it is an error to apply '*', '+', or '?' to such an expression: '(a|)*' is an invalid expression.