ReClassicParser's regular expression syntax
The regular expression syntax employed by ReClassicParser is similar to, but not the same as, that used in Java, Perl, Python, Tcl, etc. There are two reasons for this: it sacrifices some features for speed, and it contains operators which are not known in standard implementations: complement(), notContained() and shortest().
Regular expressions are built up from atoms and operators which combine atoms into larger regular expressions.
Atoms
- c
- A single character is a regular expression matching exactly this character. Characters with a special meaning in the regular expression syntax must be escaped as described below. Since Java is UNICODE based, a character can be any UNICODE character.
- \c
- If c is in the set {\, b, t, n, f, r, s, ", '} or is a digit from '0' to '7' inclusive, the same translation is done as with String.translateEscapes(). This does not include the \<line-terminator> rule mentioned there. Note how this translation interacts with what the Java compiler does. If you write new NfaBuilder().matchRegex("\n"), the Java compiler already provides a string with one line-feed character. If, however, you write new NfaBuilder().matchRegex("\\n"), the Java compiler passes in a two-character string consisting of a backslash and the character 'n'. The result is the same after parsing by the regular expression parser: a regular expression matching a line feed. The translation is provided in the parser for cases where the regular expression is obtained, for example, from a command line argument or a file.
If c is 'u' or 'x', the following 4 or 2 characters, respectively, must be valid hexadecimal characters. They are converted to an integer value and cast to a char.
For all other characters, a character escaped by a preceding backslash matches that character, even if that character otherwise has a special meaning in the regular expression syntax. Note that two consecutive backslash characters are needed in constant strings within Java programs. To construct an Nfa which matches the open bracket, write new NfaBuilder<>().matchRegex("\\[").
- [...]
- A set of characters enclosed in brackets matches any single character mentioned in the set. Within the brackets, a range of characters is denoted by the first and the last character of the range separated by a dash, e.g. [a-z]. If the caret "^" is the first character within the brackets, the character set is inverted, i.e. it will match any character not in the set. To include the right bracket, dash or caret in the set, it must be preceded by a backslash. (Two backslashes in constant strings in a Java program, because one backslash is consumed by the compiler.)
- (re)
- A regular expression re enclosed in parentheses matches exactly whatever the re matches.
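The two translation stages described for \c can be illustrated with the JDK alone. The sketch below does not call ReClassicParser; the class name EscapeStages and the helper parserView are hypothetical, and String.translateEscapes() (Java 15 or later) merely stands in for the translation the text attributes to the parser for the listed escape characters and octal digits.

```java
// Illustration only: uses the JDK, not ReClassicParser.
final class EscapeStages {
  /**
   * Stand-in for the parser's escape translation. Only valid for the
   * escapes String.translateEscapes() knows; other escapes such as "\\["
   * would throw IllegalArgumentException here.
   */
  static String parserView(String regexSource) {
    return regexSource.translateEscapes();
  }

  public static void main(String[] args) {
    String compiled = "\n";  // the Java compiler already produced one line feed
    String literal = "\\n";  // two characters: backslash and 'n'

    System.out.println(compiled.length());                     // 1
    System.out.println(literal.length());                      // 2
    // After translation both forms denote the same single character.
    System.out.println(parserView(literal).equals(compiled));  // true
  }
}
```

A string read from a file or the command line arrives in the two-character form, which is why a translation step after the compiler is useful.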
Operators
- re?
- A regular expression re followed by the question mark matches re or the empty string.
- re+
- A regular expression re followed by the plus sign matches one or more occurrences of re.
- re*
- A regular expression re followed by the asterisk matches zero or more occurrences of re.
- re{n} re{n,} re{n,m}
- A regular expression re followed by one of the range specifications shown matches exactly n, at least n, or between n and m (inclusive) occurrences of re. The range may not specify the empty string, e.g. {0} or {0,0}. Furthermore, n≤m is required.
Note: A Dfa cannot count. Therefore the Dfa for re is internally replicated up to m times. For a large Dfa, a large m, or both, this will use a lot of memory.
- re!
- A regular expression re followed by the exclamation mark matches the shortest match satisfying re. This operator is particularly useful to jump to the first occurrence of some string. For example, the expression "(.*</b>)!" matches everything up to and including the first "</b>" found. A comparison with the non-greedy "*?" operator available in other regular expression engines can be found below.
- re~
- A regular expression re followed by the tilde matches any string which does not match re. This can be counter-intuitive, because (abc)~ will match the following strings: "ab", "abcd", "xxabcd" and so on. In fact, every string which is not exactly "abc" will match. Internally, this calls complement().
- re^
- A regular expression re followed by the hat (caret) matches any string which does not contain a match of re. Internally, this calls notContained(). The hat operator is a convenience shortcut for "((.*re.*)?)~". If in doubt whether to use re~ or re^, you probably want re^.
- re@
- A regular expression re followed by "@" matches all strings that re matches, as well as all non-empty prefixes of these. Put another way, all non-empty prefix matches are added. See NfaBuilder.allPrefixes().
- re1re2
- matches all strings which match re1 immediately followed by a match of re2.
- re1|re2
- matches all strings which match either re1 or re2.
- re1&re2
- matches all strings which match both re1 and re2. The operator '&' binds more strongly than '|', so a&b|c is the same as (a&b)|c.
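Because the difference between re~ and re^ is easy to get wrong, here is a rough emulation of both using java.util.regex. It is only a sketch: the class and method names are hypothetical, it does not use ReClassicParser, and it assumes that re is written in a syntax java.util.regex also understands. The second method follows the "((.*re.*)?)~" formulation given in the text.

```java
import java.util.regex.Pattern;

// Emulation with java.util.regex; not the library's own implementation.
final class ComplementVsNotContained {
  /** Emulates re~ : true for every string that is not exactly a match of re. */
  static boolean complement(String re, String s) {
    return !Pattern.compile(re, Pattern.DOTALL).matcher(s).matches();
  }

  /** Emulates re^ as the complement of ((.*re.*)?). */
  static boolean notContained(String re, String s) {
    String containing = "((.*(" + re + ").*)?)";
    return !Pattern.compile(containing, Pattern.DOTALL).matcher(s).matches();
  }

  public static void main(String[] args) {
    // "xxabcd" is not exactly "abc", so the complement accepts it ...
    System.out.println(complement("abc", "xxabcd"));    // true
    // ... but it contains "abc", so the not-contained form rejects it.
    System.out.println(notContained("abc", "xxabcd"));  // false
    System.out.println(notContained("abc", "ab"));      // true
  }
}
```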
Non Greedy Matching vs. Shortest Match
The difference is best explained by an example. When trying to match a full XML element followed by a certain context, one may be tempted to write
<tag>.*?</tag><otherTag>
employing the non-greedy operator "*?" available
in java.util.regex. However, non-greedy operators
sacrifice the shortest possible match for an overall match of the
regular expression if necessary. Consequently the above expression
would match the text
<tag>bla</tag><somestuff>...</somestuff><tag>xxx</tag><otherTag>
just because the longer match satisfies the regular expression,
while stopping at the first </tag> would not
match.
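The behaviour of the non-greedy operator can be checked directly with java.util.regex. In the sketch below, the class name NonGreedyDemo and the helper matchAtStart are illustrative only; the pattern and the text are the ones used above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonGreedyDemo {
  static final String TEXT =
      "<tag>bla</tag><somestuff>...</somestuff><tag>xxx</tag><otherTag>";

  /** Returns the match that starts at index 0 of text, or null if there is none. */
  static String matchAtStart(String regex, String text) {
    Matcher m = Pattern.compile(regex).matcher(text);
    return m.lookingAt() ? m.group() : null;
  }

  public static void main(String[] args) {
    String hit = matchAtStart("<tag>.*?</tag><otherTag>", TEXT);
    // The reluctant quantifier keeps expanding past the first </tag>
    // until <otherTag> can follow, so the whole text is matched.
    System.out.println(TEXT.equals(hit));  // true
  }
}
```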
In contrast, the shortest match operator as implemented
by jfa does not give up the shortest possible match
of a subexpression to allow the whole expression to
match. Consequently,
<tag>(.*</tag>)!<otherTag>
would not match at the beginning of the above string.
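To make the shortest-match semantics concrete, the following sketch hard-codes the expression <tag>(.*</tag>)!<otherTag> as plain string operations. It is a hand-written stand-in under the assumption that the shortest match of .*</tag> ends at the first </tag>; it is not how the library implements shortest(), and the class and method names are hypothetical.

```java
// Hand-written stand-in for <tag>(.*</tag>)!<otherTag>, anchored at index 0.
final class ShortestMatchSketch {
  static boolean matchesAtStart(String text) {
    if (!text.startsWith("<tag>")) {
      return false;
    }
    // The shortest match of .*</tag> ends at the FIRST closing tag.
    int close = text.indexOf("</tag>", "<tag>".length());
    if (close < 0) {
      return false;
    }
    // No fallback to a later </tag>: <otherTag> must follow right here.
    return text.startsWith("<otherTag>", close + "</tag>".length());
  }

  public static void main(String[] args) {
    String text =
        "<tag>bla</tag><somestuff>...</somestuff><tag>xxx</tag><otherTag>";
    System.out.println(matchesAtStart(text));                        // false
    System.out.println(matchesAtStart("<tag>bla</tag><otherTag>"));  // true
  }
}
```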