`ReClassicParser`'s regular expression syntax

The regular expression syntax employed by ReClassicParser is similar to, but not the same as, that used in Java, Perl, Python, Tcl, etc. There are two reasons for this: it sacrifices some features for speed, and it contains operators which standard implementations do not have: complement(), notContained() and shortest().

Regular expressions are built up from atoms and operators which combine atoms into larger regular expressions.

Atoms

c

A single character is a regular expression matching exactly this character. Characters with a special meaning in the regular expression syntax must be escaped as described below. Since Java is Unicode based, a character can be any Unicode character.

\c

If c is in the set {\, b, t, n, f, r, s, ", '} or is a digit from '0' to '7' inclusive, the same translation is done as with String.translateEscapes(), except that the \<line-terminator> rule mentioned there is not included. Note how this translation interacts with what the Java compiler does. If you write new NfaBuilder<>().matchRegex("\n"), the Java compiler already provides a string with one line-feed character. If, however, you write new NfaBuilder<>().matchRegex("\\n"), the Java compiler passes in a two-character string consisting of a backslash and the character 'n'. After parsing by the regular expression parser the result is the same: a regular expression matching a line-feed. The parser provides this translation for cases where the regular expression is obtained, for example, from a command line argument or a file.
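The two levels of escaping described above can be observed with the JDK alone. The following is a minimal sketch, assuming Java 15 or later for String.translateEscapes(); the class name EscapeLevels and the method translate are made up for this illustration, and the parser itself is not involved.

```java
// Sketch using only the JDK; EscapeLevels and translate are hypothetical names.
class EscapeLevels {
    /** Applies the JDK escape translation that the parser's behaviour is compared to. */
    static String translate(String raw) {
        return raw.translateEscapes();
    }

    public static void main(String[] args) {
        String compiled = "\n"; // the compiler already produced one line-feed character
        String raw = "\\n";     // two characters: a backslash and 'n', as read from a file

        System.out.println(compiled.length());               // 1
        System.out.println(raw.length());                    // 2
        System.out.println(translate(raw).equals(compiled)); // true
    }
}
```

Both strings end up denoting the same line-feed; the only difference is whether the compiler or the escape translation did the work.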

If c is 'u' or 'x', the following four or two characters, respectively, must be valid hexadecimal digits. They are converted to an integer value and cast to a char.
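The conversion described here can be sketched in a few lines of plain Java. This is an illustration of the rule only, not the parser's actual code; the class HexEscape and its methods are hypothetical names.

```java
// Illustrative only: decodes the hex digits that follow a 'u' or 'x' escape.
class HexEscape {
    /** Returns the value of one hexadecimal digit, or -1 if c is not one. */
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** kind is 'u' (four digits expected) or 'x' (two digits expected). */
    static char decode(char kind, String digits) {
        if (kind != 'u' && kind != 'x') {
            throw new IllegalArgumentException("not a hex escape: " + kind);
        }
        int expected = (kind == 'u') ? 4 : 2;
        if (digits.length() != expected) {
            throw new IllegalArgumentException("expected " + expected + " hex digits");
        }
        int value = 0;
        for (int i = 0; i < digits.length(); i++) {
            int d = hexValue(digits.charAt(i));
            if (d < 0) {
                throw new IllegalArgumentException("invalid hex digit: " + digits.charAt(i));
            }
            value = value * 16 + d; // accumulate the integer value
        }
        return (char) value;        // cast to a char, as described in the text
    }

    public static void main(String[] args) {
        System.out.println(decode('x', "41"));       // A
        System.out.println(decode('u', "0041"));     // A
        System.out.println((int) decode('x', "0a")); // 10
    }
}
```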

For all other characters, a character escaped by a preceding backslash matches that character, even if that character otherwise has a special meaning in the regular expression syntax. Note that two consecutive backslash characters are needed in constant strings within Java programs. To construct an Nfa which matches the open bracket, write

  new NfaBuilder<>().matchRegex("\\[")

[...]

A set of characters enclosed in brackets matches any single character mentioned in the set. Within the brackets, a range of characters is denoted by the first and the last character of the range separated by a dash, e.g. [a-z]. If the caret "^" is the first character within the brackets, the character set is inverted, i.e. it will match any character not in the set. To include the right bracket, dash or caret in the set, they must be preceded by a backslash. (Two backslashes are needed in constant strings in a Java program, because one backslash is consumed by the compiler.)
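For the cases mentioned here, java.util.regex accepts the same bracket notation, so the double-backslash rule for Java string constants can be tried out with the JDK alone. This is only a sketch under that assumption: it exercises java.util.regex, not ReClassicParser, and the class name BracketSets is made up.

```java
import java.util.regex.Pattern;

// Uses java.util.regex as a stand-in; the bracket notation is the same for these cases.
class BracketSets {
    // Each "\\" in the source is a single backslash in the pattern text: [\]\-\^]
    static final String SPECIALS = "[\\]\\-\\^]";
    // Inverted set: any single character that is not a lower-case ASCII letter.
    static final String NOT_LOWER = "[^a-z]";

    /** True if the whole input is exactly one character from the given set. */
    static boolean matchesOne(String set, String input) {
        return Pattern.matches(set, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesOne(SPECIALS, "]"));  // true
        System.out.println(matchesOne(SPECIALS, "-"));  // true
        System.out.println(matchesOne(SPECIALS, "^"));  // true
        System.out.println(matchesOne(SPECIALS, "a"));  // false
        System.out.println(matchesOne(NOT_LOWER, "A")); // true
        System.out.println(matchesOne(NOT_LOWER, "q")); // false
    }
}
```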

(re)

A regular expression re enclosed in parentheses matches exactly whatever the re matches.

Operators

re?: A regular expression re followed by the question mark matches re or the empty string.
re+: A regular expression re followed by the plus sign matches one or more occurrences of re.
re*: A regular expression re followed by the asterisk matches zero or more occurrences of re.
re{n} re{n,} re{n,m}: A regular expression re followed by one of the shown range specifications matches exactly n, at least n, or between n and m (inclusive) occurrences of re. The range may not specify the empty string, e.g. {0} or {0,0}. Furthermore, n ≤ m is required.
Note: A Dfa cannot count. Therefore the Dfa for re is internally replicated up to m times. For a large Dfa or large m or even both, this will use a lot of memory.
re!: A regular expression re followed by the exclamation mark matches the shortest match satisfying re. This operator is particularly useful to jump to the first occurrence of some string. For example, the expression "(.*</b>)!" matches everything up to and including the first "</b>" found. A comparison with the non-greedy "*" operator available in other regular expression engines can be found below.
re~: A regular expression re followed by the tilde matches any string which does not match re. This can be counter-intuitive, because (abc)~ will match the following strings: "ab", "abcd", "xxabcd" and so on. In fact every string which is not exactly "abc" will match. Internally, this calls complement().
re^: A regular expression re followed by the hat (caret) matches any string which does not contain a match of re. Internally, this calls notContained(). The caret operator is a convenience shortcut for "((.*re.*)?)~". If in doubt whether to use re~ or re^, you probably want re^.
re@: A regular expression re followed by "@" matches all strings that re matches, as well as all non-empty prefixes of these. Put another way, all non-empty prefix matches are added. See NfaBuilder.allPrefixes().
re₁re₂: matches all strings which match re₁ immediately followed by a match of re₂.
re₁|re₂: matches all strings which match either re₁ or re₂.
re₁&re₂: matches all strings which match both re₁ and re₂. The operator '&' binds more tightly than '|', so a&b|c is the same as (a&b)|c.
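Because the difference between re~ and re^ is easy to get wrong, here is a small model of both operators for the special case where re is a plain literal such as abc. It is a sketch in plain Java, not a call into the library: the class and method names are hypothetical, and the behaviour for the empty input string is deliberately left out of the model.

```java
// Models the two operators for a literal re only; not the library's implementation.
class ComplementVsNotContained {
    /** Models re~: true for every string that is not exactly the literal. */
    static boolean complementMatches(String literal, String input) {
        return !input.equals(literal);
    }

    /** Models re^: true for every string that nowhere contains the literal. */
    static boolean notContainedMatches(String literal, String input) {
        return !input.contains(literal);
    }

    public static void main(String[] args) {
        String[] inputs = {"ab", "abc", "abcd", "xxabcd"};
        for (String s : inputs) {
            System.out.println(s
                + "  (abc)~: " + complementMatches("abc", s)
                + "  (abc)^: " + notContainedMatches("abc", s));
        }
    }
}
```

In this model, "abcd" and "xxabcd" are accepted by the complement but rejected by the not-contained variant, which is usually the behaviour one is after.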

Non Greedy Matching vs. Shortest Match

The difference is best explained by an example. When trying to match a full XML element followed by a certain context, one may be tempted to write

  <tag>.*?</tag><otherTag>

employing the non-greedy operator "*?" available in java.util.regex. However, non-greedy operators sacrifice the shortest possible match for an overall match of the regular expression if necessary. Consequently the above expression would match the text

  <tag>bla</tag><somestuff>...</somestuff><tag>xxx</tag><otherTag>

just because the longer match satisfies the regular expression, while stopping at the first </tag> would not match.
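This behaviour of the non-greedy operator can be reproduced with java.util.regex. The sketch below uses only the JDK; the class name NonGreedyDemo and the helper matchAtStart are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demonstrates that a non-greedy quantifier still extends to make the whole pattern match.
class NonGreedyDemo {
    static final String TEXT =
        "<tag>bla</tag><somestuff>...</somestuff><tag>xxx</tag><otherTag>";

    /** Returns the text matched at the start of input, or null if nothing matches there. */
    static String matchAtStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String matched = matchAtStart("<tag>.*?</tag><otherTag>", TEXT);
        // The match runs past the first </tag> and covers the complete text.
        System.out.println(TEXT.equals(matched)); // true
    }
}
```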

In contrast, the shortest match operator as implemented by jfa does not give up the shortest possible match of a subexpression to allow the whole expression to match. Consequently,

  <tag>(.*</tag>)!<otherTag>

would not match at the beginning of the above string.
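The shortest-match behaviour for this particular expression can be modelled with plain string operations: commit to the first </tag> and only then check the context. This is a hand-written sketch of the semantics described above, not the library's algorithm; the class name ShortestMatchSketch is hypothetical.

```java
// Hand-written model of <tag>(.*</tag>)!<otherTag> anchored at the start of the input.
class ShortestMatchSketch {
    static boolean matchesAtStart(String input) {
        String open = "<tag>";
        String close = "</tag>";
        if (!input.startsWith(open)) {
            return false;
        }
        // The shortest match of .*</tag> ends at the first closing tag.
        int closeAt = input.indexOf(close, open.length());
        if (closeAt < 0) {
            return false;
        }
        // No backing off: the context must follow that first closing tag directly.
        return input.startsWith("<otherTag>", closeAt + close.length());
    }

    public static void main(String[] args) {
        String text = "<tag>bla</tag><somestuff>...</somestuff><tag>xxx</tag><otherTag>";
        System.out.println(matchesAtStart(text));                       // false
        System.out.println(matchesAtStart("<tag>xxx</tag><otherTag>")); // true
    }
}
```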
