Class NfaBuilder<T>
A builder for non-deterministic finite automata along the lines of Thompson's Construction and a compilation method to get a deterministic finite automaton.
This builder works as a stack machine to allow building of automata to be combined into larger automata by operations on the top elements of the automata-stack.
An NFA is defined by a start state and a last state, both of which are used
as hooks for operations on NFA, for example concatenating the top stack elements on the stack
basically adds an epsilon transition from the last state of the 2nd stack element to the start
state of the top stack element. Most useful are NFA where there is a path from start to last, but
this is not formally required or enforced. In fact matchNothing() creates a disconnected
NFA.
Recognizing Nfa vs. Matching Dfa
Each state of an NFA may store a value of the generic type. Values can be added with
addValue(T) to the last state of the top NFA on the stack. Values are kept throughout
automaton operations. During compilation of the NFA, sets of states are combined into a single
DFA state representing them. Since a DFA state can also only store one value, at this point the
merger defined in the constructor or with setStateValueMerger(monq.jfa.StateValueMerger<T>) is invoked to create a
single state from the potentially multiple values found in the set of NFA states.
To describe operations, below we use the term "recognize" to describe that, if the plain NFA being created would be used, it would recognize a given string in the classical sense that a path can be found from start to last state. As described above, though, this does not necessarily mean the same as "matching" with the compiled Dfa, depending on which states have values and whether they are actually stop indicating values. Nevertheless, method names tend to start with "match", historically.
Example
Once compiled into a Dfa, only states with values can be accepting (or stop) states of
the automaton, depending on the predicate set with Dfa.withIsStopValue(Predicate). This
means that a last state is not necessarily an accepting state. Only when a value is
added to it and the isStopValue predicate is true, it represents an
accepting state in the resulting DFA. On the other hand, a last state loses its last state status
in most automaton operations, whether it has a value or not.
Example: After compilation into a Dfa, the Dfa, by default, treats
non-null values as stop states. With
var builder = new NfaBuilder<String>()
you may write
builder.matchRegex("[a-z]+").addValue("TEXT").matchRegex("[0-9]+").addValue("TEXTNUM").seq()
to get an automaton which matches the regular expression "[a-z]+" and maps it to the
value "TEXT", but also matches "[a-z]+[0-9]+" and maps it to the value
"TEXTNUM", whereby the longest match is used. To match either one instead of the
concatenation, use or() instead of seq().
To actually use the automaton for matching, compile it with compileToDfa() and use
the Dfa's functions or a DfaFilter.
With FaToDot you may visualize the graph of an automaton using
graphviz.
Invariants
The following invariants are maintained by operations on the NFA on the stack:- the start state has no incoming transitions
- the stop state has no outgoing transitions
-
Constructor Summary
ConstructorsConstructorDescriptionNfaBuilder(CharUnaryOperator chMap) Calls the full constructor with anReClassicParserusing the given character mapping.NfaBuilder(ReParser reParser, StateValueMerger<T> merger, double memoryForSpeedTradeFactor) Creates the builder. -
Method Summary
Modifier and TypeMethodDescriptionaddNonMatches(T prefixValue, T otherValue) Extends the top NFA on the stack to match (nearly) everything.Adds a value to the last state of the top NFA.This makes sure every prefix, except the empty string, of a string recognized by the top NFA on the stack is also recognized.and()Computes the automaton which recognizes the intersection of the languages recognized by the top two automata on the stack.and(CharSequence regex) Shortcut combining a call tomatchRegex(java.lang.CharSequence)andand().and(CharSequence regex, T value) clear()Clears the stack.Pops the top NFA from the stack, compiles it and returns the resulting DFA.Inverts the automaton, such that the resulting automaton recognizes the set-complement of the set of strings recognized by the given automaton.CallsFaToDot.xprint(java.lang.String, monq.jfa.NfaState<T>, java.util.Set<monq.jfa.NfaState<T>>)with the top NFA on the stack.Escapes special characters according to the rules of the currentReParserconfigured.voidescape(StringBuilder out, CharSequence in, int startAt) Escapes special characters according to the rules of the currentReParserconfigured, appending the output to the givenStringBuilder.intCallsfindPath(NfaState, NfaState, String, Set)with the the top NFA on the stack.intCallsfindPath(NfaState, NfaState, String, Set)with the the top NFA on the stack.intAn inefficient way to find a path through an NFA, starting at the given state.doublematchCharset(CharSequence pairs, boolean invert) Pushes an automaton recognizing a character set encoded as pairs of characters.Pushes an automaton that recognizes the empty string.Pushes an automaton that recognizes nothing.matchRegex(CharSequence regex) Uses the parser configured in theconstructorto parse the regular expression and leave the resulting NFA on the stack.matchRegex(CharSequence regex, T value) Shortcut combining a call tomatchRegex(CharSequence)andaddValue(T).Pushes an automaton which recognizes exactly the sequence of characters, which may be empty.matchString(CharSequence s, T value) Shortcut combiningmatchString(CharSequence)andaddValue(T).If the top NFA on the stack is equivalent to re, we docomplement("(.*re.*)?").optional()Transforms the top finite automaton to recognize the empty string too.or()Combines the top two NFA on the stack into one NFA matching strings either one of them matches.or(CharSequence regex) Shortcut combining a call tomatchRegex(java.lang.CharSequence)andor().or(CharSequence regex, T value) plus()Transforms the top NFA on the stack to recognize 1 or more copies of its matches, the operation typically provided by the"+"operator of regular expressions.pop()Pops the top NFA from the stack.pushCopy()Creates a copy of the structure of top NFA on the stack.LikepushCopy(Nfa).Creates a copy of the automaton starting atnfa.start()and pushes it.Pushes a copy of the top stack automaton yet without any values.seq()Combines the top two NFA on the stack into one NFA recognizing the sequence of the two individual matches: if the last two elements on the stack recognize, respectively, the stringssandt, the combined automaton recognizesst.seq(CharSequence regex) Shortcut combining a call tomatchRegex(CharSequence)andseq().seq(CharSequence regex, T value) voidsetMemoryForSpeedTradeFactor(float f) when creating DFA transitions during NFA to DFA compilation, the transition tables use different implementations, depending on how dense the transition table is.voidsetStateValueMerger(StateValueMerger<T> merger) The merger is needed when an NFA is compiled into a DFA.shortest()Transforms the top NFA on the stack into one which only recognizes the shortest matches of the given one.Returns the characters with special meaning for the currently configuredReParser.star()Applies the Kleene closure (*) operator to the top NFA on the stack.swap()Swaps the top 2 elements on the stack of automata.
-
Constructor Details
-
NfaBuilder
public NfaBuilder() -
NfaBuilder
Calls the full constructor with anReClassicParserusing the given character mapping.- Parameters:
chMap- to use for regular expression parsing
-
NfaBuilder
Creates the builder.- Parameters:
reParser- used in methods which accept a regular expressionmerger- used whenever an operation on the stack compiles the NFA into a DFA as an intermediate for final step, seesetStateValueMerger(StateValueMerger)memoryForSpeedTradeFactor- seesetMemoryForSpeedTradeFactor(float)
-
-
Method Details
-
setStateValueMerger
The merger is needed when an NFA is compiled into a DFA. Several values added with
addValue(T)may end up in the same state of a DFA during compilation. The merger then decides how to merge them into one value. -
getStateValueMerger
-
setMemoryForSpeedTradeFactor
public void setMemoryForSpeedTradeFactor(float f) when creating DFA transitions during NFA to DFA compilation, the transition tables use different implementations, depending on how dense the transition table is. The fastest transition table uses an array that is indexed by the transition character. For sparse tables, this is a significant memory overhead, therefore a denser table can be used which uses a binary search for lookup.
The factor given defines how much larger the fast table may be, before the denser table is used. If the factor is 1.0, the implementation with the smaller estimated memory footprint is used. The larger the factor is, the more likely is it that even sparse tables still use the faster implementation. With a factor of several thousand, nearly all transition tables will be fast, possibly using up a lot of heap space.
The default is 1.0. -
getMemoryForSpeedTradeFactor
public double getMemoryForSpeedTradeFactor() -
clear
Clears the stack. -
matchNothing
Pushes an automaton that recognizes nothing.
-
matchEpsilon
Pushes an automaton that recognizes the empty string.
-
matchString
Shortcut combiningmatchString(CharSequence)andaddValue(T). -
matchString
Pushes an automaton which recognizes exactly the sequence of characters, which may be empty. -
matchCharset
Pushes an automaton recognizing a character set encoded as pairs of characters. For example to recognize all lower case characters and all digits, the input would contain "az09".This method is mostly intended for the use by a regular expression parser. Normally one may just use
matchRegex("[a-z0-9]").- Parameters:
pairs- must have an even number of characters such that each pair of characters encodes an inclusive character range. In particular we must have a<=b.invert- if true, the character range is inverted
-
matchRegex
Uses the parser configured in theconstructorto parse the regular expression and leave the resulting NFA on the stack.- Parameters:
regex- to transform into an NFA- Throws:
ReSyntaxException- if the parser throws it
-
matchRegex
Shortcut combining a call tomatchRegex(CharSequence)andaddValue(T).- Throws:
ReSyntaxException
-
optional
Transforms the top finite automaton to recognize the empty string too. All actions defining stop states are kept as they are. -
plus
Transforms the top NFA on the stack to recognize 1 or more copies of its matches, the operation typically provided by the
"+"operator of regular expressions. -
star
Applies the Kleene closure (
*) operator to the top NFA on the stack. -
shortest
Transforms the top NFA on the stack into one which only recognizes the shortest matches of the given one. This operation is best understood by its implementation, which compiles the NFA into a DFA and then removes all outgoing links of any stop states, where "stop state" of the DFA is defined as a state representing a set of NFA states containing the last state.- Throws:
CompileDfaException- if the automaton can not be compiled.
-
complement
Inverts the automaton, such that the resulting automaton recognizes the set-complement of the set of strings recognized by the given automaton.
Stored values are not touched, though states storing them may be pruned when removing dead states. Insofar it is hardly useful to store values before inverting an automaton, as it may be difficult to understand
Note: Except for rare circumstances you rather want to use
notContained()instead. The behavior of this method, while theoretically sound, does not implement the intuitive request match everything but X for some regular expression X, because it matches all string which contain X like "aX" or "bXb".- Returns:
this- Throws:
CompileDfaException- if the automaton can not be compiled.
-
notContained
If the top NFA on the stack is equivalent to re, we docomplement("(.*re.*)?"). The resulting NFA neither recognizes any string containingrenor the empty string.The comment about "store values" for
complement()applies here too.Making the automaton optional with
"?"before computing the complement is slightly arbitrary. It prevents the result to match the empty string, which is often not intended. Further it is easy to add the empty string to the set of matching strings withoptional()than to remove the empty string match from the result.- Throws:
CompileDfaException- this operation requires compilation into a DFA which may throw this exception. SeecompileToDfa().
-
allPrefixes
This makes sure every prefix, except the empty string, of a string recognized by the top NFA on the stack is also recognized. If the NFA already matches the empty string, though, this is not changed.Although the empty string is technically a valid prefix match of all automata except those matching nothing, it is not added. This is because it is much easier to add it with
optional()than to remove an unwanted empty match.Technically, the automaton is compiled and then all last states get an epsilon transition to a new common last state. Further all non-last states also get an epsilon transition to the new last state.
In effect, all true prefixes (except the empty string) are also recognized.
Hint: If you want to differentiate in a compiled Dfa between the prefixes and the full strings recognized, first add a value, for example
"FULL"to the automaton. Then callallPrefixesand then add another value, for example"PREFIX". Finally usesetStateValueMerger(StateValueMerger)to define a merger which prefers"FULL"over"PREFIX", which may be easy to do withPrioritizedValueMerger.- Returns:
this- Throws:
CompileDfaException- if the NFA can not be compiled
-
addNonMatches
Extends the top NFA on the stack to match (nearly) everything. Suppose you want to filter a text for a regular expressionre=[a-z]+X, that is a string of lower-case characters followed by a singleX. Further suppose there are many strings matching[a-z]+but not followed by anX. Each time theDfafinds a prefix match but then fails to see theX, it has to push back the characters into the input and signal and fail. A loop around the matching, say ofDfaFilter, must then read one character, handle it somehow and callDfa.match(monq.jfa.CharSource, java.lang.StringBuilder)again, having made only one character progress and we know already that the next match attempts will fail too.This method does three things:
- It fills the automaton with matches for the non-empty prefixes and adds a custom value that may signal that it was just a prefix match.
- Duplicate and call
notContained()on the result of the above. This part matches everything which is neitherrenor a prefix of it and it can also get a custom value attached to signal this "other stuff" part. - Finally the two automata produced above are combined with
or().
The resulting automaton matches (nearly) everything somehow and retrieves a value
- signalling either a match to
re, - to a prefix of it or
- to some string which is not contained either of the above.
- Parameters:
prefixValue- to add to the automaton after callingallPrefixesotherValue- to add to matches which are neitherrenor prefix of it- Throws:
CompileDfaException
-
seq
Combines the top two NFA on the stack into one NFA recognizing the sequence of the two individual matches: if the last two elements on the stack recognize, respectively, the stringssandt, the combined automaton recognizesst. -
seq
Shortcut combining a call tomatchRegex(CharSequence)andseq().- Throws:
ReSyntaxException
-
seq
- Throws:
ReSyntaxException
-
or
Combines the top two NFA on the stack into one NFA matching strings either one of them matches. Put another way, the resulting automaton recognizes the union of the languages recognized by the input automata. -
or
Shortcut combining a call tomatchRegex(java.lang.CharSequence)andor().- Throws:
ReSyntaxException
-
or
- Throws:
ReSyntaxException
-
and
Computes the automaton which recognizes the intersection of the languages recognized by the top two automata on the stack.Implementation Note: This requires compilation. Consider two automata starting with START1-a->X and START2-ɛ->R-a->T. The intersection automaton would contain (START1,START2)-a->(X,T). But to figure this out, we need an epsilon closure of START2, so this would to a great part repeat compilation code anyway.
The state of the resulting automaton have a value if both states from which it was derived have a value and the value merger returns non
null.- Throws:
CompileDfaException- See Also:
-
and
Shortcut combining a call tomatchRegex(java.lang.CharSequence)andand().- Throws:
ReSyntaxExceptionCompileDfaException
-
and
- Throws:
ReSyntaxExceptionCompileDfaException
-
addValue
Adds a value to the last state of the top NFA. If a value is set already, a new last state with the given value is added with an epsilon from the current last state.
Note:If the automaton has already one or more values (i.e. stop states), the added value may conflict with some of those and the conflict must be resolved by the merger passed to the constructor as soon as an operation compiles the NFA.
Why is there no
setValue(...)?An automaton may have values set on several states. If, say, state A of those has an epsilon path to the last state, setting a value on the last state does not remove or invalidate the value set on A. Rather it adds to the values associated with A and it is the task of the merger passed to the constructor to merge those into a single value.
- Parameters:
value- may not benull
-
pushGraphCopy
Pushes a copy of the top stack automaton yet without any values. All values are removed from the copy. -
pushCopy
Creates a copy of the structure of top NFA on the stack. This works likepushCopy(Nfa), see there. -
pushCopy
Creates a copy of the automaton starting atnfa.start()and pushes it. The copy ofnfa.last()is used as the last state of the copied NFA. Only the structure of the automaton is copied, i.e. values are copied as references, not as deep copiesd. Compare this withList.addAll()which does not copy the list elements, only the list structure.HINT: If there is no path (may involve epsilon transitions) from
starttolast, further operations may have non-intuitive results. For example.matchString("...").seq()is effectively a no-op, because.seq()operates on the non-connectedlaststate.- Parameters:
nfa- to copy
-
pushCopy
LikepushCopy(Nfa). All copies of states deemed stop states by theDfa.getIsStopValue()are linked via epsilon transition to a new common last state of the resulting NFA. -
swap
Swaps the top 2 elements on the stack of automata. -
pop
Pops the top NFA from the stack. -
dotPrint
CallsFaToDot.xprint(java.lang.String, monq.jfa.NfaState<T>, java.util.Set<monq.jfa.NfaState<T>>)with the top NFA on the stack. -
findPath
CallsfindPath(NfaState, NfaState, String, Set)with the the top NFA on the stack. -
findPath
CallsfindPath(NfaState, NfaState, String, Set)with the the top NFA on the stack. -
findPath
An inefficient way to find a path through an NFA, starting at the given state. This is mostly used to test NFA structure.- Parameters:
start- where to startlast- if this state is in an epsilon closure we meet on our way, the epsilon closure "recognizes" the text read so fars- to recognizeterminal- is the set of states reached by traversing s to its end, independent of whether we hit a last state set or not- Returns:
- denotes the longest prefix of the given string which leads to a recognizing state, in particular 0, if the start state is a recognizing state. If no recognizing state is ever reached traversing the Nfa along the given string, -1 is returned.
-
compileToDfa
Pops the top NFA from the stack, compiles it and returns the resulting DFA.If the NFA does not have values, the resulting DFA's
Dfa.getAtStart(monq.jfa.CharSource)will not match anything: whileNfaBuildertreats the last state of an NFA as a recognizing state, theDfamethods cannot do this because aDfadoes not have a last state. Rather it uses itsDfa.getIsStopValue()predicate to determine if a state is a recognizing (aka stop) state.- Returns:
- the compile
Dfa - Throws:
CompileDfaException
-
escape
Escapes special characters according to the rules of the current
ReParserconfigured. -
escape
Escapes special characters according to the rules of the current
ReParserconfigured, appending the output to the givenStringBuilder. -
specialChars
Returns the characters with special meaning for the currently configuredReParser.
-