monq.jfa.NfaBuilder<T>

public final class NfaBuilder<T> extends Object

A builder for non-deterministic finite automata along the lines of Thompson's Construction and a compilation method to get a deterministic finite automaton.

This builder works as a stack machine to allow building of automata to be combined into larger automata by operations on the top elements of the automata-stack.

An NFA is defined by a start state and a last state, both of which are used as hooks for operations on NFA, for example concatenating the top stack elements on the stack basically adds an epsilon transition from the last state of the 2nd stack element to the start state of the top stack element. Most useful are NFA where there is a path from start to last, but this is not formally required or enforced. In fact matchNothing() creates a disconnected NFA.

Recognizing Nfa vs. Matching Dfa

Each state of an NFA may store a value of the generic type. Values can be added with addValue(T) to the last state of the top NFA on the stack. Values are kept throughout automaton operations. During compilation of the NFA, sets of states are combined into a single DFA state representing them. Since a DFA state can also only store one value, at this point the merger defined in the constructor or with setStateValueMerger(monq.jfa.StateValueMerger<T>) is invoked to create a single state from the potentially multiple values found in the set of NFA states.

To describe operations, below we use the term "recognize" to describe that, if the plain NFA being created would be used, it would recognize a given string in the classical sense that a path can be found from start to last state. As described above, though, this does not necessarily mean the same as "matching" with the compiled Dfa, depending on which states have values and whether they are actually stop indicating values. Nevertheless, method names tend to start with "match", historically.

Example

Once compiled into a Dfa, only states with values can be accepting (or stop) states of the automaton, depending on the predicate set with Dfa.withIsStopValue(Predicate). This means that a last state is not necessarily an accepting state. Only when a value is added to it and the isStopValue predicate is true, it represents an accepting state in the resulting DFA. On the other hand, a last state loses its last state status in most automaton operations, whether it has a value or not.

Example: After compilation into a Dfa, the Dfa, by default, treats non-null values as stop states. With

   var builder = new NfaBuilder<String>()

you may write

 builder.matchRegex("[a-z]+").addValue("TEXT").matchRegex("[0-9]+").addValue("TEXTNUM").seq()

to get an automaton which matches the regular expression "[a-z]+" and maps it to the value "TEXT", but also matches "[a-z]+[0-9]+" and maps it to the value "TEXTNUM", whereby the longest match is used. To match either one instead of the concatenation, use or() instead of seq().

To actually use the automaton for matching, compile it with compileToDfa() and use the Dfa's functions or a DfaFilter.

With FaToDot you may visualize the graph of an automaton using graphviz.

Invariants

The following invariants are maintained by operations on the NFA on the stack:

the start state has no incoming transitions
the stop state has no outgoing transitions

Constructor Summary

Constructors

Constructor

Description

NfaBuilder()

Calls the other constructor with ReClassicParser, StateValueMerger.strict() and 1.0.

NfaBuilder(CharUnaryOperator chMap)

Calls the full constructor with an ReClassicParser using the given character mapping.

NfaBuilder(ReParser reParser, StateValueMerger<T> merger, double memoryForSpeedTradeFactor)

Creates the builder.
Method Summary

Modifier and Type

Method

Description

NfaBuilder<T>

addNonMatches(T prefixValue, T otherValue)

Extends the top NFA on the stack to match (nearly) everything.

NfaBuilder<T>

addValue(T value)

Adds a value to the last state of the top NFA.

NfaBuilder<T>

allPrefixes()

This makes sure every prefix, except the empty string, of a string recognized by the top NFA on the stack is also recognized.

NfaBuilder<T>

and()

Computes the automaton which recognizes the intersection of the languages recognized by the top two automata on the stack.

NfaBuilder<T>

and(CharSequence regex)

Shortcut combining a call to matchRegex(java.lang.CharSequence) and and().

NfaBuilder<T>

and(CharSequence regex, T value)

Shortcut combining a call to matchRegex(java.lang.CharSequence), addValue(T) and and().

NfaBuilder<T>

clear()

Clears the stack.

Dfa<T>

compileToDfa()

Pops the top NFA from the stack, compiles it and returns the resulting DFA.

NfaBuilder<T>

complement()

Inverts the automaton, such that the resulting automaton recognizes the set-complement of the set of strings recognized by the given automaton.

NfaBuilder<T>

dotPrint(String filename)

Calls FaToDot.xprint(java.lang.String, monq.jfa.NfaState<T>, java.util.Set<monq.jfa.NfaState<T>>) with the top NFA on the stack.

String

escape(String s)

Escapes special characters according to the rules of the current ReParser configured.

void

escape(StringBuilder out, CharSequence in, int startAt)

Escapes special characters according to the rules of the current ReParser configured, appending the output to the given StringBuilder.

int

findPath(String text)

Calls findPath(NfaState, NfaState, String, Set) with the the top NFA on the stack.

int

findPath(String text, Set<NfaState<T>> terminalStates)

Calls findPath(NfaState, NfaState, String, Set) with the the top NFA on the stack.

int

findPath(NfaState<T> start, NfaState<T> last, String s, Set<NfaState<T>> terminal)

An inefficient way to find a path through an NFA, starting at the given state.

double

getMemoryForSpeedTradeFactor()

StateValueMerger<T>

getStateValueMerger()

NfaBuilder<T>

matchCharset(CharSequence pairs, boolean invert)

Pushes an automaton recognizing a character set encoded as pairs of characters.

NfaBuilder<T>

matchEpsilon()

Pushes an automaton that recognizes the empty string.

NfaBuilder<T>

matchNothing()

Pushes an automaton that recognizes nothing.

NfaBuilder<T>

matchRegex(CharSequence regex)

Uses the parser configured in the constructor to parse the regular expression and leave the resulting NFA on the stack.

NfaBuilder<T>

matchRegex(CharSequence regex, T value)

Shortcut combining a call to matchRegex(CharSequence) and addValue(T).

NfaBuilder<T>

matchString(CharSequence s)

Pushes an automaton which recognizes exactly the sequence of characters, which may be empty.

NfaBuilder<T>

matchString(CharSequence s, T value)

Shortcut combining matchString(CharSequence) and addValue(T).

NfaBuilder<T>

notContained()

If the top NFA on the stack is equivalent to re, we do complement("(.*re.*)?").

NfaBuilder<T>

optional()

Transforms the top finite automaton to recognize the empty string too.

NfaBuilder<T>

or()

Combines the top two NFA on the stack into one NFA matching strings either one of them matches.

NfaBuilder<T>

or(CharSequence regex)

Shortcut combining a call to matchRegex(java.lang.CharSequence) and or().

NfaBuilder<T>

or(CharSequence regex, T value)

Shortcut combining a call to matchRegex(java.lang.CharSequence), addValue(T) and or().

NfaBuilder<T>

plus()

Transforms the top NFA on the stack to recognize 1 or more copies of its matches, the operation typically provided by the "+" operator of regular expressions.

Nfa<T>

pop()

Pops the top NFA from the stack.

NfaBuilder<T>

pushCopy()

Creates a copy of the structure of top NFA on the stack.

NfaBuilder<T>

pushCopy(Dfa<T> dfa)

Like pushCopy(Nfa).

NfaBuilder<T>

pushCopy(Nfa<T> nfa)

Creates a copy of the automaton starting at nfa.start() and pushes it.

NfaBuilder<T>

pushGraphCopy()

Pushes a copy of the top stack automaton yet without any values.

NfaBuilder<T>

seq()

Combines the top two NFA on the stack into one NFA recognizing the sequence of the two individual matches: if the last two elements on the stack recognize, respectively, the strings s and t, the combined automaton recognizes st.

NfaBuilder<T>

seq(CharSequence regex)

Shortcut combining a call to matchRegex(CharSequence) and seq().

NfaBuilder<T>

seq(CharSequence regex, T value)

Shortcut combining a call to matchRegex(CharSequence), addValue(T) and seq().

void

setMemoryForSpeedTradeFactor(float f)

when creating DFA transitions during NFA to DFA compilation, the transition tables use different implementations, depending on how dense the transition table is.

void

setStateValueMerger(StateValueMerger<T> merger)

The merger is needed when an NFA is compiled into a DFA.

NfaBuilder<T>

shortest()

Transforms the top NFA on the stack into one which only recognizes the shortest matches of the given one.

String

specialChars()

Returns the characters with special meaning for the currently configured ReParser.

NfaBuilder<T>

star()

Applies the Kleene closure (*) operator to the top NFA on the stack.

NfaBuilder<T>

swap()

Swaps the top 2 elements on the stack of automata.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- NfaBuilder
  
  public NfaBuilder()
  
  Calls the other constructor with ReClassicParser, StateValueMerger.strict() and 1.0.
- NfaBuilder
  
  public NfaBuilder(CharUnaryOperator chMap)
  
  Calls the full constructor with an ReClassicParser using the given character mapping.
  
  Parameters:
  
  chMap - to use for regular expression parsing
- NfaBuilder
  
  public NfaBuilder(ReParser reParser, StateValueMerger<T> merger, double memoryForSpeedTradeFactor)
  
  Creates the builder.
  
  Parameters:
  
  reParser - used in methods which accept a regular expression
  
  merger - used whenever an operation on the stack compiles the NFA into a DFA as an intermediate for final step, see setStateValueMerger(StateValueMerger)
  
  memoryForSpeedTradeFactor - see setMemoryForSpeedTradeFactor(float)
Method Details
- setStateValueMerger
  
  public void setStateValueMerger(StateValueMerger<T> merger)
  
  The merger is needed when an NFA is compiled into a DFA. Several values added with addValue(T) may end up in the same state of a DFA during compilation. The merger then decides how to merge them into one value.
- getStateValueMerger
  
  public StateValueMerger<T> getStateValueMerger()
- setMemoryForSpeedTradeFactor
  
  public void setMemoryForSpeedTradeFactor(float f)
  
  when creating DFA transitions during NFA to DFA compilation, the transition tables use different implementations, depending on how dense the transition table is. The fastest transition table uses an array that is indexed by the transition character. For sparse tables, this is a significant memory overhead, therefore a denser table can be used which uses a binary search for lookup.
  
  The factor given defines how much larger the fast table may be, before the denser table is used. If the factor is 1.0, the implementation with the smaller estimated memory footprint is used. The larger the factor is, the more likely is it that even sparse tables still use the faster implementation. With a factor of several thousand, nearly all transition tables will be fast, possibly using up a lot of heap space.
  The default is 1.0.
- getMemoryForSpeedTradeFactor
  
  public double getMemoryForSpeedTradeFactor()
- clear
  
  public NfaBuilder<T> clear()
  
  Clears the stack.
- matchNothing
  
  public NfaBuilder<T> matchNothing()
  
  Pushes an automaton that recognizes nothing.
- matchEpsilon
  
  public NfaBuilder<T> matchEpsilon()
  
  Pushes an automaton that recognizes the empty string.
- matchString
  
  public NfaBuilder<T> matchString(CharSequence s, T value)
  
  Shortcut combining matchString(CharSequence) and addValue(T).
- matchString
  
  public NfaBuilder<T> matchString(CharSequence s)
  
  Pushes an automaton which recognizes exactly the sequence of characters, which may be empty.
- matchCharset
  
  public NfaBuilder<T> matchCharset(CharSequence pairs, boolean invert)
  
  Pushes an automaton recognizing a character set encoded as pairs of characters. For example to recognize all lower case characters and all digits, the input would contain "az09".
  This method is mostly intended for the use by a regular expression parser. Normally one may just use matchRegex("[a-z0-9]").
  
  Parameters:
  
  pairs - must have an even number of characters such that each pair of characters encodes an inclusive character range. In particular we must have a<=b.
  
  invert - if true, the character range is inverted
- matchRegex
  
  public NfaBuilder<T> matchRegex(CharSequence regex) throws ReSyntaxException
  
  Uses the parser configured in the constructor to parse the regular expression and leave the resulting NFA on the stack.
  
  Parameters:
  
  regex - to transform into an NFA
  
  Throws:
  
  ReSyntaxException - if the parser throws it
- matchRegex
  
  public NfaBuilder<T> matchRegex(CharSequence regex, T value) throws ReSyntaxException
  
  Shortcut combining a call to matchRegex(CharSequence) and addValue(T).
  
  Throws:
  
  ReSyntaxException
- optional
  
  public NfaBuilder<T> optional()
  
  Transforms the top finite automaton to recognize the empty string too. All actions defining stop states are kept as they are.
- plus
  
  public NfaBuilder<T> plus()
  
  Transforms the top NFA on the stack to recognize 1 or more copies of its matches, the operation typically provided by the "+" operator of regular expressions.
- star
  
  public NfaBuilder<T> star()
  
  Applies the Kleene closure (*) operator to the top NFA on the stack.
- shortest
  
  public NfaBuilder<T> shortest() throws CompileDfaException
  
  Transforms the top NFA on the stack into one which only recognizes the shortest matches of the given one. This operation is best understood by its implementation, which compiles the NFA into a DFA and then removes all outgoing links of any stop states, where "stop state" of the DFA is defined as a state representing a set of NFA states containing the last state.
  
  Throws:
  
  CompileDfaException - if the automaton can not be compiled.
- complement
  
  public NfaBuilder<T> complement() throws CompileDfaException
  
  Inverts the automaton, such that the resulting automaton recognizes the set-complement of the set of strings recognized by the given automaton.
  
  Stored values are not touched, though states storing them may be pruned when removing dead states. Insofar it is hardly useful to store values before inverting an automaton, as it may be difficult to understand
  
  Note: Except for rare circumstances you rather want to use notContained() instead. The behavior of this method, while theoretically sound, does not implement the intuitive request match everything but X for some regular expression X, because it matches all string which contain X like "aX" or "bXb".
  
  Returns:
  
  this
  
  Throws:
  
  CompileDfaException - if the automaton can not be compiled.
- notContained
  
  public NfaBuilder<T> notContained() throws CompileDfaException
  
  If the top NFA on the stack is equivalent to re, we do complement("(.*re.*)?"). The resulting NFA neither recognizes any string containing re nor the empty string.
  The comment about "store values" for complement() applies here too.
  
  Making the automaton optional with "?" before computing the complement is slightly arbitrary. It prevents the result to match the empty string, which is often not intended. Further it is easy to add the empty string to the set of matching strings with optional() than to remove the empty string match from the result.
  
  Throws:
  
  CompileDfaException - this operation requires compilation into a DFA which may throw this exception. See compileToDfa().
- allPrefixes
  
  public NfaBuilder<T> allPrefixes() throws CompileDfaException
  
  This makes sure every prefix, except the empty string, of a string recognized by the top NFA on the stack is also recognized. If the NFA already matches the empty string, though, this is not changed.
  Although the empty string is technically a valid prefix match of all automata except those matching nothing, it is not added. This is because it is much easier to add it with optional() than to remove an unwanted empty match.
  
  Technically, the automaton is compiled and then all last states get an epsilon transition to a new common last state. Further all non-last states also get an epsilon transition to the new last state.
  
  In effect, all true prefixes (except the empty string) are also recognized.
  
  Hint: If you want to differentiate in a compiled Dfa between the prefixes and the full strings recognized, first add a value, for example "FULL" to the automaton. Then call allPrefixes and then add another value, for example "PREFIX". Finally use setStateValueMerger(StateValueMerger) to define a merger which prefers "FULL" over "PREFIX", which may be easy to do with PrioritizedValueMerger.
  
  Returns:
  
  this
  
  Throws:
  
  CompileDfaException - if the NFA can not be compiled
- addNonMatches
  
  public NfaBuilder<T> addNonMatches(T prefixValue, T otherValue) throws CompileDfaException
  Extends the top NFA on the stack to match (nearly) everything. Suppose you want to filter a text for a regular expression re=[a-z]+X, that is a string of lower-case characters followed by a single X. Further suppose there are many strings matching [a-z]+ but not followed by an X. Each time the Dfa finds a prefix match but then fails to see the X, it has to push back the characters into the input and signal and fail. A loop around the matching, say of DfaFilter, must then read one character, handle it somehow and call Dfa.match(monq.jfa.CharSource, java.lang.StringBuilder) again, having made only one character progress and we know already that the next match attempts will fail too.
  This method does three things:
  
  It fills the automaton with matches for the non-empty prefixes and adds a custom value that may signal that it was just a prefix match.
  
  Duplicate and call notContained() on the result of the above. This part matches everything which is neither re nor a prefix of it and it can also get a custom value attached to signal this "other stuff" part.
  
  Finally the two automata produced above are combined with or().
  
  The resulting automaton matches (nearly) everything somehow and retrieves a value
  
  signalling either a match to re,
  
  to a prefix of it or
  
  to some string which is not contained either of the above.
  
  When filtering long text, this can be much faster than the original automaton if long partial matches are frequent. What "long" and "frequent" mean depends on so many factors, that it requires testing. It depends not the least on the specific regular expression as well as the text structure. A prediction is not possible.
  Parameters:
  
  prefixValue - to add to the automaton after calling allPrefixes
  
  otherValue - to add to matches which are neither re nor prefix of it
  
  Throws:
  
  CompileDfaException
- seq
  
  public NfaBuilder<T> seq()
  
  Combines the top two NFA on the stack into one NFA recognizing the sequence of the two individual matches: if the last two elements on the stack recognize, respectively, the strings s and t, the combined automaton recognizes st.
- seq
  
  public NfaBuilder<T> seq(CharSequence regex) throws ReSyntaxException
  
  Shortcut combining a call to matchRegex(CharSequence) and seq().
  
  Throws:
  
  ReSyntaxException
- seq
  
  public NfaBuilder<T> seq(CharSequence regex, T value) throws ReSyntaxException
  
  Shortcut combining a call to matchRegex(CharSequence), addValue(T) and seq().
  
  Throws:
  
  ReSyntaxException
- or
  
  public NfaBuilder<T> or()
  
  Combines the top two NFA on the stack into one NFA matching strings either one of them matches. Put another way, the resulting automaton recognizes the union of the languages recognized by the input automata.
- or
  
  public NfaBuilder<T> or(CharSequence regex) throws ReSyntaxException
  
  Shortcut combining a call to matchRegex(java.lang.CharSequence) and or().
  
  Throws:
  
  ReSyntaxException
- or
  
  public NfaBuilder<T> or(CharSequence regex, T value) throws ReSyntaxException
  
  Shortcut combining a call to matchRegex(java.lang.CharSequence), addValue(T) and or().
  
  Throws:
  
  ReSyntaxException
- and
  
  public NfaBuilder<T> and() throws CompileDfaException
  
  Computes the automaton which recognizes the intersection of the languages recognized by the top two automata on the stack.
  Implementation Note: This requires compilation. Consider two automata starting with START1-a->X and START2-ɛ->R-a->T. The intersection automaton would contain (START1,START2)-a->(X,T). But to figure this out, we need an epsilon closure of START2, so this would to a great part repeat compilation code anyway.
  
  The state of the resulting automaton have a value if both states from which it was derived have a value and the value merger returns non null.
  Throws:
  
  CompileDfaException
  
  See Also:
  
  state complexity of regular language intersection
- and
  
  public NfaBuilder<T> and(CharSequence regex) throws ReSyntaxException, CompileDfaException
  
  Shortcut combining a call to matchRegex(java.lang.CharSequence) and and().
  
  Throws:
  
  ReSyntaxException
  
  CompileDfaException
- and
  
  public NfaBuilder<T> and(CharSequence regex, T value) throws ReSyntaxException, CompileDfaException
  
  Shortcut combining a call to matchRegex(java.lang.CharSequence), addValue(T) and and().
  
  Throws:
  
  ReSyntaxException
  
  CompileDfaException
- addValue
  
  public NfaBuilder<T> addValue(T value)
  
  Adds a value to the last state of the top NFA. If a value is set already, a new last state with the given value is added with an epsilon from the current last state.
  
  Note:If the automaton has already one or more values (i.e. stop states), the added value may conflict with some of those and the conflict must be resolved by the merger passed to the constructor as soon as an operation compiles the NFA.
  
  Why is there no setValue(...)?
  
  An automaton may have values set on several states. If, say, state A of those has an epsilon path to the last state, setting a value on the last state does not remove or invalidate the value set on A. Rather it adds to the values associated with A and it is the task of the merger passed to the constructor to merge those into a single value.
  
  Parameters:
  
  value - may not be null
- pushGraphCopy
  
  public NfaBuilder<T> pushGraphCopy()
  
  Pushes a copy of the top stack automaton yet without any values. All values are removed from the copy.
- pushCopy
  
  public NfaBuilder<T> pushCopy()
  
  Creates a copy of the structure of top NFA on the stack. This works like pushCopy(Nfa), see there.
- pushCopy
  
  public NfaBuilder<T> pushCopy(Nfa<T> nfa)
  
  Creates a copy of the automaton starting at nfa.start() and pushes it. The copy of nfa.last() is used as the last state of the copied NFA. Only the structure of the automaton is copied, i.e. values are copied as references, not as deep copiesd. Compare this with List.addAll() which does not copy the list elements, only the list structure.
  HINT: If there is no path (may involve epsilon transitions) from start to last, further operations may have non-intuitive results. For example .matchString("...").seq() is effectively a no-op, because .seq() operates on the non-connected last state.
  
  Parameters:
  
  nfa - to copy
- pushCopy
  
  public NfaBuilder<T> pushCopy(Dfa<T> dfa)
  
  Like pushCopy(Nfa). All copies of states deemed stop states by the Dfa.getIsStopValue() are linked via epsilon transition to a new common last state of the resulting NFA.
- swap
  
  public NfaBuilder<T> swap()
  
  Swaps the top 2 elements on the stack of automata.
- pop
  
  public Nfa<T> pop()
  
  Pops the top NFA from the stack.
- dotPrint
  
  public NfaBuilder<T> dotPrint(String filename)
  
  Calls FaToDot.xprint(java.lang.String, monq.jfa.NfaState<T>, java.util.Set<monq.jfa.NfaState<T>>) with the top NFA on the stack.
- findPath
  
  public int findPath(String text)
  
  Calls findPath(NfaState, NfaState, String, Set) with the the top NFA on the stack.
- findPath
  
  public int findPath(String text, Set<NfaState<T>> terminalStates)
  
  Calls findPath(NfaState, NfaState, String, Set) with the the top NFA on the stack.
- findPath
  
  public int findPath(NfaState<T> start, NfaState<T> last, String s, Set<NfaState<T>> terminal)
  
  An inefficient way to find a path through an NFA, starting at the given state. This is mostly used to test NFA structure.
  
  Parameters:
  
  start - where to start
  
  last - if this state is in an epsilon closure we meet on our way, the epsilon closure "recognizes" the text read so far
  
  s - to recognize
  
  terminal - is the set of states reached by traversing s to its end, independent of whether we hit a last state set or not
  
  Returns:
  
  denotes the longest prefix of the given string which leads to a recognizing state, in particular 0, if the start state is a recognizing state. If no recognizing state is ever reached traversing the Nfa along the given string, -1 is returned.
- compileToDfa
  
  public Dfa<T> compileToDfa() throws CompileDfaException
  
  Pops the top NFA from the stack, compiles it and returns the resulting DFA.
  If the NFA does not have values, the resulting DFA's Dfa.getAtStart(monq.jfa.CharSource) will not match anything: while NfaBuilder treats the last state of an NFA as a recognizing state, the Dfa methods cannot do this because a Dfa does not have a last state. Rather it uses its Dfa.getIsStopValue() predicate to determine if a state is a recognizing (aka stop) state.
  
  Returns:
  
  the compile Dfa
  
  Throws:
  
  CompileDfaException
- escape
  
  public String escape(String s)
  
  Escapes special characters according to the rules of the current ReParser configured.
- escape
  
  public void escape(StringBuilder out, CharSequence in, int startAt)
  
  Escapes special characters according to the rules of the current ReParser configured, appending the output to the given StringBuilder.
- specialChars
  
  public String specialChars()
  
  Returns the characters with special meaning for the currently configured ReParser.

Class NfaBuilder<T>

Recognizing Nfa vs. Matching Dfa

Example

Invariants

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Constructor Details

NfaBuilder

NfaBuilder

NfaBuilder

Method Details

setStateValueMerger

getStateValueMerger

setMemoryForSpeedTradeFactor

getMemoryForSpeedTradeFactor

clear

matchNothing

matchEpsilon

matchString

matchString

matchCharset

matchRegex

matchRegex

optional

plus

star

shortest

complement

notContained

allPrefixes

addNonMatches

seq

seq

seq

or

or

or

and

and

and

addValue

Why is there no setValue(...)?

pushGraphCopy

pushCopy

pushCopy

pushCopy

swap

pop

dotPrint

findPath

findPath

findPath

compileToDfa

escape

escape

specialChars

Why is there no `setValue(...)`?