Package monq.jfa

Class NfaBuilder<T>

java.lang.Object
monq.jfa.NfaBuilder<T>

public final class NfaBuilder<T> extends Object

A builder for non-deterministic finite automata along the lines of Thompson's Construction and a compilation method to get a deterministic finite automaton.

This builder works as a stack machine to allow building of automata to be combined into larger automata by operations on the top elements of the automata-stack.

An NFA is defined by a start state and a last state, both of which are used as hooks for operations on NFA, for example concatenating the top stack elements on the stack basically adds an epsilon transition from the last state of the 2nd stack element to the start state of the top stack element. Most useful are NFA where there is a path from start to last, but this is not formally required or enforced. In fact matchNothing() creates a disconnected NFA.

Recognizing Nfa vs. Matching Dfa

Each state of an NFA may store a value of the generic type. Values can be added with addValue(T) to the last state of the top NFA on the stack. Values are kept throughout automaton operations. During compilation of the NFA, sets of states are combined into a single DFA state representing them. Since a DFA state can also only store one value, at this point the merger defined in the constructor or with setStateValueMerger(monq.jfa.StateValueMerger<T>) is invoked to create a single state from the potentially multiple values found in the set of NFA states.

To describe operations, below we use the term "recognize" to describe that, if the plain NFA being created would be used, it would recognize a given string in the classical sense that a path can be found from start to last state. As described above, though, this does not necessarily mean the same as "matching" with the compiled Dfa, depending on which states have values and whether they are actually stop indicating values. Nevertheless, method names tend to start with "match", historically.

Example

Once compiled into a Dfa, only states with values can be accepting (or stop) states of the automaton, depending on the predicate set with Dfa.withIsStopValue(Predicate). This means that a last state is not necessarily an accepting state. Only when a value is added to it and the isStopValue predicate is true, it represents an accepting state in the resulting DFA. On the other hand, a last state loses its last state status in most automaton operations, whether it has a value or not.

Example: After compilation into a Dfa, the Dfa, by default, treats non-null values as stop states. With

   var builder = new NfaBuilder<String>()
 

you may write

 builder.matchRegex("[a-z]+").addValue("TEXT").matchRegex("[0-9]+").addValue("TEXTNUM").seq()
 

to get an automaton which matches the regular expression "[a-z]+" and maps it to the value "TEXT", but also matches "[a-z]+[0-9]+" and maps it to the value "TEXTNUM", whereby the longest match is used. To match either one instead of the concatenation, use or() instead of seq().

To actually use the automaton for matching, compile it with compileToDfa() and use the Dfa's functions or a DfaFilter.

With FaToDot you may visualize the graph of an automaton using graphviz.

Invariants

The following invariants are maintained by operations on the NFA on the stack:
  • the start state has no incoming transitions
  • the stop state has no outgoing transitions
  • Constructor Details

  • Method Details

    • setStateValueMerger

      public void setStateValueMerger(StateValueMerger<T> merger)

      The merger is needed when an NFA is compiled into a DFA. Several values added with addValue(T) may end up in the same state of a DFA during compilation. The merger then decides how to merge them into one value.

    • getStateValueMerger

      public StateValueMerger<T> getStateValueMerger()
    • setMemoryForSpeedTradeFactor

      public void setMemoryForSpeedTradeFactor(float f)

      when creating DFA transitions during NFA to DFA compilation, the transition tables use different implementations, depending on how dense the transition table is. The fastest transition table uses an array that is indexed by the transition character. For sparse tables, this is a significant memory overhead, therefore a denser table can be used which uses a binary search for lookup.

      The factor given defines how much larger the fast table may be, before the denser table is used. If the factor is 1.0, the implementation with the smaller estimated memory footprint is used. The larger the factor is, the more likely is it that even sparse tables still use the faster implementation. With a factor of several thousand, nearly all transition tables will be fast, possibly using up a lot of heap space.

      The default is 1.0.
    • getMemoryForSpeedTradeFactor

      public double getMemoryForSpeedTradeFactor()
    • clear

      public NfaBuilder<T> clear()
      Clears the stack.
    • matchNothing

      public NfaBuilder<T> matchNothing()

      Pushes an automaton that recognizes nothing.

    • matchEpsilon

      public NfaBuilder<T> matchEpsilon()

      Pushes an automaton that recognizes the empty string.

    • matchString

      public NfaBuilder<T> matchString(CharSequence s, T value)
      Shortcut combining matchString(CharSequence) and addValue(T).
    • matchString

      public NfaBuilder<T> matchString(CharSequence s)
      Pushes an automaton which recognizes exactly the sequence of characters, which may be empty.
    • matchCharset

      public NfaBuilder<T> matchCharset(CharSequence pairs, boolean invert)
      Pushes an automaton recognizing a character set encoded as pairs of characters. For example to recognize all lower case characters and all digits, the input would contain "az09".

      This method is mostly intended for the use by a regular expression parser. Normally one may just use matchRegex("[a-z0-9]").

      Parameters:
      pairs - must have an even number of characters such that each pair of characters encodes an inclusive character range. In particular we must have a<=b.
      invert - if true, the character range is inverted
    • matchRegex

      public NfaBuilder<T> matchRegex(CharSequence regex) throws ReSyntaxException
      Uses the parser configured in the constructor to parse the regular expression and leave the resulting NFA on the stack.
      Parameters:
      regex - to transform into an NFA
      Throws:
      ReSyntaxException - if the parser throws it
    • matchRegex

      public NfaBuilder<T> matchRegex(CharSequence regex, T value) throws ReSyntaxException
      Shortcut combining a call to matchRegex(CharSequence) and addValue(T).
      Throws:
      ReSyntaxException
    • optional

      public NfaBuilder<T> optional()
      Transforms the top finite automaton to recognize the empty string too. All actions defining stop states are kept as they are.
    • plus

      public NfaBuilder<T> plus()

      Transforms the top NFA on the stack to recognize 1 or more copies of its matches, the operation typically provided by the "+" operator of regular expressions.

    • star

      public NfaBuilder<T> star()

      Applies the Kleene closure (*) operator to the top NFA on the stack.

    • shortest

      public NfaBuilder<T> shortest() throws CompileDfaException
      Transforms the top NFA on the stack into one which only recognizes the shortest matches of the given one. This operation is best understood by its implementation, which compiles the NFA into a DFA and then removes all outgoing links of any stop states, where "stop state" of the DFA is defined as a state representing a set of NFA states containing the last state.
      Throws:
      CompileDfaException - if the automaton can not be compiled.
    • complement

      public NfaBuilder<T> complement() throws CompileDfaException

      Inverts the automaton, such that the resulting automaton recognizes the set-complement of the set of strings recognized by the given automaton.

      Stored values are not touched, though states storing them may be pruned when removing dead states. Insofar it is hardly useful to store values before inverting an automaton, as it may be difficult to understand

      Note: Except for rare circumstances you rather want to use notContained() instead. The behavior of this method, while theoretically sound, does not implement the intuitive request match everything but X for some regular expression X, because it matches all string which contain X like "aX" or "bXb".

      Returns:
      this
      Throws:
      CompileDfaException - if the automaton can not be compiled.
    • notContained

      public NfaBuilder<T> notContained() throws CompileDfaException
      If the top NFA on the stack is equivalent to re, we do complement("(.*re.*)?"). The resulting NFA neither recognizes any string containing re nor the empty string.

      The comment about "store values" for complement() applies here too.

      Making the automaton optional with "?" before computing the complement is slightly arbitrary. It prevents the result to match the empty string, which is often not intended. Further it is easy to add the empty string to the set of matching strings with optional() than to remove the empty string match from the result.

      Throws:
      CompileDfaException - this operation requires compilation into a DFA which may throw this exception. See compileToDfa().
    • allPrefixes

      public NfaBuilder<T> allPrefixes() throws CompileDfaException
      This makes sure every prefix, except the empty string, of a string recognized by the top NFA on the stack is also recognized. If the NFA already matches the empty string, though, this is not changed.

      Although the empty string is technically a valid prefix match of all automata except those matching nothing, it is not added. This is because it is much easier to add it with optional() than to remove an unwanted empty match.

      Technically, the automaton is compiled and then all last states get an epsilon transition to a new common last state. Further all non-last states also get an epsilon transition to the new last state.

      In effect, all true prefixes (except the empty string) are also recognized.

      Hint: If you want to differentiate in a compiled Dfa between the prefixes and the full strings recognized, first add a value, for example "FULL" to the automaton. Then call allPrefixes and then add another value, for example "PREFIX". Finally use setStateValueMerger(StateValueMerger) to define a merger which prefers "FULL" over "PREFIX", which may be easy to do with PrioritizedValueMerger.

      Returns:
      this
      Throws:
      CompileDfaException - if the NFA can not be compiled
    • addNonMatches

      public NfaBuilder<T> addNonMatches(T prefixValue, T otherValue) throws CompileDfaException
      Extends the top NFA on the stack to match (nearly) everything. Suppose you want to filter a text for a regular expression re=[a-z]+X, that is a string of lower-case characters followed by a single X. Further suppose there are many strings matching [a-z]+ but not followed by an X. Each time the Dfa finds a prefix match but then fails to see the X, it has to push back the characters into the input and signal and fail. A loop around the matching, say of DfaFilter, must then read one character, handle it somehow and call Dfa.match(monq.jfa.CharSource, java.lang.StringBuilder) again, having made only one character progress and we know already that the next match attempts will fail too.

      This method does three things:

      1. It fills the automaton with matches for the non-empty prefixes and adds a custom value that may signal that it was just a prefix match.
      2. Duplicate and call notContained() on the result of the above. This part matches everything which is neither re nor a prefix of it and it can also get a custom value attached to signal this "other stuff" part.
      3. Finally the two automata produced above are combined with or().

      The resulting automaton matches (nearly) everything somehow and retrieves a value

      1. signalling either a match to re,
      2. to a prefix of it or
      3. to some string which is not contained either of the above.
      When filtering long text, this can be much faster than the original automaton if long partial matches are frequent. What "long" and "frequent" mean depends on so many factors, that it requires testing. It depends not the least on the specific regular expression as well as the text structure. A prediction is not possible.
      Parameters:
      prefixValue - to add to the automaton after calling allPrefixes
      otherValue - to add to matches which are neither re nor prefix of it
      Throws:
      CompileDfaException
    • seq

      public NfaBuilder<T> seq()
      Combines the top two NFA on the stack into one NFA recognizing the sequence of the two individual matches: if the last two elements on the stack recognize, respectively, the strings s and t, the combined automaton recognizes st.
    • seq

      public NfaBuilder<T> seq(CharSequence regex) throws ReSyntaxException
      Shortcut combining a call to matchRegex(CharSequence) and seq().
      Throws:
      ReSyntaxException
    • seq

      public NfaBuilder<T> seq(CharSequence regex, T value) throws ReSyntaxException
      Shortcut combining a call to matchRegex(CharSequence), addValue(T) and seq().
      Throws:
      ReSyntaxException
    • or

      public NfaBuilder<T> or()
      Combines the top two NFA on the stack into one NFA matching strings either one of them matches. Put another way, the resulting automaton recognizes the union of the languages recognized by the input automata.
    • or

      public NfaBuilder<T> or(CharSequence regex) throws ReSyntaxException
      Shortcut combining a call to matchRegex(java.lang.CharSequence) and or().
      Throws:
      ReSyntaxException
    • or

      public NfaBuilder<T> or(CharSequence regex, T value) throws ReSyntaxException
      Shortcut combining a call to matchRegex(java.lang.CharSequence), addValue(T) and or().
      Throws:
      ReSyntaxException
    • and

      public NfaBuilder<T> and() throws CompileDfaException
      Computes the automaton which recognizes the intersection of the languages recognized by the top two automata on the stack.

      Implementation Note: This requires compilation. Consider two automata starting with START1-a->X and START2-ɛ->R-a->T. The intersection automaton would contain (START1,START2)-a->(X,T). But to figure this out, we need an epsilon closure of START2, so this would to a great part repeat compilation code anyway.

      The state of the resulting automaton have a value if both states from which it was derived have a value and the value merger returns non null.

      Throws:
      CompileDfaException
      See Also:
    • and

      Shortcut combining a call to matchRegex(java.lang.CharSequence) and and().
      Throws:
      ReSyntaxException
      CompileDfaException
    • and

      public NfaBuilder<T> and(CharSequence regex, T value) throws ReSyntaxException, CompileDfaException
      Shortcut combining a call to matchRegex(java.lang.CharSequence), addValue(T) and and().
      Throws:
      ReSyntaxException
      CompileDfaException
    • addValue

      public NfaBuilder<T> addValue(T value)

      Adds a value to the last state of the top NFA. If a value is set already, a new last state with the given value is added with an epsilon from the current last state.

      Note:If the automaton has already one or more values (i.e. stop states), the added value may conflict with some of those and the conflict must be resolved by the merger passed to the constructor as soon as an operation compiles the NFA.

      Why is there no setValue(...)?

      An automaton may have values set on several states. If, say, state A of those has an epsilon path to the last state, setting a value on the last state does not remove or invalidate the value set on A. Rather it adds to the values associated with A and it is the task of the merger passed to the constructor to merge those into a single value.

      Parameters:
      value - may not be null
    • pushGraphCopy

      public NfaBuilder<T> pushGraphCopy()
      Pushes a copy of the top stack automaton yet without any values. All values are removed from the copy.
    • pushCopy

      public NfaBuilder<T> pushCopy()
      Creates a copy of the structure of top NFA on the stack. This works like pushCopy(Nfa), see there.
    • pushCopy

      public NfaBuilder<T> pushCopy(Nfa<T> nfa)
      Creates a copy of the automaton starting at nfa.start() and pushes it. The copy of nfa.last() is used as the last state of the copied NFA. Only the structure of the automaton is copied, i.e. values are copied as references, not as deep copiesd. Compare this with List.addAll() which does not copy the list elements, only the list structure.

      HINT: If there is no path (may involve epsilon transitions) from start to last, further operations may have non-intuitive results. For example .matchString("...").seq() is effectively a no-op, because .seq() operates on the non-connected last state.

      Parameters:
      nfa - to copy
    • pushCopy

      public NfaBuilder<T> pushCopy(Dfa<T> dfa)
      Like pushCopy(Nfa). All copies of states deemed stop states by the Dfa.getIsStopValue() are linked via epsilon transition to a new common last state of the resulting NFA.
    • swap

      public NfaBuilder<T> swap()
      Swaps the top 2 elements on the stack of automata.
    • pop

      public Nfa<T> pop()
      Pops the top NFA from the stack.
    • dotPrint

      public NfaBuilder<T> dotPrint(String filename)
    • findPath

      public int findPath(String text)
      Calls findPath(NfaState, NfaState, String, Set) with the the top NFA on the stack.
    • findPath

      public int findPath(String text, Set<NfaState<T>> terminalStates)
      Calls findPath(NfaState, NfaState, String, Set) with the the top NFA on the stack.
    • findPath

      public int findPath(NfaState<T> start, NfaState<T> last, String s, Set<NfaState<T>> terminal)
      An inefficient way to find a path through an NFA, starting at the given state. This is mostly used to test NFA structure.
      Parameters:
      start - where to start
      last - if this state is in an epsilon closure we meet on our way, the epsilon closure "recognizes" the text read so far
      s - to recognize
      terminal - is the set of states reached by traversing s to its end, independent of whether we hit a last state set or not
      Returns:
      denotes the longest prefix of the given string which leads to a recognizing state, in particular 0, if the start state is a recognizing state. If no recognizing state is ever reached traversing the Nfa along the given string, -1 is returned.
    • compileToDfa

      public Dfa<T> compileToDfa() throws CompileDfaException
      Pops the top NFA from the stack, compiles it and returns the resulting DFA.

      If the NFA does not have values, the resulting DFA's Dfa.getAtStart(monq.jfa.CharSource) will not match anything: while NfaBuilder treats the last state of an NFA as a recognizing state, the Dfa methods cannot do this because a Dfa does not have a last state. Rather it uses its Dfa.getIsStopValue() predicate to determine if a state is a recognizing (aka stop) state.

      Returns:
      the compile Dfa
      Throws:
      CompileDfaException
    • escape

      public String escape(String s)

      Escapes special characters according to the rules of the current ReParser configured.

    • escape

      public void escape(StringBuilder out, CharSequence in, int startAt)

      Escapes special characters according to the rules of the current ReParser configured, appending the output to the given StringBuilder.

    • specialChars

      public String specialChars()
      Returns the characters with special meaning for the currently configured ReParser.