Build a DFA from a regular expression

This algorithm builds a deterministic finite automaton (DFA) equivalent to a regular expression that it is given. While the application in sequence of the McNaughton-Yamada-Thompson algorithm followed by the subset construction yields a DFA from a regular expression, this algorithm generates a DFA directly without the intermediate NDFA.

The first step in this algorithm is to extend the regular expression to make sure that all words in the language end with the same character (different from the rest of characters in the alphabet). Then a syntax tree for the regular expression is built. All leaf nodes in this tree are labeled with a position, with the exception of the leaves corresponding to the symbol ε that is not labeled with a position. The position allows the algorithm to distinguish the different occurrences of a symbol in a regular expression. Now the functions nullable, firstpos, lastpos, followpos are computed in sequence.

INPUT: A regular expression r.

OUTPUT: A DFA D that recognizes L(r).

METHOD:

  1. Build a syntactic tree T from the augmented regular expression (r)#.
  2. Compute nullable nodes, fistpos, lastpos and nextpos for T (as it is shown below).
  3. Build the transition table of the automaton D (as it is shown below).
A node is nullable only if the node fulfils one of these criteria:
  • It contains the empty symbol (epsilon).
  • It contains the union expression and one of its children is nullable.
  • It contains the concatenation expression and its both children are nullable.
  • It contains the closure expression.
An example of all these cases is below, they are highlighted with a red circle. Ejemplo de nodos anulables
Each node has two attributes: the firstpos and the lastpos, they indicate what symbols are expected before or after the current node. Since the same symbol may appear several times in a regular expression, actually their positions are used.
The following rules are used to compute the firstpos:
  • If the node is a leave node for epsilon, its firstpos set is empty.
  • If the node is a leaf node for a symbol of the alphabet, the firstpos set has only one position, which corresponds with the position of the symbol.
  • If the node is for the selection operator, the firstpos set is the union of its children's firstpos.
  • If the node is for the concatenation operator, the firstpos set is the firstpos of its left child and, if the left child is nullable, it also contains the firstpos of its right child.
  • If the node is for the Kleene star operator, the firstpos set is the same as its child.
The following rules are used to compute the lastpos:
  • If the node is a leave node for epsilon, its lastpos set is empty.
  • If the node is a leaf node for a symbol of the alphabet, the lastpos set has only one position, which corresponds with the position of the symbol.
  • If the node is for the selection operator, the lastpos set is the union of its children's lastpos.
  • If the node is for the concatenation operator, the lastpos set is the lastpos of its right child and, if the right child is nullable, it also contains the lastpos of its left child.
  • If the node is for the Kleene star operator, the lastpos set is the same as its child.
An example of the syntactic tree of the automaton to the a.b|c*expression is shown below. Ejemplo del árbol a|b.c*

The function followpos(p) takes one position as an argument (p), and returns all the positions that could follow it. To calculate this function only the concatenation and closure nodes are considered:

  • If n is a concatenation node with left child c1 and right child c2, and p is a position within lastpos(c1), then all positions in firstpos(c2) are added to followpos(p).
  • If n is a Kleene star node, and p is a position in lastpos(n), then all positions of firstpos(n) are added to followpos(p).

Once the followpos function is obtained, the syntactic tree is no longer needed. The transition table can be calculated using only the followpos function. Each different set of positions will be a new state in the automaton. The initial state of the automaton is the set of positions in firstpos of the root node. Now, given a state S, to determine the state N that is reached when the symbol a is read, we first select all the positions in S that label a leaf node with the symbol a, then, we make the union of the followpos of all those selected positions. Or in a more formal way: Cálculo de f(S,a)

To obtain the automaton, in JSON format, just use the address /ahoSethiUllmanJson with get expr parameter, where expr is the regular expression that you want.

For example: a|b.c*

Instructions

Characters to use in precedence order:
  • Parenthesis: ( )
  • Closure: *
  • Concatenation .
  • Union: |
  • Symbols: lowercase letters [a-z]
  • Epsilon: ε, E, €

Try a regular expression