Module regex::dfa

The DFA matching engine.

A DFA provides faster matching because the engine is in exactly one state at any point in time. In the NFA, there may be multiple active states, and considerable CPU cycles are spent shuffling them around. In finite automata speak, the DFA follows epsilon transitions in the regex far less often than the NFA does.

A DFA is a classic trade-off between time and space. The NFA is slower, but its memory requirements are typically small and predictable. The DFA is faster, but given the right regex and the right input, the number of states in the DFA can grow exponentially. To mitigate this space problem, we do two things:

  1. We implement an online DFA. That is, the DFA is constructed from the NFA during a search. When a new state is computed, it is stored in a cache so that it may be reused. An important consequence of this implementation is that states that are never reached for a particular input are never computed. (This is impossible in an "offline" DFA which needs to compute all possible states up front.)
  2. If the cache gets too big, we wipe it and continue matching.
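The two mitigations above can be illustrated with a minimal sketch. It is not this module's implementation: the `Nfa` and `LazyDfa` types, the `example_nfa` helper, the `max_states` limit, and the choice of whole-input matching on an NFA without epsilon transitions are all assumptions made to keep the example small.

```rust
use std::collections::{BTreeSet, HashMap};

/// A toy NFA without epsilon transitions (a simplifying assumption).
struct Nfa {
    /// transitions[s] lists the (byte, target) edges leaving NFA state s.
    transitions: Vec<Vec<(u8, usize)>>,
    start: usize,
    accept: usize,
}

/// Hypothetical NFA for `(a|b)*ab`, matched against the whole input.
fn example_nfa() -> Nfa {
    Nfa {
        transitions: vec![
            vec![(b'a', 0), (b'b', 0), (b'a', 1)], // state 0: loop, or guess the final "ab"
            vec![(b'b', 2)],                       // state 1: saw 'a', need 'b'
            vec![],                                // state 2: accepting, no edges
        ],
        start: 0,
        accept: 2,
    }
}

/// A lazily built DFA. Each DFA state is a set of NFA states.
struct LazyDfa<'a> {
    nfa: &'a Nfa,
    /// Maps an NFA state set to its DFA state id.
    ids: HashMap<BTreeSet<usize>, usize>,
    /// sets[id] is the NFA state set behind DFA state `id`.
    sets: Vec<BTreeSet<usize>>,
    /// Cached transitions: (DFA state id, byte) -> DFA state id.
    next: HashMap<(usize, u8), usize>,
    /// Soft limit on cached states, enforced only in `step`.
    max_states: usize,
    /// Number of times the cache has been wiped.
    wipes: usize,
}

impl<'a> LazyDfa<'a> {
    fn new(nfa: &'a Nfa, max_states: usize) -> Self {
        LazyDfa {
            nfa,
            ids: HashMap::new(),
            sets: Vec::new(),
            next: HashMap::new(),
            max_states,
            wipes: 0,
        }
    }

    /// Returns the id for `set`, allocating a new DFA state if needed.
    fn intern(&mut self, set: BTreeSet<usize>) -> usize {
        if let Some(&id) = self.ids.get(&set) {
            return id;
        }
        let id = self.sets.len();
        self.ids.insert(set.clone(), id);
        self.sets.push(set);
        id
    }

    /// Follows one byte from `cur`, computing the target state on demand.
    fn step(&mut self, cur: usize, byte: u8) -> usize {
        if let Some(&id) = self.next.get(&(cur, byte)) {
            return id; // Cache hit: no NFA work at all.
        }
        let mut target = BTreeSet::new();
        for &s in &self.sets[cur] {
            for &(b, t) in &self.nfa.transitions[s] {
                if b == byte {
                    target.insert(t);
                }
            }
        }
        if !self.ids.contains_key(&target) && self.sets.len() >= self.max_states {
            // Cache is full: wipe everything and keep only the new state.
            self.ids.clear();
            self.sets.clear();
            self.next.clear();
            self.wipes += 1;
            return self.intern(target);
        }
        let id = self.intern(target);
        self.next.insert((cur, byte), id);
        id
    }

    /// Reports whether the whole input matches.
    fn is_match(&mut self, input: &[u8]) -> bool {
        let start = self.nfa.start;
        let mut cur = self.intern(BTreeSet::from([start]));
        for &b in input {
            cur = self.step(cur, b);
        }
        self.sets[cur].contains(&self.nfa.accept)
    }
}
```

For instance, `LazyDfa::new(&example_nfa(), 16).is_match(b"abab")` builds only the states that the input actually reaches. With a very small `max_states`, the same search still produces the right answer but wipes the cache along the way, which is the behavior item 2 describes.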

In pathological cases, a new state can be created for every byte of input. (For example, the regex (a|b)*a(a|b){20} on a long sequence of a's and b's.) In this case, performance regresses to slightly slower than the full NFA simulation, in large part because the cache becomes useless. If the cache is wiped too frequently, the DFA quits and control falls back to one of the NFA simulations.

Because of the "lazy" nature of this DFA, the inner matching loop is considerably more complex than one might expect out of a DFA. A number of tricks are employed to make it fast. Tread carefully.

N.B. While this implementation is heavily commented, Russ Cox's series of articles on regexes is strongly recommended: https://swtch.com/~rsc/regexp/ (As is the DFA implementation in RE2, which heavily influenced this implementation.)

Structs

Byte

Byte is a u8 in spirit, but a u16 in practice so that we can represent the special EOF sentinel value.
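A rough sketch of the idea, assuming the sentinel is the first value that does not fit in a u8 (256). The method names below are illustrative and are not taken from this page.

```rust
/// A byte-or-EOF value stored in a u16 (assumed layout).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Byte(u16);

impl Byte {
    /// Wraps an ordinary input byte.
    fn byte(b: u8) -> Byte {
        Byte(b as u16)
    }

    /// The EOF sentinel. 256 is an assumption: any value above 255 would do.
    fn eof() -> Byte {
        Byte(256)
    }

    fn is_eof(self) -> bool {
        self.0 == 256
    }

    /// Returns the underlying byte, or None for the EOF sentinel.
    fn as_byte(self) -> Option<u8> {
        if self.is_eof() {
            None
        } else {
            Some(self.0 as u8)
        }
    }
}
```

The point of the wider type is that EOF can never collide with a real input byte, so it can be fed through the same transition lookup as any other symbol.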

Cache

A reusable cache of DFA states.

CacheInner

CacheInner is logically just a part of Cache, but groups together fields that aren't passed as function parameters throughout search. (This split is mostly an artifact of the borrow checker. It is happily paid.)

EmptyFlags

A set of flags for zero-width assertions.

Fsm

Fsm encapsulates the actual execution of the DFA.

InstPtrs

State

State is a DFA state. It contains an ordered set of NFA states (not necessarily complete) and a smattering of flags.

StateFlags

A set of flags describing various configurations of a DFA state. This is represented by a u8 so that it is compact.

Transitions

The transition table.

TransitionsRow

Enums

Result

The result of running the DFA.

Constants

STATE_DEAD

A dead state means that the state has been computed and it is known that once it is entered, no future match can ever occur.

STATE_MATCH

A match state means that the regex has successfully matched.

STATE_MAX

The maximum state pointer. This is useful to mask out the "valid" state pointer from a state with the "start" or "match" bits set.

STATE_QUIT

A quit state means that the DFA came across some input that it doesn't know how to process correctly. The DFA should quit and another matching engine should be run in its place.

STATE_START

A start state is a state that the DFA can start in.

STATE_UNKNOWN

An unknown state means that the state has not been computed yet, and that the only way to progress is to compute it.
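One way the constants above could fit together is sketched below. The concrete bit positions are assumptions chosen for illustration, and the helper functions are hypothetical. The sketch only shows how a "start" or "match" bit can ride along in the high bits of a state pointer and be masked off with STATE_MAX.

```rust
type StatePtr = u32;

// Assumed bit layout: sentinels live at the top of the u32 range,
// while "start" and "match" are flag bits OR-ed into a valid pointer.
const STATE_UNKNOWN: StatePtr = 1 << 31;
const STATE_DEAD: StatePtr = STATE_UNKNOWN + 1;
const STATE_QUIT: StatePtr = STATE_DEAD + 1;
const STATE_START: StatePtr = 1 << 30;
const STATE_MATCH: StatePtr = 1 << 29;
const STATE_MAX: StatePtr = STATE_MATCH - 1;

/// Masks out the flag bits, leaving only the "valid" state pointer.
fn strip_flags(si: StatePtr) -> StatePtr {
    si & STATE_MAX
}

/// True if the match bit is set on this pointer.
fn is_match(si: StatePtr) -> bool {
    (si & STATE_MATCH) != 0
}

/// True if the start bit is set on this pointer.
fn is_start(si: StatePtr) -> bool {
    (si & STATE_START) != 0
}

/// True for the sentinels that are never real table offsets.
fn is_sentinel(si: StatePtr) -> bool {
    si == STATE_UNKNOWN || si == STATE_DEAD || si == STATE_QUIT
}
```

Under this layout, any pointer at or below STATE_MAX is an ordinary state, so an inner loop can test for "something special happened" with a single comparison before looking at individual bits.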

Functions

can_exec

Return true if and only if the given program can be executed by a DFA.

push_inst_ptr

Adds ip to data using delta encoding with respect to prev.

read_vari32

See https://developers.google.com/protocol-buffers/docs/encoding#varints for a description of the varint encoding.

read_varu32

See https://developers.google.com/protocol-buffers/docs/encoding#varints for a description of the varint encoding.

show_state_ptr

usize_to_u32

vb

Helper function for formatting a byte as a nice-to-read escaped string.

write_vari32

See https://developers.google.com/protocol-buffers/docs/encoding#varints for a description of the varint encoding.

write_varu32

See https://developers.google.com/protocol-buffers/docs/encoding#varints for a description of the varint encoding.
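The varint helpers and push_inst_ptr listed above can be sketched as follows. The signatures are assumptions, since this page does not show them. The signed variant uses the ZigZag mapping from the linked protocol buffers documentation, and `decode_inst_ptrs` is a hypothetical helper added only to show the round trip.

```rust
/// Appends `n` as a little-endian base-128 varint.
fn write_varu32(data: &mut Vec<u8>, mut n: u32) {
    while n >= 0x80 {
        data.push((n as u8) | 0x80); // low 7 bits, continuation bit set
        n >>= 7;
    }
    data.push(n as u8);
}

/// Reads one varint, returning the value and the number of bytes consumed.
/// Returns None on empty, truncated, or overlong input.
fn read_varu32(data: &[u8]) -> Option<(u32, usize)> {
    let mut n: u32 = 0;
    let mut shift: u32 = 0;
    for (i, &b) in data.iter().enumerate() {
        if shift > 28 {
            return None; // more than 5 bytes cannot encode a u32
        }
        if b < 0x80 {
            return Some((n | ((b as u32) << shift), i + 1));
        }
        n |= ((b as u32) & 0x7F) << shift;
        shift += 7;
    }
    None
}

/// Appends a signed value using ZigZag, so small magnitudes stay short.
fn write_vari32(data: &mut Vec<u8>, n: i32) {
    let zigzag = ((n << 1) ^ (n >> 31)) as u32;
    write_varu32(data, zigzag);
}

/// Reads one ZigZag-encoded signed varint.
fn read_vari32(data: &[u8]) -> Option<(i32, usize)> {
    let (un, len) = read_varu32(data)?;
    let n = ((un >> 1) as i32) ^ -((un & 1) as i32);
    Some((n, len))
}

/// Adds `ip` to `data` as a delta against `prev`, then updates `prev`.
fn push_inst_ptr(data: &mut Vec<u8>, prev: &mut u32, ip: u32) {
    let delta = (ip as i32).wrapping_sub(*prev as i32);
    write_vari32(data, delta);
    *prev = ip;
}

/// Hypothetical inverse of repeated `push_inst_ptr` calls starting at prev = 0.
fn decode_inst_ptrs(mut data: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    let mut prev: i32 = 0;
    while let Some((delta, len)) = read_vari32(data) {
        prev = prev.wrapping_add(delta);
        out.push(prev as u32);
        data = &data[len..];
    }
    out
}
```

Delta encoding pays off because instruction pointers within one state tend to be close together, so most deltas fit in a single byte even when the pointers themselves would not.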

Type Definitions

InstPtr

InstPtr is a 32-bit pointer into a sequence of opcodes (i.e., it indexes an NFA state).

StatePtr

StatePtr is a 32-bit pointer to the start of a row in the transition table.
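To make "pointer to the start of a row" concrete, here is a small sketch of a flat transition table. The field names, the `num_byte_classes` row width, the method names, and the STATE_UNKNOWN value are assumptions for illustration only.

```rust
type StatePtr = u32;

/// Assumed sentinel marking a transition that has not been computed yet.
const STATE_UNKNOWN: StatePtr = 1 << 31;

/// A flat table: one row per DFA state, one column per byte class.
struct Transitions {
    table: Vec<StatePtr>,
    num_byte_classes: usize,
}

impl Transitions {
    fn new(num_byte_classes: usize) -> Self {
        Transitions { table: Vec::new(), num_byte_classes }
    }

    /// Appends a row of unknown transitions and returns a pointer to its start.
    fn add(&mut self) -> StatePtr {
        let si = self.table.len() as StatePtr;
        self.table
            .extend(std::iter::repeat(STATE_UNKNOWN).take(self.num_byte_classes));
        si
    }

    /// Looks up the transition out of `si` on byte class `cls`.
    fn next(&self, si: StatePtr, cls: usize) -> StatePtr {
        self.table[si as usize + cls]
    }

    /// Records the transition out of `si` on byte class `cls`.
    fn set_next(&mut self, si: StatePtr, cls: usize, to: StatePtr) {
        self.table[si as usize + cls] = to;
    }
}
```

Because the pointer is already a row offset rather than a state index, a lookup is one addition and one load, with no multiplication by the row width.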