Module regex::dfa
The DFA matching engine.
A DFA provides faster matching because the engine is in exactly one state at any point in time. In the NFA, there may be multiple active states, and considerable CPU cycles are spent shuffling them around. In finite automata speak, the DFA follows epsilon transitions in the regex far less than the NFA.
A DFA is a classic trade-off between time and space. The NFA is slower, but its memory requirements are typically small and predictable. The DFA is faster, but given the right regex and the right input, the number of states in the DFA can grow exponentially. To mitigate this space problem, we do two things:
- We implement an online DFA. That is, the DFA is constructed from the NFA during a search. When a new state is computed, it is stored in a cache so that it may be reused. An important consequence of this implementation is that states that are never reached for a particular input are never computed. (This is impossible in an "offline" DFA which needs to compute all possible states up front.)
- If the cache gets too big, we wipe it and continue matching.
In pathological cases, a new state can be created for every byte of input (for example, the regex (a|b)*a(a|b){20} on a long sequence of a's and b's). In this case, performance regresses to slightly slower than the full NFA simulation, in large part because the cache becomes useless. If the cache is wiped too frequently, the DFA quits and control falls back to one of the NFA simulations.
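The two mitigations above can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: the `Nfa`, `LazyDfa`, and `toy_nfa` names are hypothetical, the toy NFA has no epsilon transitions, and the cache limit is a plain count of DFA states rather than a byte budget. Each DFA state is a set of NFA states, computed only when the input first reaches it, and the cache is wiped (keeping the current state) once it exceeds the limit. The sketch only counts wipes; it does not model quitting and falling back to an NFA simulation.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy NFA without epsilon transitions: `edges[s]` lists `(byte, target)` pairs.
struct Nfa {
    edges: Vec<Vec<(u8, usize)>>,
    start: usize,
    accept: usize,
}

/// NFA for (a|b)*a(a|b){2}, a small cousin of the pathological regex above.
fn toy_nfa() -> Nfa {
    Nfa {
        edges: vec![
            vec![(b'a', 0), (b'b', 0), (b'a', 1)],
            vec![(b'a', 2), (b'b', 2)],
            vec![(b'a', 3), (b'b', 3)],
            vec![],
        ],
        start: 0,
        accept: 3,
    }
}

/// Lazily built DFA: states and transitions are computed on demand.
struct LazyDfa<'a> {
    nfa: &'a Nfa,
    states: Vec<BTreeSet<usize>>,         // DFA state id -> set of NFA states
    ids: HashMap<BTreeSet<usize>, usize>, // set of NFA states -> DFA state id
    trans: HashMap<(usize, u8), usize>,   // cached transitions
    max_states: usize,                    // soft cache limit (assumption)
    wipes: usize,                         // how often the cache was cleared
}

impl<'a> LazyDfa<'a> {
    fn new(nfa: &'a Nfa, max_states: usize) -> Self {
        LazyDfa {
            nfa,
            states: Vec::new(),
            ids: HashMap::new(),
            trans: HashMap::new(),
            max_states,
            wipes: 0,
        }
    }

    /// Returns the id for `set`, allocating a new DFA state if needed.
    fn intern(&mut self, set: BTreeSet<usize>) -> usize {
        if let Some(&id) = self.ids.get(&set) {
            return id;
        }
        let id = self.states.len();
        self.states.push(set.clone());
        self.ids.insert(set, id);
        id
    }

    /// Follows one byte, computing the target state only on a cache miss.
    fn step(&mut self, mut cur: usize, byte: u8) -> usize {
        if let Some(&next) = self.trans.get(&(cur, byte)) {
            return next;
        }
        let mut set = BTreeSet::new();
        for &s in &self.states[cur] {
            for &(b, t) in &self.nfa.edges[s] {
                if b == byte {
                    set.insert(t);
                }
            }
        }
        if !self.ids.contains_key(&set) && self.states.len() >= self.max_states {
            // Cache is full: wipe it, keep the current state, and carry on.
            let keep = self.states[cur].clone();
            self.states.clear();
            self.ids.clear();
            self.trans.clear();
            self.wipes += 1;
            cur = self.intern(keep);
        }
        let next = self.intern(set);
        self.trans.insert((cur, byte), next);
        next
    }

    /// Reports whether the whole input is accepted.
    fn is_match(&mut self, input: &[u8]) -> bool {
        let mut start = BTreeSet::new();
        start.insert(self.nfa.start);
        let mut cur = self.intern(start);
        for &b in input {
            cur = self.step(cur, b);
        }
        self.states[cur].contains(&self.nfa.accept)
    }
}
```

With a generous limit the cache is never wiped; with a tiny one (say, two states) the same answers come back, but the wipe counter climbs, which is the situation the paragraph above describes as the cache becoming useless.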
Because of the "lazy" nature of this DFA, the inner matching loop is considerably more complex than one might expect out of a DFA. A number of tricks are employed to make it fast. Tread carefully.
N.B. While this implementation is heavily commented, Russ Cox's series of articles on regexes is strongly recommended: https://swtch.com/~rsc/regexp/ (As is the DFA implementation in RE2, which heavily influenced this implementation.)
Re-exports
use std::collections::HashMap;
use std::fmt;
use std::iter::repeat;
use std::mem;
use exec::ProgramCache;
use prog::Inst;
use prog::Program;
use sparse::SparseSet;
Structs
Byte | Byte is a u8 in spirit, but a u16 in practice so that we can represent the special EOF sentinel value.
Cache | A reusable cache of DFA states.
CacheInner
EmptyFlags | A set of flags for zero-width assertions.
Fsm | Fsm encapsulates the actual execution of the DFA.
InstPtrs
State
StateFlags | A set of flags describing various configurations of a DFA state. This is represented by a
Transitions | The transition table.
TransitionsRow
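The Byte entry above describes a value that is "a u8 in spirit, but a u16 in practice". A minimal sketch of that idea follows. `SketchByte` and its methods are hypothetical names, and the choice of 256 as the sentinel is an assumption made for illustration: it is simply the first value that no real byte can take.

```rust
/// Hypothetical stand-in for the module's `Byte`: wide enough to hold
/// every u8 value plus one extra "end of input" marker.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SketchByte(u16);

impl SketchByte {
    /// Sentinel value (assumption): one past the largest real byte.
    const EOF: u16 = 256;

    /// Wraps an ordinary input byte.
    fn byte(b: u8) -> Self {
        SketchByte(u16::from(b))
    }

    /// Builds the end-of-input sentinel.
    fn eof() -> Self {
        SketchByte(Self::EOF)
    }

    fn is_eof(self) -> bool {
        self.0 == Self::EOF
    }

    /// Returns the underlying byte, or `None` for the sentinel.
    fn as_byte(self) -> Option<u8> {
        if self.is_eof() {
            None
        } else {
            Some(self.0 as u8)
        }
    }
}
```

Treating end of input as one more input symbol means a transition table can have a column for it, so end-of-text handling needs no separate code path in the matching loop.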
Enums
Result | The result of running the DFA.
Constants
STATE_DEAD | A dead state means that the state has been computed and it is known that once it is entered, no future match can ever occur.
STATE_MATCH | A match state means that the regex has successfully matched.
STATE_MAX | The maximum state pointer. This is useful to mask out the "valid" state pointer from a state with the "start" or "match" bits set.
STATE_QUIT | A quit state means that the DFA came across some input that it doesn't know how to process correctly. The DFA should quit and another matching engine should be run in its place.
STATE_START | A start state is a state that the DFA can start in.
STATE_UNKNOWN | An unknown state means that the state has not been computed yet, and that the only way to progress is to compute it.
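The STATE_MAX description above implies that a state pointer carries flag bits alongside the actual index. The sketch below shows one way such a scheme could work. The `SKETCH_*` constants and their bit positions are assumptions chosen for illustration and may not match the module's real values.

```rust
/// Hypothetical state pointer: low bits are an index, high bits are flags.
type SketchStatePtr = u32;

// Bit positions below are assumptions made for this sketch.
const SKETCH_START: SketchStatePtr = 1 << 30;
const SKETCH_MATCH: SketchStatePtr = 1 << 29;
/// Largest plain index; also the mask that strips the flag bits.
const SKETCH_MAX: SketchStatePtr = SKETCH_MATCH - 1;

/// Marks a state pointer as a match state.
fn mark_match(si: SketchStatePtr) -> SketchStatePtr {
    si | SKETCH_MATCH
}

/// Marks a state pointer as a start state.
fn mark_start(si: SketchStatePtr) -> SketchStatePtr {
    si | SKETCH_START
}

fn is_match(si: SketchStatePtr) -> bool {
    si & SKETCH_MATCH != 0
}

fn is_start(si: SketchStatePtr) -> bool {
    si & SKETCH_START != 0
}

/// Recovers the "valid" index by masking out the flag bits.
fn index_of(si: SketchStatePtr) -> SketchStatePtr {
    si & SKETCH_MAX
}
```

Packing flags into the pointer lets a matching loop test "is this a match or start state?" with a single comparison or mask on a value it already holds, without loading the state itself.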
Functions
can_exec | Return true if and only if the given program can be executed by a DFA.
push_inst_ptr | Adds ip to data using delta encoding with respect to prev.
read_vari32 | https://developers.google.com/protocol-buffers/docs/encoding#varints
read_varu32 | https://developers.google.com/protocol-buffers/docs/encoding#varints
show_state_ptr
usize_to_u32
vb | Helper function for formatting a byte as a nice-to-read escaped string.
write_vari32 | https://developers.google.com/protocol-buffers/docs/encoding#varints
write_varu32 | https://developers.google.com/protocol-buffers/docs/encoding#varints
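The listing above combines delta encoding with the varint scheme from the linked Protocol Buffers page. The following is a rough sketch of how those pieces fit together. The function names (`encode_varu32`, `decode_varu32`, `encode_vari32`, `decode_vari32`, `push_delta`, `decode_ptrs`) are hypothetical and deliberately differ from the module's own. Using ZigZag for the signed case is an assumption based on the linked page, and the decoders assume well-formed input.

```rust
/// Writes `n` as a base-128 varint: 7 bits per byte, high bit = "more follows".
fn encode_varu32(data: &mut Vec<u8>, mut n: u32) {
    while n >= 0x80 {
        data.push((n as u8) | 0x80);
        n >>= 7;
    }
    data.push(n as u8);
}

/// Reads one varint, returning `(value, bytes_consumed)`.
/// Returns `(0, 0)` if the input ends before the final byte.
/// Assumes at most five bytes per value (well-formed input).
fn decode_varu32(data: &[u8]) -> (u32, usize) {
    let mut n: u32 = 0;
    let mut shift: u32 = 0;
    for (i, &b) in data.iter().enumerate() {
        if b < 0x80 {
            return (n | (u32::from(b) << shift), i + 1);
        }
        n |= (u32::from(b) & 0x7F) << shift;
        shift += 7;
    }
    (0, 0)
}

/// ZigZag-encodes a signed value so small magnitudes stay small, then writes it.
fn encode_vari32(data: &mut Vec<u8>, n: i32) {
    let mut un = (n as u32) << 1;
    if n < 0 {
        un = !un;
    }
    encode_varu32(data, un);
}

/// Inverse of `encode_vari32`.
fn decode_vari32(data: &[u8]) -> (i32, usize) {
    let (un, used) = decode_varu32(data);
    let mut n = (un >> 1) as i32;
    if un & 1 != 0 {
        n = !n;
    }
    (n, used)
}

/// Appends `ip` as a signed delta from `prev`, then updates `prev`.
fn push_delta(data: &mut Vec<u8>, prev: &mut u32, ip: u32) {
    let delta = (ip as i32) - (*prev as i32);
    encode_vari32(data, delta);
    *prev = ip;
}

/// Decodes a whole delta-encoded sequence back into absolute pointers.
fn decode_ptrs(mut data: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    let mut prev: i32 = 0;
    while !data.is_empty() {
        let (delta, used) = decode_vari32(data);
        if used == 0 {
            break; // truncated input
        }
        prev += delta;
        out.push(prev as u32);
        data = &data[used..];
    }
    out
}
```

Instruction pointers within one state tend to sit close together, so their deltas are small and most of them fit in a single byte. That keeps each cached state compact.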
Type Definitions
InstPtr
StatePtr