Struct regex::literal::BoyerMooreSearch
pub struct BoyerMooreSearch { pattern: Vec<u8>, skip_table: Vec<usize>, guard: u8, guard_reverse_idx: usize, md2_shift: usize, }
An implementation of Tuned Boyer-Moore as laid out by Andrew Hume and Daniel Sunday in "Fast String Searching". O(n) in the size of the input.
Fast string searching algorithms come in many variations, but they can generally be described in terms of three main components.
The skip loop is where the string searcher wants to spend as much time as possible. Exactly which character in the pattern the skip loop examines varies from algorithm to algorithm, but in the simplest case this loop repeatedly looks at the last character in the pattern and jumps forward in the input if it is not in the pattern. Robert Boyer and J Moore called this the "fast" loop in their original paper.
The match loop is responsible for actually examining the whole potentially matching substring. In order to fail faster, the match loop sometimes has a guard test attached. The guard test uses frequency analysis of the different characters in the pattern to choose the least frequently occurring character and uses it to find match failures as quickly as possible.
The shift rule governs how the algorithm will shuffle its test window in the event of a failure during the match loop. Certain shift rules allow the worst-case run time of the algorithm to be shown to be O(n) in the size of the input rather than O(nm) in the size of the input and the size of the pattern (as naive Boyer-Moore is).
"Fast String Searching", in addition to presenting a tuned algorithm, provides a comprehensive taxonomy of the many different flavors of string searchers. Under that taxonomy TBM, the algorithm implemented here, uses an unrolled fast skip loop with memchr fallback, a forward match loop with guard, and the mini Sunday's delta shift rule. To unpack that you'll have to read the paper.
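To make the three components concrete, here is a minimal sketch of a plain Horspool-style searcher. It is not the tuned algorithm implemented by this type: there is no unrolling, no memchr fallback, and no guard test. The names `build_skip_table` and `horspool_find` are hypothetical and exist only for illustration.

```rust
/// Hypothetical helper: for each byte value, how far the window may move
/// when that byte is aligned with the last position of the pattern.
fn build_skip_table(pattern: &[u8]) -> Vec<usize> {
    let m = pattern.len();
    // Bytes that do not occur in the pattern allow a full-length jump.
    let mut table = vec![m; 256];
    // Every position except the last records its distance to the pattern end.
    for (i, &b) in pattern.iter().enumerate().take(m.saturating_sub(1)) {
        table[b as usize] = m - 1 - i;
    }
    table
}

/// Hypothetical Horspool-style search, returning the start of the first match.
fn horspool_find(pattern: &[u8], haystack: &[u8]) -> Option<usize> {
    let m = pattern.len();
    if m == 0 {
        return Some(0);
    }
    if haystack.len() < m {
        return None;
    }
    let table = build_skip_table(pattern);
    let last = pattern[m - 1];
    // `end` is the haystack index aligned with the last pattern byte.
    let mut end = m - 1;
    while end < haystack.len() {
        // Skip loop: jump forward until the last pattern byte lines up.
        while haystack[end] != last {
            end += table[haystack[end] as usize];
            if end >= haystack.len() {
                return None;
            }
        }
        // Match loop: compare the whole candidate window.
        let start = end + 1 - m;
        if &haystack[start..=end] == pattern {
            return Some(start);
        }
        // Shift rule: move to the previous occurrence of the last byte in
        // the pattern, or past the whole window if there is none.
        end += table[last as usize];
    }
    None
}
```

For example, `horspool_find(b"needle", b"haystack with a needle inside")` returns `Some(16)`.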
Fields
pattern: Vec<u8>
The pattern we are going to look for in the haystack.
skip_table: Vec<usize>
The skip table for the skip loop.
Maps the character at the end of the input to a shift.
guard: u8
The guard character (least frequently occurring char).
guard_reverse_idx: usize
The reverse-index of the guard character in the pattern.
md2_shift: usize
Daniel Sunday's mini generalized delta2 shift table.
We use a skip loop, so we only have to provide a shift for the skip char (last char). This is why it is a mini shift rule.
Methods
impl BoyerMooreSearch
fn new(pattern: Vec<u8>) -> Self
Create a new string searcher, performing whatever compilation steps are required.
fn find(&self, haystack: &[u8]) -> Option<usize>
Find the pattern in haystack, returning the offset of the start of the first occurrence of the pattern in haystack.
fn len(&self) -> usize
fn should_use(pattern: &[u8]) -> bool
The key heuristic behind which the BoyerMooreSearch lives.
See rust-lang/regex/issues/408.
Tuned Boyer-Moore is actually pretty slow! It turns out a handrolled platform-specific memchr routine with a bit of frequency analysis sprinkled on top actually wins most of the time. However, there are a few cases where Tuned Boyer-Moore still wins.
If the haystack is random, frequency analysis doesn't help us, so Boyer-Moore will win for sufficiently large needles. Unfortunately, there is no obvious way to determine this ahead of time.
If the pattern itself consists of very common characters, frequency analysis won't get us anywhere. The most extreme example of this is a pattern like eeeeeeeeeeeeeeee. Fortunately, this case is wholly determined by the pattern, so we can actually implement the heuristic.
A third case is if the pattern is sufficiently long. The idea here is that once the pattern gets long enough the Tuned Boyer-Moore skip loop will start making strides long enough to beat the asm deep magic that is memchr.
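A heuristic of this shape could be sketched as follows. Everything concrete in it is an assumption made for illustration: the `toy_freq_rank` function stands in for a real byte-frequency table, and the constants are invented, not taken from the crate. The sketch only shows the two ideas that can be decided from the pattern alone: require every byte to be common, and relax that requirement as the pattern gets longer.

```rust
use std::cmp;

/// Hypothetical frequency rank: higher means the byte is more common.
/// A real implementation would use a measured 256-entry table.
fn toy_freq_rank(b: u8) -> usize {
    match b {
        b' ' | b'e' | b't' | b'a' | b'o' | b'i' | b'n' => 255,
        b'a'..=b'z' => 180,
        _ => 50,
    }
}

/// Hypothetical version of the heuristic; all constants are illustrative.
fn should_use_sketch(pattern: &[u8]) -> bool {
    const MIN_LEN: usize = 9; // very short patterns never qualify
    const MIN_CUTOFF: usize = 150; // the rank cutoff never drops below this
    const MAX_CUTOFF: usize = 255; // cutoff for the shortest patterns
    const LEN_SCALE: usize = 4; // how quickly length lowers the cutoff

    // Longer patterns lower the cutoff, so they qualify more easily.
    let scaled = pattern.len().saturating_mul(LEN_SCALE);
    let cutoff = cmp::max(MIN_CUTOFF, MAX_CUTOFF - cmp::min(MAX_CUTOFF, scaled));

    // Only use Boyer-Moore when every byte is too common for a
    // rare-byte search to pay off.
    pattern.len() > MIN_LEN && pattern.iter().all(|&b| toy_freq_rank(b) >= cutoff)
}
```

With these made-up numbers, sixteen `e` bytes qualify, while a short pattern or one made of rare bytes does not.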
fn check_match(&self, haystack: &[u8], window_end: usize) -> bool
Check to see if there is a match at the given position.
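As a rough sketch of a match check with a guard test, the free function below takes the relevant fields as parameters. It assumes that `window_end` is the haystack index aligned with the last pattern byte, and that `guard_reverse_idx` counts backwards from that byte; both are readings of the field descriptions, not confirmed behaviour. The name `check_match_sketch` is hypothetical.

```rust
/// Hypothetical guard-then-compare check for the window ending at `window_end`.
fn check_match_sketch(
    pattern: &[u8],
    guard: u8,
    guard_reverse_idx: usize,
    haystack: &[u8],
    window_end: usize,
) -> bool {
    // The window must fit entirely inside the haystack.
    if pattern.is_empty()
        || window_end >= haystack.len()
        || window_end + 1 < pattern.len()
    {
        return false;
    }
    // Guard test: one cheap comparison on the rarest pattern byte.
    // Assumes guard_reverse_idx < pattern.len(), so this cannot underflow.
    if haystack[window_end - guard_reverse_idx] != guard {
        return false;
    }
    // Full comparison of the candidate window against the pattern.
    let window_start = window_end + 1 - pattern.len();
    &haystack[window_start..=window_end] == pattern
}
```

The guard pays off when the rarest byte rejects most candidate windows before the full comparison runs.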
fn skip_loop(
&self,
haystack: &[u8],
window_end: usize,
backstop: usize
) -> Option<usize>
Skip forward according to the shift table.
Returns the offset of the next occurrence of the last char in the pattern, or None if it never reappears. If skip_loop hits the backstop it will leave early.
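A simplified, non-unrolled sketch of such a loop is shown below. It assumes a 256-entry skip table in which the last pattern byte maps to 0, and it assumes that leaving early at the backstop is reported as `None`; the real method may signal that case differently. The name `skip_loop_sketch` is hypothetical, and the table is passed in explicitly instead of being read from `self`.

```rust
/// Hypothetical skip loop: advance `window_end` by table-driven jumps until
/// it lands on the last pattern byte (skip of 0) or reaches the backstop.
fn skip_loop_sketch(
    skip_table: &[usize],
    haystack: &[u8],
    mut window_end: usize,
    backstop: usize,
) -> Option<usize> {
    // Never read past the haystack, even if the caller's backstop is larger.
    let limit = backstop.min(haystack.len());
    while window_end < limit {
        let skip = skip_table[haystack[window_end] as usize];
        if skip == 0 {
            // The byte under the window end is the last pattern byte.
            return Some(window_end);
        }
        window_end += skip;
    }
    // Ran off the end or hit the backstop (assumed to be reported as None).
    None
}
```

A caller would typically follow a `Some(window_end)` result with a full match check at that position.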
fn compile_skip_table(pattern: &[u8]) -> Vec<usize>
Compute the ufast skip table.
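One plausible way to build such a table is sketched below, assuming a 256-entry table indexed by byte value in which the last pattern byte maps to a skip of 0 so that the skip loop stops on it. The name `compile_skip_table_sketch` is hypothetical.

```rust
/// Hypothetical skip-table construction for a byte pattern.
fn compile_skip_table_sketch(pattern: &[u8]) -> Vec<usize> {
    // Bytes that never occur in the pattern allow a jump of the full length.
    let mut table = vec![pattern.len(); 256];
    // Later occurrences overwrite earlier ones, so each byte ends up with
    // the distance from its rightmost occurrence to the end of the pattern.
    for (i, &b) in pattern.iter().enumerate() {
        table[b as usize] = pattern.len() - 1 - i;
    }
    table
}
```

For the pattern `abc` this yields skips of 2 for `a`, 1 for `b`, 0 for `c`, and 3 for every other byte.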
fn select_guard(pattern: &[u8]) -> (u8, usize)
Select the guard character based off of the precomputed frequency table.
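Guard selection might look roughly like the sketch below. The `toy_byte_rank` function is a made-up stand-in for the precomputed frequency table, and the reverse index is assumed to be the distance from the last pattern byte back to the guard. The sketch also assumes a non-empty pattern.

```rust
/// Hypothetical frequency rank: lower means the byte is rarer.
fn toy_byte_rank(b: u8) -> usize {
    match b {
        b' ' | b'e' | b't' | b'a' | b'o' | b'i' | b'n' => 255,
        b'a'..=b'z' => 180,
        _ => 50,
    }
}

/// Hypothetical guard selection: returns the rarest byte in the pattern and
/// its distance from the end of the pattern. Panics on an empty pattern.
fn select_guard_sketch(pattern: &[u8]) -> (u8, usize) {
    let mut rarest = pattern[0];
    let mut rarest_rev_idx = pattern.len() - 1;
    for (i, &b) in pattern.iter().enumerate() {
        // Strict comparison keeps the earliest of equally rare bytes.
        if toy_byte_rank(b) < toy_byte_rank(rarest) {
            rarest = b;
            rarest_rev_idx = pattern.len() - 1 - i;
        }
    }
    (rarest, rarest_rev_idx)
}
```

Under this toy ranking, the pattern `tea#time` selects `#` as the guard, four bytes back from the end.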
fn compile_md2_shift(pattern: &[u8]) -> usize
If there is another occurrence of the skip char, shift to it, otherwise just shift to the next window.
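The rule described above can be sketched as a small function. This is an illustration under assumptions, not the crate's code: the name `md2_shift_sketch` is hypothetical, and returning 1 for an empty pattern is an arbitrary choice made so that a caller would always advance.

```rust
/// Hypothetical mini delta2 shift: how far to move the window after a failed
/// match when the window end is aligned with the last pattern byte.
fn md2_shift_sketch(pattern: &[u8]) -> usize {
    let (&last, rest) = match pattern.split_last() {
        Some(parts) => parts,
        None => return 1, // empty pattern: arbitrary choice for this sketch
    };
    // Look for the rightmost earlier occurrence of the last byte.
    match rest.iter().rposition(|&b| b == last) {
        // Shift so that the earlier occurrence lines up with the window end.
        Some(i) => rest.len() - i,
        // No other occurrence: move to the next non-overlapping window.
        None => pattern.len(),
    }
}
```

For `abcab` the last byte `b` also occurs at index 1, so the shift is 3; for `abc` there is no earlier `c`, so the shift is the full pattern length of 3.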
fn approximate_size(&self) -> usize
Trait Implementations
impl Clone for BoyerMooreSearch
fn clone(&self) -> BoyerMooreSearch
Returns a copy of the value.
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for BoyerMooreSearch
Auto Trait Implementations
impl Send for BoyerMooreSearch
impl Sync for BoyerMooreSearch