Knuth-Morris-Pratt algorithm for O(N) substring search
The Knuth-Morris-Pratt substring algorithm is an algorithm
to check if a superstring S
of length N contains a substring W
(word)
in O(N) time. It does this by preprocessing the substring W
to avoid
rechecking parts of string S
we've already read. How does it do this?
Let's go over it, starting from a naive substring algorithm and
optimizing it to get to Knuth-Morris-Pratt.
ToC
- Naive substring algorithm
- How to take advantage of already looked at data
- Generating a prefix fallback structure
Naive substring algorithm
A naive/obvious substring algorithm is to check every possible
substring of S
to see if it equals W
. For example, take
superstring S
"babababababababooie" and substring W
"bababooie".
Our naive algorithm would check:
- Does
S[0:9] == W
? No. (9 is the length ofW
) - Does
S[1:10] == W
? No. - Does
S[2:11] == W
? No. - ...
Until we get to the end of the string and find "bababooie".
This algorithm is inefficient, but it might not be obvious why. To understand, let's break down what each step would be doing more specifically.
Inefficiency with repeated string equality checks
To think about what a string equality check is doing, let's take our first step from above:
- Does
S[0:9] == W
?
This would check
- Does
S[0] (b) == W[0] (b)
? Yes! - Does
S[1] (a) == W[1] (a)
? Yes! - Does
S[2] (b) == W[2] (b)
? Yes! - Does
S[3] (a) == W[3] (a)
? Yes! - Does
S[4] (b) == W[4] (b)
? Yes! - Does
S[5] (a) == W[5] (o)
? No, stop.
This first check is completely efficient: there's no way we can understand if the strings are equal without examining the characters in them.
However, our second equality check
- Does
S[1:10] == W
?
Is not efficient. It is efficient in isolation, but not efficient given our previous check. This check will, like the first one,
keep examining characters of S
and W
until a mismatch. However,
we should hopefully be able to avoid re-examining characters we
already looked at in step 1.
How to take advantage of already looked at data
To understand how to optimize by avoiding repeated checks, let's
look at steps 1 through 3. Reminder that our superstring S
is "babababababababooie" and substring W
is "bababooie"
- Does
S[0:9] == W
? - Does
S[1:10] == W
? - Does
S[2:11] == W
?
For simplicity, let's say we will examine all of W
beforehand.
After performing step 1, with sub steps:
- Does
S[0] (b) == W[0] (b)
? Yes! - Does
S[1] (a) == W[1] (a)
? Yes! - Does
S[2] (b) == W[2] (b)
? Yes! - Does
S[3] (a) == W[3] (a)
? Yes! - Does
S[4] (b) == W[4] (b)
? Yes! - Does
S[5] (a) == W[5] (o)
? No, stop.
Our understanding of what S
and W
looks like is:
S: "bababa????????????"
W: "bababooie"
Our naive algorithm would take all this info, throw it away, and
start looking at S
again starting at index 1.
You may notice that, because our string is pretty repetitive, rather
than starting over, we could just shift our string W
along S
:
S: "bababa????????????"
W: >>"bababooie"
and continue checking from index 6 in S
rather than starting over.
In concrete terms, what we are doing is matching a
prefix of our substring W
to a suffix of the part of superstring S
that we have already examined.
This is what Knuth-Morris-Pratt does and how it avoids doing repeated
checks. However, to actually execute this, we will preprocess W
and compare parts of it to itself.
Designing a structure to allow shifting W
along S
While comparing the strings W
and S
, the part of W
we have matched
with S
is a prefix of W
.
After failing a match, we can retain our progress by finding
a suffix of the part of W
we have already matched that is
also a prefix of W
:
S: "bababa????????????"
^^^
W: "babab"
^^^
W: "bababooie"
Finding this suffix would normally be time consuming, but if we
preprocess W
, we can calculate and store, for each prefix P1
of
W, the longest prefix P2
of W
such that P2
is a suffix of P1
.
Let's call this structure we create to store, for each prefix P1
of
W, the longest prefix P2
of W
such that P2
is a suffix of P1
a prefix fallback. We would like to "fallback" to a shorter
prefix of W
rather than starting over from scratch when possible.
For "bababooie", we would want something like the following:
prefix_fallback("bababooie", "b") = ""
prefix_fallback("bababooie", "ba") = ""
prefix_fallback("bababooie", "bab") = "b"
^ ^
prefix_fallback("bababooie", "baba") = "ba"
^^ ^^
prefix_fallback("bababooie", "babab") = "bab"
^^^ ^^^
prefix_fallback("bababooie", "bababo") = ""
...
Then, when we mismatch between S
and W
when comparing characters, and currently have matched P1
to some part
of S
,
we can combine the facts that
- Prefix
P1
matches/aligns withS
. P2 = prefix_fallback(W, P1)
matches/aligns withP1
.
to transitively align P2
-> P1
-> S
without needing to reexamine characters. Looking again at our example from earlier:
P1 = "babab"
P2 = prefix_fallback("bababooie", "babab") = "bab"
^^^ ^^^
S : "babab|a????????????"
^^^
P1: "babab|ooie"
^^^
P2: "bab abooie"
Generating a prefix fallback structure
Repeating what we've said earlier, our prefix fallback will store,
for each prefix P1
of
W
, the longest prefix P2
of W
such that P2
is a suffix of P1
.
We can define how to find a fallback for a prefix recursively. Let's take a generic prefix P1
of W
of length x+1, and split it into two
sections of length x and length 1: P1=W[0:x]W[x]
. We can
determine the longest prefix P2
that is a suffix of P1
by looking
at the longest prefix that is a suffix of W[0:x]
and seeing if it is
followed by W[x]
.
Storing indices instead of strings
As an optimization, instead of storing strings in our prefix
fallback, we can notice that every prefix of W
can be identified
by its length. Instead of using strings to store prefixes,
just use their lengths instead. This also allows us to use
an array instead of a dictionary like structure.
# For string "bababooie"
prefix_fallback[1] = 0 # "b" -> ""
prefix_fallback[2] = 0 # "ba" -> ""
prefix_fallback[3] = 1 # "bab" -> "b"
prefix_fallback[4] = 2 # "baba" -> "ba"
prefix_fallback[5] = 3 # "babab" -> "bab"
prefix_fallback[6] = 0 # "bababo" -> ""
...
Building the prefix fallback structure iteratively
We generate our prefix fallback structure, now an array mapping prefix lengths to fallback prefix lengths, iteratively, starting from a prefix of length 1 and building up to the whole string.
def build_fallback_structure(w):
fallback_length_by_prefix_length = [NO_FALLBACK_FOUND] * len(w)
# The only suffix of a string of length 1 is a string of length 0.
fallback_length_by_prefix_length[1] = 0
for current_prefix_length in range(2, len(w)):
char = w[current_prefix_length - 1]
previous_prefix_length = current_prefix_length - 1
fallback_prefix_length = fallback_length_by_prefix_length[
previous_prefix_length
]
while fallback_prefix_length != NO_FALLBACK_FOUND:
char_after_fallback_prefix = w[fallback_prefix_length]
if char_after_fallback_prefix == char:
fallback_length_by_prefix_length[current_prefix_length] = (
fallback_prefix_length + 1
)
break
fallback_prefix_length = fallback_length_by_prefix_length[
fallback_prefix_length
]
else:
# If we fail to fall back to any prefix, we can use the prefix "", which
# essentially represents starting from scratch again on the search.
fallback_length_by_prefix_length[current_prefix_length] = 0
return fallback_length_by_prefix_length
We can then precompute and store this array to check if
W
is contained by a superstring S
in linear time as described earlier.
class KnuthMorrisPratt:
def __init__(self, w: str):
self.string = w
# List matching each prefix p1 to the next longest prefix/fallback p2 that is
# a suffix of p1.
self._fallback_length_by_prefix_length = build_fallback_structure(w)
def contained_by(self, s: str) -> bool:
"""Check if self.string is a substring of superstring."""
current_matched_prefix_length = 0
for superstring_char in s:
while current_matched_prefix_length != NO_FALLBACK_FOUND:
next_string_char = self.string[current_matched_prefix_length]
if next_string_char == superstring_char:
current_matched_prefix_length += 1
break
# Fall back to the next longest prefix that is a suffix of the current prefix.
current_matched_prefix_length = self._fallback_length_by_prefix_length[
current_matched_prefix_length
]
else:
current_matched_prefix_length = 0
if current_matched_prefix_length == len(self.string):
return True
return False
Code snippets are taken from here. These code snippets above make up the entirety of the algorithm.