String Matching - Intuition Guide¶
The Mental Model: Preprocessing for Speed¶
Imagine searching for a word in a book: - Naive approach: Start at page 1, check every position for your word - Smart approach: Build an index first, then use it to skip irrelevant sections
String Matching algorithms preprocess the pattern (or text) to enable faster searching.
Why Does Naive Search Fail?¶
Consider searching for "AAAB" in "AAAAAAB":
Text: A A A A A A B
Pattern: A A A B
β β β β (mismatch at position 3)
Naive: Shift by 1, restart from scratch
A A A B
β start over
KMP: We matched "AAA" before failing. Use that information!
The failure function tells us: shift to align with longest proper prefix
Key insight: Don't throw away match information when you fail.
KMP: The Failure Function Intuition¶
What is the Failure Function?¶
For pattern "AABAACAAB":
Index: 0 1 2 3 4 5 6 7 8
Pattern: A A B A A C A A B
LPS: 0 1 0 1 2 0 1 2 3
β β β β β β β
β ββA β β β β ββAAB prefix matches AAB suffix
β β ββAA β ββAA
ββno proper prefix = suffix
LPS[i] = length of the longest proper prefix of pattern[0..i] that's also a suffix.
Why Does This Help?¶
When we mismatch at position j, we've already matched pattern[0..j-1]:
Text: ... X X X A A B A A C A A X ...
Pattern: A A B A A C A A B
β mismatch at j=8
We matched "AABAACAA" before failing.
LPS[7] = 2 means: "AA" is both prefix and suffix
So we can shift to align the prefix "AA" where the suffix was:
Text: ... X X X A A B A A C A A X ...
Pattern: A A B A A C A A B
β continue from j=2
No need to re-examine characters we've already matched!
Rabin-Karp: The Rolling Hash Intuition¶
The Basic Idea¶
Instead of comparing characters, compare fingerprints (hashes).
Pattern: "ABC" β hash = 65*256Β² + 66*256 + 67 = 4276803
Text: "XYZABCDEF"
Window 1: "XYZ" β hash β pattern_hash
Window 2: "YZA" β hash β pattern_hash
Window 3: "ZAB" β hash β pattern_hash
Window 4: "ABC" β hash == pattern_hash β verify!
Why "Rolling"?¶
Computing hash from scratch each window = O(m) per window = O(nm) total.
Rolling hash updates in O(1):
hash("YZA") = (hash("XYZ") - 'X' * 256Β²) * 256 + 'A'
= (prev_hash - leftmost_char * highest_power) * base + new_char
Intuition: It's like a conveyor belt. Push new char in, old char falls out.
Pattern 1: Find Needle in Haystack (LC 28) - Base Case¶
The canonical string matching problem:
# KMP approach
def strStr(haystack, needle):
# 1. Build failure function for needle
# 2. Match haystack against needle using failure function
# Rabin-Karp approach
def strStr(haystack, needle):
# 1. Compute hash of needle
# 2. Slide window, update hash in O(1)
# 3. On hash match, verify characters
When to use which? - KMP: Guaranteed O(n+m), no hash collisions - Rabin-Karp: Better when searching for multiple patterns
Pattern 2: Shortest Palindrome (LC 214) - Concatenation Trick¶
Problem: Add minimum characters to front to make string a palindrome.
Key insight: Find longest palindrome prefix, then prepend the reverse of what's left.
s = "aacecaaa"
Longest palindrome prefix: "aacecaa" (7 chars)
Remaining: "a"
Answer: reverse("a") + s = "a" + "aacecaaa" = "aaacecaaa"
The KMP trick: Compute LPS of s + '#' + reverse(s)
Why? The LPS at the end tells us how much of the original string's prefix matches its reversed suffix - exactly the palindrome prefix length!
Pattern 3: Repeated Substring (LC 459) - Period Detection¶
Problem: Check if string is made by repeating a smaller substring.
The beautiful insight:
s = "abcabc"
Period formula: period = len(s) - LPS[-1] = 6 - 3 = 3
If len(s) % period == 0, the string is periodic!
Check: 6 % 3 == 0 β, so "abc" repeats twice
Why does this work?
If LPS[-1] > 0, there's overlap between prefix and suffix:
String: a b c | a b c
Prefix: a b c | (length 3)
Suffix: | a b c (length 3)
Overlap at LPS[-1] = 3
The non-overlapping part (length = n - LPS[-1] = 3) is the minimal period!
Pattern 4: Longest Happy Prefix (LC 1392) - Direct LPS¶
This is the most direct application of the failure function:
s = "level"
LPS array: [0, 0, 0, 0, 1]
l e v e l
ββ 'l' = prefix[0], so LPS[4] = 1
Answer: s[0:LPS[-1]] = s[0:1] = "l"
The last value of the LPS array directly gives us the answer!
The Core Templates¶
KMP Failure Function¶
def compute_lps(pattern):
n = len(pattern)
lps = [0] * n
length = 0 # length of previous longest prefix suffix
i = 1
while i < n:
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
else:
if length != 0:
length = lps[length - 1] # Key: fall back, don't increment i
else:
lps[i] = 0
i += 1
return lps
Rolling Hash¶
def rolling_hash(text, m):
BASE = 256
MOD = 10**9 + 7
h = 0
highest_power = pow(BASE, m - 1, MOD)
for i in range(m):
h = (h * BASE + ord(text[i])) % MOD
yield 0, h # (start_index, hash)
for i in range(m, len(text)):
# Remove leftmost, add rightmost
h = (h - ord(text[i - m]) * highest_power) % MOD
h = (h * BASE + ord(text[i])) % MOD
yield i - m + 1, h
Pattern Recognition Checklist¶
| Question | If Yes β String Matching |
|---|---|
| Finding substring in larger string? | β |
| Need to know if prefix equals suffix? | β |
| Checking for periodic/repeating patterns? | β |
| Need palindrome information efficiently? | β |
Variant Recognition¶
| Clue | Technique |
|---|---|
| "Find first occurrence" | KMP or Rabin-Karp |
| "Count occurrences" | KMP with counter |
| "Shortest palindrome prefix" | KMP on s + '#' + rev(s) |
| "Is string periodic?" | LPS[-1] and divisibility |
| "Longest prefix = suffix" | Direct LPS[-1] |
| "Multiple pattern search" | Rabin-Karp or Aho-Corasick |
Common Pitfalls¶
- Off-by-one in LPS: LPS[i] is for substring [0..i], not [0..i-1]
- Forgetting separator: When using
s + rev(s), add '#' to prevent false matches - Hash collisions: Always verify on hash match in Rabin-Karp
- Modular arithmetic: Use
% MODafter every operation to prevent overflow - Period formula: It's
n - LPS[-1], notLPS[-1]
Complexity¶
| Algorithm | Time | Space |
|---|---|---|
| KMP | O(n + m) | O(m) |
| Rabin-Karp | O(n + m) average | O(1) |
| Naive | O(nm) | O(1) |
Where n = text length, m = pattern length.
The Power of Preprocessing¶
The fundamental insight of string matching: invest O(m) preprocessing to save O(nm) searching.
This preprocessing vs. querying trade-off appears throughout algorithm design: - Binary Search: O(n log n) sort for O(log n) queries - Segment Tree: O(n) build for O(log n) range queries - KMP: O(m) LPS computation for O(n) pattern matching