Introduction
The Knuth-Morris-Pratt (KMP) algorithm is a classic string matching algorithm used to find occurrences of a pattern within a text string. The naive approach to string matching involves comparing each character one by one, which has a worst-case time complexity of O(n*m), where n is the length of the main string and m is the length of the pattern.
Core Idea
The main idea of KMP is to avoid unnecessary comparisons by utilizing information from previous matches. When we encounter a mismatch, instead of starting over from the next character in the main string, we can use a precomputed "next" array to skip some characters and continue matching from an optimal position.
Understanding the Next Array
Let's understand the concept with an example. Suppose we have:
Main string: "abababca"
Substring: "ababca"
Next array: [0, 0, 1, 2, 0, 1]
Step-by-Step Matching Process
- Compare main[0] 'a' with substring[0] 'a' → match, advance both pointers
- Compare main[1] 'b' with substring[1] 'b' → match, advance both pointers
- Compare main[2] 'a' with substring[2] 'a' → match, advance both pointers
- Compare main[3] 'b' with substring[3] 'b' → match, advance both pointers
- Compare main[4] 'a' with substring[4] 'c' → mismatch
- Use next[3] = 2 to jump back to substring[2] and continue matching
- Continue matching until substring is fully matched
Building the Next Array
The next array is constructed by analyzing the substring's prefixes and suffixes:
Prefix: All substrings starting from the first character (excluding the entire string)
Suffix: All substrings ending at the last character (excluding the entire string)
Example: Substring "ababca"
1. Position 0 ("a"):
Prefixes: ∅
Suffixes: ∅
Next[0] = 0
2. Position 1 ("ab"):
Prefixes: ["a"]
Suffixes: ["b"]
Next[1] = 0
3. Position 2 ("aba"):
Prefixes: ["a", "ab"]
Suffixes: ["ba", "a"]
Common: "a"
Next[2] = 1
4. Position 3 ("abab"):
Prefixes: ["a", "ab", "aba"]
Suffixes: ["bab", "ab", "b"]
Common: "ab"
Next[3] = 2
5. Position 4 ("ababc"):
Prefixes: ["a", "ab", "aba", "abab"]
Suffixes: ["babc", "abc", "bc", "c"]
No common elements
Next[4] = 0
6. Position 5 ("ababca"):
Prefixes: ["a", "ab", "aba", "abab", "ababc"]
Suffixes: ["babca", "abca", "bca", "ca", "a"]
Common: "a"
Next[5] = 1
Implementation
Building the Next Array
def build_kmp_table(substring): n = len(substring) next = [0] * n # Initialize the table with all zeros j = 0 # Length of the previous longest prefix-suffix for i in range(1, n): # If mismatch, backtrack to the previous prefix while j > 0 and substring[i] != substring[j]: j = next[j - 1] # If match, extend the current prefix if substring[i] == substring[j]: j += 1 next[i] = j # Update the table return next
KMP Search Implementation
def kmp_search(text, substring): if not substring: return 0 # Empty substring is always found at index 0 next = build_kmp_table(substring) j = 0 # Pointer for substring for i in range(len(text)): # Pointer for text # If mismatch, follow the 'next' table to shift substring while j > 0 and text[i] != substring[j]: j = next[j - 1] # If match, move both pointers forward if text[i] == substring[j]: j += 1 # If we matched the entire substring if j == len(substring): return i - j + 1 # Match found at this index return -1 # No match found
Time Complexity
The KMP algorithm has a time complexity of O(n + m), where:
n is the length of the main string
m is the length of the substring
This is a significant improvement over the naive approach's O(n*m) complexity.
Conclusion
The KMP algorithm is a powerful tool for string matching that provides efficient solutions to pattern matching problems. By utilizing the next array to avoid unnecessary comparisons, it achieves linear time complexity, making it much more efficient than the naive approach for large texts and patterns.