KMP Algorithm Introduction

A comprehensive guide to understanding the Knuth-Morris-Pratt string matching algorithm

Introduction

The Knuth-Morris-Pratt (KMP) algorithm is a classic string matching algorithm used to find occurrences of a pattern within a text string. The naive approach to string matching involves comparing each character one by one, which has a worst-case time complexity of O(n*m), where n is the length of the main string and m is the length of the pattern.

Core Idea

The main idea of KMP is to avoid unnecessary comparisons by utilizing information from previous matches. When we encounter a mismatch, instead of starting over from the next character in the main string, we can use a precomputed "next" array to skip some characters and continue matching from an optimal position.

Understanding the Next Array

Let's understand the concept with an example. Suppose we have:

Main string: "abababca"

Substring: "ababca"

Next array: [0, 0, 1, 2, 0, 1]

Step-by-Step Matching Process

  1. Compare main[0] 'a' with substring[0] 'a' → match, advance both pointers
  2. Compare main[1] 'b' with substring[1] 'b' → match, advance both pointers
  3. Compare main[2] 'a' with substring[2] 'a' → match, advance both pointers
  4. Compare main[3] 'b' with substring[3] 'b' → match, advance both pointers
  5. Compare main[4] 'a' with substring[4] 'c' → mismatch
  6. Use next[3] = 2 to jump back to substring[2] and continue matching
  7. Continue matching until substring is fully matched

Building the Next Array

The next array is constructed by analyzing the substring's prefixes and suffixes:

Prefix: All substrings starting from the first character (excluding the entire string)

Suffix: All substrings ending at the last character (excluding the entire string)

Example: Substring "ababca"

1. Position 0 ("a"):

Prefixes: ∅

Suffixes: ∅

Next[0] = 0

2. Position 1 ("ab"):

Prefixes: ["a"]

Suffixes: ["b"]

Next[1] = 0

3. Position 2 ("aba"):

Prefixes: ["a", "ab"]

Suffixes: ["ba", "a"]

Common: "a"

Next[2] = 1

4. Position 3 ("abab"):

Prefixes: ["a", "ab", "aba"]

Suffixes: ["bab", "ab", "b"]

Common: "ab"

Next[3] = 2

5. Position 4 ("ababc"):

Prefixes: ["a", "ab", "aba", "abab"]

Suffixes: ["babc", "abc", "bc", "c"]

No common elements

Next[4] = 0

6. Position 5 ("ababca"):

Prefixes: ["a", "ab", "aba", "abab", "ababc"]

Suffixes: ["babca", "abca", "bca", "ca", "a"]

Common: "a"

Next[5] = 1

Implementation

Building the Next Array

def build_kmp_table(substring):
    n = len(substring)
    next = [0] * n  # Initialize the table with all zeros
    j = 0  # Length of the previous longest prefix-suffix
    
    for i in range(1, n):
        # If mismatch, backtrack to the previous prefix
        while j > 0 and substring[i] != substring[j]:
            j = next[j - 1]
        
        # If match, extend the current prefix
        if substring[i] == substring[j]:
            j += 1
        
        next[i] = j  # Update the table
    
    return next

KMP Search Implementation

def kmp_search(text, substring):
    if not substring:
        return 0  # Empty substring is always found at index 0
    
    next = build_kmp_table(substring)
    j = 0  # Pointer for substring
    
    for i in range(len(text)):  # Pointer for text
        # If mismatch, follow the 'next' table to shift substring
        while j > 0 and text[i] != substring[j]:
            j = next[j - 1]
        
        # If match, move both pointers forward
        if text[i] == substring[j]:
            j += 1
        
        # If we matched the entire substring
        if j == len(substring):
            return i - j + 1  # Match found at this index
    
    return -1  # No match found

Time Complexity

The KMP algorithm has a time complexity of O(n + m), where:

n is the length of the main string

m is the length of the substring

This is a significant improvement over the naive approach's O(n*m) complexity.

Conclusion

The KMP algorithm is a powerful tool for string matching that provides efficient solutions to pattern matching problems. By utilizing the next array to avoid unnecessary comparisons, it achieves linear time complexity, making it much more efficient than the naive approach for large texts and patterns.