Pattern Matching | LIZIU DSA

Why Should You Care About Pattern Matching?

Pattern matching is everywhere in modern technology! Let's see where you encounter it daily:

📝

Text Editors & IDEs

When you press Ctrl+F to find text, or when your IDE highlights syntax errors, that's pattern matching in action! VS Code uses Boyer-Moore for fast text search.

🧬

DNA Sequencing

Scientists use pattern matching to find specific gene sequences in DNA strands. A single human genome has 3 billion base pairs - imagine finding a specific sequence without efficient algorithms!

🔎

Search Engines

Google processes billions of searches daily. They use advanced pattern matching to find your query in trillions of web pages. Without efficient algorithms, searches would take forever!

🛡️

Security & Antivirus

Antivirus software scans files for known virus signatures using pattern matching. They need to check millions of patterns quickly without slowing down your computer.

📚

Plagiarism Detection

Tools like Turnitin use sophisticated pattern matching to compare submitted work against billions of documents to detect potential plagiarism.

The Basics - Let's Start Simple!

What is Pattern Matching?

Imagine you're reading a book and looking for every occurrence of the word "magic". Pattern matching is teaching a computer to do exactly that - find a specific pattern (like "magic") in a larger text (like your book).

Try It: Naive Pattern Matching

See how the simplest algorithm checks every position:

Your First Pattern Matching Algorithm 🎯

# The Naive Approach - Simple but it works!
def find_pattern_simple(text, pattern):
    # Think of it like sliding a ruler along a page
    matches = []
    
    # Try every starting position in the text
    for start in range(len(text) - len(pattern) + 1):
        # Check if pattern matches at this position
        found = True
        for i in range(len(pattern)):
            if text[start + i] != pattern[i]:
                found = False
                break  # No match, try next position
        
        if found:
            matches.append(start)  # Found it! 🎉
    
    return matches

# Let's try it!
text = "The cat in the hat sat on the mat"
pattern = "at"
positions = find_pattern_simple(text, pattern)
print(f"Found '{pattern}' at positions: {positions}")
# Output: Found 'at' at positions: [5, 16, 24, 34]

How It Works - Step by Step

1

Start at the beginning: Place the pattern at position 0 of the text

2

Compare characters: Check each character of the pattern with the text

3

Match or Move: If all match, record position. Either way, slide pattern one position right

4

Repeat: Continue until the pattern can't fit in remaining text

Common Mistake: Off-by-One Error

# ❌ WRONG - This will miss the last possible position!
for start in range(len(text) - len(pattern)):  # Missing + 1
    # ... rest of code

Why it's wrong: If text="ABC" and pattern="BC", we need to check position 1, but range(3-2) only gives us position 0!

The Right Way

# ✅ CORRECT - Includes all valid positions
for start in range(len(text) - len(pattern) + 1):
    # Now we check all positions where pattern can fit

Common Pattern Matching Algorithms

Intermediate

KMP Algorithm - The Smart Skipper

KMP (Knuth-Morris-Pratt) is like having a smart bookmark that remembers partial matches!

Interactive KMP Visualizer

See how KMP builds its "failure function" to skip unnecessary comparisons:

KMP Algorithm - Skip the Redundant Checks! 🏃‍♂️

# KMP - It's like having a smart assistant that remembers!
def build_lps_array(pattern):
    """LPS = Longest Proper Prefix which is also Suffix
    It tells us: 'If we fail at position i, where should we restart?'"""
    
    lps = [0] * len(pattern)
    length = 0  # Length of previous longest prefix suffix
    i = 1
    
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            # We found a match! Extend the prefix
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                # Try shorter prefix
                length = lps[length - 1]
            else:
                # No prefix matches
                lps[i] = 0
                i += 1
    
    return lps

def kmp_search(text, pattern):
    """KMP - Smart pattern matching that never goes backward!"""
    lps = build_lps_array(pattern)
    matches = []
    
    text_idx = 0
    pattern_idx = 0
    
    while text_idx < len(text):
        if text[text_idx] == pattern[pattern_idx]:
            # Characters match! Move both forward
            text_idx += 1
            pattern_idx += 1
        
        if pattern_idx == len(pattern):
            # Found complete match! 🎉
            matches.append(text_idx - pattern_idx)
            pattern_idx = lps[pattern_idx - 1]
        elif text_idx < len(text) and text[text_idx] != pattern[pattern_idx]:
            # Mismatch! Use LPS to skip smartly
            if pattern_idx != 0:
                pattern_idx = lps[pattern_idx - 1]
            else:
                text_idx += 1
    
    return matches

# Example: Finding "ABAB" in text
text = "ABABDABACDABABCABAB"
pattern = "ABAB"
print(f"KMP found matches at: {kmp_search(text, pattern)}")

Intermediate

Rabin-Karp - The Hash Detective

Instead of comparing characters, Rabin-Karp uses math! It calculates a "fingerprint" (hash) for the pattern and checks if any part of the text has the same fingerprint.

Rolling Hash Calculator

See how the hash "rolls" as we slide the window:

Rabin-Karp - Pattern Matching with Math! 🔢

# Rabin-Karp - Like comparing fingerprints instead of faces!
def rabin_karp(text, pattern, prime=101):
    """Uses rolling hash for fast pattern matching"""
    m = len(pattern)
    n = len(text)
    d = 256  # Number of characters in alphabet
    pattern_hash = 0
    text_hash = 0
    h = 1
    matches = []
    
    # Calculate h = d^(m-1) % prime
    for i in range(m - 1):
        h = (h * d) % prime
    
    # Calculate initial hash values
    for i in range(m):
        pattern_hash = (d * pattern_hash + ord(pattern[i])) % prime
        text_hash = (d * text_hash + ord(text[i])) % prime
    
    # Slide the pattern over text
    for i in range(n - m + 1):
        # If hash values match, double-check character by character
        if pattern_hash == text_hash:
            if text[i:i+m] == pattern:
                matches.append(i)  # Found it! 🎯
        
        # Roll the hash to next window
        if i < n - m:
            # Remove leading character, add trailing character
            text_hash = (d * (text_hash - ord(text[i]) * h) + 
                        ord(text[i + m])) % prime
            
            # Handle negative hash value
            if text_hash < 0:
                text_hash += prime
    
    return matches

# Pro tip: Great for finding multiple patterns!
text = "The quick brown fox jumps over the lazy dog"
pattern = "the"
print(f"Found '{pattern}' at: {rabin_karp(text.lower(), pattern)}")

Advanced

Boyer-Moore - The Backwards Genius

Boyer-Moore reads the pattern backwards! When it finds a mismatch, it can skip many positions at once.

Boyer-Moore - Skip Like a Pro! ⚡

# Boyer-Moore - Why check every character when you can skip?
def boyer_moore_simplified(text, pattern):
    """Simplified Boyer-Moore with bad character rule"""
    
    # Build bad character table
    # It tells us: "If we see character X, how far can we skip?"
    bad_char = {}
    for i in range(len(pattern)):
        bad_char[pattern[i]] = i
    
    matches = []
    shift = 0
    
    while shift <= len(text) - len(pattern):
        j = len(pattern) - 1  # Start from the end!
        
        # Compare backwards
        while j >= 0 and pattern[j] == text[shift + j]:
            j -= 1
        
        if j < 0:
            # Found a match! 🎉
            matches.append(shift)
            shift += len(pattern)
        else:
            # Mismatch - skip smartly!
            bad_char_shift = j - bad_char.get(text[shift + j], -1)
            shift += max(1, bad_char_shift)
    
    return matches

# Boyer-Moore shines with long patterns!
text = "HERE IS A SIMPLE EXAMPLE"
pattern = "EXAMPLE"
print(f"Boyer-Moore found: {boyer_moore_simplified(text, pattern)}")

Practice Problems

Beginner

Problem 1: Count Pattern Occurrences

Write a function that counts how many times a pattern appears in a text (including overlapping occurrences).

Think About It First!

If text = "AAAA" and pattern = "AA", how many occurrences are there?

Hint: Don't skip ahead after finding a match!

Show Solution

def count_overlapping(text, pattern):
    count = 0
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i+len(pattern)] == pattern:
            count += 1
    return count

# Test: "AAAA" has 3 overlapping "AA"s
print(count_overlapping("AAAA", "AA"))  # Output: 3

Intermediate

Problem 2: Find All Anagrams

Find all starting positions in text where an anagram of the pattern exists.

Think About It First!

An anagram has the same characters but in any order. How can you check this efficiently?

Show Solution

from collections import Counter

def find_anagrams(text, pattern):
    result = []
    pattern_count = Counter(pattern)
    window_count = Counter(text[:len(pattern)])
    
    # Check first window
    if window_count == pattern_count:
        result.append(0)
    
    # Slide the window
    for i in range(len(pattern), len(text)):
        # Add new character
        window_count[text[i]] += 1
        
        # Remove old character
        old_char = text[i - len(pattern)]
        window_count[old_char] -= 1
        if window_count[old_char] == 0:
            del window_count[old_char]
        
        # Check if current window is anagram
        if window_count == pattern_count:
            result.append(i - len(pattern) + 1)
    
    return result

# Test: Find anagrams of "abc"
print(find_anagrams("cbaebabacd", "abc"))  # Output: [0, 6]

Advanced

Problem 3: Longest Repeating Substring

Find the longest substring that appears at least twice in the text.

Think About It First!

You'll need to check all possible substrings. How can you optimize this?

Show Solution

def longest_repeating_substring(text):
    n = len(text)
    longest = ""
    
    # Try all possible lengths, starting from longest
    for length in range(n - 1, 0, -1):
        seen = set()
        for i in range(n - length + 1):
            substring = text[i:i + length]
            if substring in seen:
                return substring  # Found repeating substring!
            seen.add(substring)
    
    return longest

# Test
print(longest_repeating_substring("banana"))  # Output: "ana"
print(longest_repeating_substring("abcdefg"))  # Output: ""

Advanced Algorithms

Advanced

Z-Algorithm - The Prefix Master

The Z-algorithm creates an array where Z[i] tells you the length of the longest substring starting from position i that matches a prefix of the string.

Z-Algorithm - Linear Time Magic! ✨

def z_algorithm(s):
    """Build Z array in linear time"""
    n = len(s)
    z = [0] * n
    l, r = 0, 0  # Z-box boundaries
    
    for i in range(1, n):
        if i <= r:
            # We're inside a Z-box, use previous values
            z[i] = min(r - i + 1, z[i - l])
        
        # Try to extend the match
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        
        # Update Z-box if we found a longer match
        if i + z[i] - 1 > r:
            l, r = i, i + z[i] - 1
    
    return z

def z_pattern_search(text, pattern):
    """Use Z-algorithm for pattern matching"""
    # Concatenate pattern + separator + text
    combined = pattern + "$" + text
    z = z_algorithm(combined)
    
    matches = []
    pattern_len = len(pattern)
    
    # Check where Z value equals pattern length
    for i in range(pattern_len + 1, len(combined)):
        if z[i] == pattern_len:
            matches.append(i - pattern_len - 1)
    
    return matches

Advanced

Aho-Corasick - Multiple Pattern Search Champion

When you need to find multiple patterns at once (like virus signatures or banned words), Aho-Corasick builds a trie of all patterns and searches them simultaneously!

Aho-Corasick - Find All Patterns at Once! 🎯

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False
        self.patterns = []  # Patterns ending at this node

def find_multiple_patterns(text, patterns):
    """Simple version of Aho-Corasick for multiple patterns"""
    
    # Build trie from patterns
    root = TrieNode()
    for pattern in patterns:
        node = root
        for char in pattern:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end = True
        node.patterns.append(pattern)
    
    # Search all patterns simultaneously
    results = {pattern: [] for pattern in patterns}
    
    for start in range(len(text)):
        node = root
        for i in range(start, len(text)):
            if text[i] not in node.children:
                break
            node = node.children[text[i]]
            if node.is_end:
                for pattern in node.patterns:
                    results[pattern].append(start)
    
    return results

# Find multiple patterns in one pass!
text = "ushers"
patterns = ["he", "she", "hers"]
matches = find_multiple_patterns(text, patterns)
print(f"Found: {matches}")
# Output: {'he': [2], 'she': [1], 'hers': [2]}

Quick Reference

Algorithm Comparison Chart

Algorithm	Time Complexity	When to Use	Plain English Speed
`Naive`	O(n × m)	Short patterns, simple cases	🐢 Slow but simple
`KMP`	O(n + m)	Need guaranteed linear time	⚡ Fast and consistent
`Rabin-Karp`	O(n + m) average	Multiple pattern search	🎲 Usually fast
`Boyer-Moore`	O(n/m) best case	Long patterns, large alphabet	🚀 Super fast for long patterns
`Z-Algorithm`	O(n + m)	Need prefix matches	⚡ Linear and elegant
`Aho-Corasick`	O(n + m + z)	Multiple patterns at once	💫 Find everything in one pass

Use Naive When...

Pattern is very short (< 10 chars)
Text is small
Simplicity matters most
One-time search

Use KMP When...

Pattern has repetitive parts
Need guaranteed O(n) time
Streaming data
Can't go backwards

Use Boyer-Moore When...

Pattern is long
Large alphabet (like English)
Speed is critical
Text is very long

Use Rabin-Karp When...

Searching multiple patterns
Pattern matching in 2D
Plagiarism detection
Rolling hash useful

Common Pitfalls to Avoid

Pitfall #1: Hash Collisions in Rabin-Karp

Always double-check matches character by character when hashes match!

Pitfall #2: Off-by-One in Loop Bounds

Remember: range(len(text) - len(pattern) + 1) not just - len(pattern)

Pitfall #3: Not Handling Empty Patterns

Always check if pattern is empty before searching!