Pattern Matching

Hard 25 min read

Pattern Matching Algorithms

Master the art of finding patterns in text! From simple search to advanced algorithms like KMP and Boyer-Moore, learn how search engines, DNA analyzers, and text editors find what they're looking for.

⏱️ 45-60 mins 📊 Intermediate Level 🎮 6 Interactive Demos 💡 Real-world Applications

Why Should You Care About Pattern Matching?

Pattern matching is everywhere in modern technology! Let's see where you encounter it daily:

📝

Text Editors & IDEs

When you press Ctrl+F to find text, or when your IDE highlights syntax errors, that's pattern matching in action! VS Code uses Boyer-Moore for fast text search.

🧬

DNA Sequencing

Scientists use pattern matching to find specific gene sequences in DNA strands. A single human genome has 3 billion base pairs - imagine finding a specific sequence without efficient algorithms!

🔎

Search Engines

Google processes billions of searches daily. They use advanced pattern matching to find your query in trillions of web pages. Without efficient algorithms, searches would take forever!

🛡️

Security & Antivirus

Antivirus software scans files for known virus signatures using pattern matching. They need to check millions of patterns quickly without slowing down your computer.

📚

Plagiarism Detection

Tools like Turnitin use sophisticated pattern matching to compare submitted work against billions of documents to detect potential plagiarism.

The Basics - Let's Start Simple!

What is Pattern Matching?

Imagine you're reading a book and looking for every occurrence of the word "magic". Pattern matching is teaching a computer to do exactly that - find a specific pattern (like "magic") in a larger text (like your book).

Try It: Naive Pattern Matching

See how the simplest algorithm checks every position:

Your First Pattern Matching Algorithm 🎯
# The Naive Approach - Simple but it works! def find_pattern_simple(text, pattern): # Think of it like sliding a ruler along a page matches = [] # Try every starting position in the text for start in range(len(text) - len(pattern) + 1): # Check if pattern matches at this position found = True for i in range(len(pattern)): if text[start + i] != pattern[i]: found = False break # No match, try next position if found: matches.append(start) # Found it! 🎉 return matches # Let's try it! text = "The cat in the hat sat on the mat" pattern = "at" positions = find_pattern_simple(text, pattern) print(f"Found '{pattern}' at positions: {positions}") # Output: Found 'at' at positions: [5, 16, 24, 34]

How It Works - Step by Step

1
Start at the beginning: Place the pattern at position 0 of the text
2
Compare characters: Check each character of the pattern with the text
3
Match or Move: If all match, record position. Either way, slide pattern one position right
4
Repeat: Continue until the pattern can't fit in remaining text

Common Mistake: Off-by-One Error

# ❌ WRONG - This will miss the last possible position! for start in range(len(text) - len(pattern)): # Missing + 1 # ... rest of code

Why it's wrong: If text="ABC" and pattern="BC", we need to check position 1, but range(3-2) only gives us position 0!

The Right Way

# ✅ CORRECT - Includes all valid positions for start in range(len(text) - len(pattern) + 1): # Now we check all positions where pattern can fit

Common Pattern Matching Algorithms

Intermediate

KMP Algorithm - The Smart Skipper

KMP (Knuth-Morris-Pratt) is like having a smart bookmark that remembers partial matches!

Interactive KMP Visualizer

See how KMP builds its "failure function" to skip unnecessary comparisons:

KMP Algorithm - Skip the Redundant Checks! 🏃‍♂️
# KMP - It's like having a smart assistant that remembers! def build_lps_array(pattern): """LPS = Longest Proper Prefix which is also Suffix It tells us: 'If we fail at position i, where should we restart?'""" lps = [0] * len(pattern) length = 0 # Length of previous longest prefix suffix i = 1 while i < len(pattern): if pattern[i] == pattern[length]: # We found a match! Extend the prefix length += 1 lps[i] = length i += 1 else: if length != 0: # Try shorter prefix length = lps[length - 1] else: # No prefix matches lps[i] = 0 i += 1 return lps def kmp_search(text, pattern): """KMP - Smart pattern matching that never goes backward!""" lps = build_lps_array(pattern) matches = [] text_idx = 0 pattern_idx = 0 while text_idx < len(text): if text[text_idx] == pattern[pattern_idx]: # Characters match! Move both forward text_idx += 1 pattern_idx += 1 if pattern_idx == len(pattern): # Found complete match! 🎉 matches.append(text_idx - pattern_idx) pattern_idx = lps[pattern_idx - 1] elif text_idx < len(text) and text[text_idx] != pattern[pattern_idx]: # Mismatch! Use LPS to skip smartly if pattern_idx != 0: pattern_idx = lps[pattern_idx - 1] else: text_idx += 1 return matches # Example: Finding "ABAB" in text text = "ABABDABACDABABCABAB" pattern = "ABAB" print(f"KMP found matches at: {kmp_search(text, pattern)}")
Intermediate

Rabin-Karp - The Hash Detective

Instead of comparing characters, Rabin-Karp uses math! It calculates a "fingerprint" (hash) for the pattern and checks if any part of the text has the same fingerprint.

Rolling Hash Calculator

See how the hash "rolls" as we slide the window:

Rabin-Karp - Pattern Matching with Math! 🔢
# Rabin-Karp - Like comparing fingerprints instead of faces! def rabin_karp(text, pattern, prime=101): """Uses rolling hash for fast pattern matching""" m = len(pattern) n = len(text) d = 256 # Number of characters in alphabet pattern_hash = 0 text_hash = 0 h = 1 matches = [] # Calculate h = d^(m-1) % prime for i in range(m - 1): h = (h * d) % prime # Calculate initial hash values for i in range(m): pattern_hash = (d * pattern_hash + ord(pattern[i])) % prime text_hash = (d * text_hash + ord(text[i])) % prime # Slide the pattern over text for i in range(n - m + 1): # If hash values match, double-check character by character if pattern_hash == text_hash: if text[i:i+m] == pattern: matches.append(i) # Found it! 🎯 # Roll the hash to next window if i < n - m: # Remove leading character, add trailing character text_hash = (d * (text_hash - ord(text[i]) * h) + ord(text[i + m])) % prime # Handle negative hash value if text_hash < 0: text_hash += prime return matches # Pro tip: Great for finding multiple patterns! text = "The quick brown fox jumps over the lazy dog" pattern = "the" print(f"Found '{pattern}' at: {rabin_karp(text.lower(), pattern)}")
Advanced

Boyer-Moore - The Backwards Genius

Boyer-Moore reads the pattern backwards! When it finds a mismatch, it can skip many positions at once.

Boyer-Moore - Skip Like a Pro! ⚡
# Boyer-Moore - Why check every character when you can skip? def boyer_moore_simplified(text, pattern): """Simplified Boyer-Moore with bad character rule""" # Build bad character table # It tells us: "If we see character X, how far can we skip?" bad_char = {} for i in range(len(pattern)): bad_char[pattern[i]] = i matches = [] shift = 0 while shift <= len(text) - len(pattern): j = len(pattern) - 1 # Start from the end! # Compare backwards while j >= 0 and pattern[j] == text[shift + j]: j -= 1 if j < 0: # Found a match! 🎉 matches.append(shift) shift += len(pattern) else: # Mismatch - skip smartly! bad_char_shift = j - bad_char.get(text[shift + j], -1) shift += max(1, bad_char_shift) return matches # Boyer-Moore shines with long patterns! text = "HERE IS A SIMPLE EXAMPLE" pattern = "EXAMPLE" print(f"Boyer-Moore found: {boyer_moore_simplified(text, pattern)}")

Practice Problems

Beginner

Problem 1: Count Pattern Occurrences

Write a function that counts how many times a pattern appears in a text (including overlapping occurrences).

Think About It First!

If text = "AAAA" and pattern = "AA", how many occurrences are there?

Hint: Don't skip ahead after finding a match!

Show Solution
def count_overlapping(text, pattern): count = 0 for i in range(len(text) - len(pattern) + 1): if text[i:i+len(pattern)] == pattern: count += 1 return count # Test: "AAAA" has 3 overlapping "AA"s print(count_overlapping("AAAA", "AA")) # Output: 3
Intermediate

Problem 2: Find All Anagrams

Find all starting positions in text where an anagram of the pattern exists.

Think About It First!

An anagram has the same characters but in any order. How can you check this efficiently?

Show Solution
from collections import Counter def find_anagrams(text, pattern): result = [] pattern_count = Counter(pattern) window_count = Counter(text[:len(pattern)]) # Check first window if window_count == pattern_count: result.append(0) # Slide the window for i in range(len(pattern), len(text)): # Add new character window_count[text[i]] += 1 # Remove old character old_char = text[i - len(pattern)] window_count[old_char] -= 1 if window_count[old_char] == 0: del window_count[old_char] # Check if current window is anagram if window_count == pattern_count: result.append(i - len(pattern) + 1) return result # Test: Find anagrams of "abc" print(find_anagrams("cbaebabacd", "abc")) # Output: [0, 6]
Advanced

Problem 3: Longest Repeating Substring

Find the longest substring that appears at least twice in the text.

Think About It First!

You'll need to check all possible substrings. How can you optimize this?

Show Solution
def longest_repeating_substring(text): n = len(text) longest = "" # Try all possible lengths, starting from longest for length in range(n - 1, 0, -1): seen = set() for i in range(n - length + 1): substring = text[i:i + length] if substring in seen: return substring # Found repeating substring! seen.add(substring) return longest # Test print(longest_repeating_substring("banana")) # Output: "ana" print(longest_repeating_substring("abcdefg")) # Output: ""

Advanced Algorithms

Advanced

Z-Algorithm - The Prefix Master

The Z-algorithm creates an array where Z[i] tells you the length of the longest substring starting from position i that matches a prefix of the string.

Z-Algorithm - Linear Time Magic! ✨
def z_algorithm(s): """Build Z array in linear time""" n = len(s) z = [0] * n l, r = 0, 0 # Z-box boundaries for i in range(1, n): if i <= r: # We're inside a Z-box, use previous values z[i] = min(r - i + 1, z[i - l]) # Try to extend the match while i + z[i] < n and s[z[i]] == s[i + z[i]]: z[i] += 1 # Update Z-box if we found a longer match if i + z[i] - 1 > r: l, r = i, i + z[i] - 1 return z def z_pattern_search(text, pattern): """Use Z-algorithm for pattern matching""" # Concatenate pattern + separator + text combined = pattern + "$" + text z = z_algorithm(combined) matches = [] pattern_len = len(pattern) # Check where Z value equals pattern length for i in range(pattern_len + 1, len(combined)): if z[i] == pattern_len: matches.append(i - pattern_len - 1) return matches
Advanced

Aho-Corasick - Multiple Pattern Search Champion

When you need to find multiple patterns at once (like virus signatures or banned words), Aho-Corasick builds a trie of all patterns and searches them simultaneously!

Aho-Corasick - Find All Patterns at Once! 🎯
class TrieNode: def __init__(self): self.children = {} self.is_end = False self.patterns = [] # Patterns ending at this node def find_multiple_patterns(text, patterns): """Simple version of Aho-Corasick for multiple patterns""" # Build trie from patterns root = TrieNode() for pattern in patterns: node = root for char in pattern: if char not in node.children: node.children[char] = TrieNode() node = node.children[char] node.is_end = True node.patterns.append(pattern) # Search all patterns simultaneously results = {pattern: [] for pattern in patterns} for start in range(len(text)): node = root for i in range(start, len(text)): if text[i] not in node.children: break node = node.children[text[i]] if node.is_end: for pattern in node.patterns: results[pattern].append(start) return results # Find multiple patterns in one pass! text = "ushers" patterns = ["he", "she", "hers"] matches = find_multiple_patterns(text, patterns) print(f"Found: {matches}") # Output: {'he': [2], 'she': [1], 'hers': [2]}

Quick Reference

Algorithm Comparison Chart

Algorithm Time Complexity When to Use Plain English Speed
Naive O(n × m) Short patterns, simple cases 🐢 Slow but simple
KMP O(n + m) Need guaranteed linear time ⚡ Fast and consistent
Rabin-Karp O(n + m) average Multiple pattern search 🎲 Usually fast
Boyer-Moore O(n/m) best case Long patterns, large alphabet 🚀 Super fast for long patterns
Z-Algorithm O(n + m) Need prefix matches ⚡ Linear and elegant
Aho-Corasick O(n + m + z) Multiple patterns at once 💫 Find everything in one pass

Use Naive When...

  • Pattern is very short (< 10 chars)
  • Text is small
  • Simplicity matters most
  • One-time search

Use KMP When...

  • Pattern has repetitive parts
  • Need guaranteed O(n) time
  • Streaming data
  • Can't go backwards

Use Boyer-Moore When...

  • Pattern is long
  • Large alphabet (like English)
  • Speed is critical
  • Text is very long

Use Rabin-Karp When...

  • Searching multiple patterns
  • Pattern matching in 2D
  • Plagiarism detection
  • Rolling hash useful

Common Pitfalls to Avoid

Pitfall #1: Hash Collisions in Rabin-Karp

Always double-check matches character by character when hashes match!

Pitfall #2: Off-by-One in Loop Bounds

Remember: range(len(text) - len(pattern) + 1) not just - len(pattern)

Pitfall #3: Not Handling Empty Patterns

Always check if pattern is empty before searching!