Performance Optimization

Measure First, Optimize Second

Why Performance Matters

The Problem: Naive Python code can run an order of magnitude or more slower than it needs to, and optimizing the wrong thing wastes weeks and complicates code for no gain.

The Solution: Profile first (cProfile, py-spy), then optimize only the hot path: built-ins over hand-written loops, NumPy for numeric arrays, caching for pure functions, multiprocessing for CPU-bound work, asyncio for I/O-bound work.

Real Impact: Measured optimization turns slow scripts into responsive services without rewriting them in C.

Real-World Analogy

Think of performance work as tuning a race car:

Profiler = the lap timer that says which corner you're losing seconds in
Hot loop = the corner where 80% of your time goes — fix this one first
NumPy = swapping bicycle wheels for racing slicks for numeric work
Cache = memorizing the racing line so you don't recompute it every lap
PyPy / Cython = swapping the engine for a turbocharged one when the chassis is right

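Donald Knuth's maxim: "Premature optimization is the root of all evil." He was talking about small efficiencies that do not matter most of the time, not about the few critical spots that do. Always profile before changing code, because humans are terrible at guessing where time is spent.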

Quick Timing with timeit

import timeit

# Time a snippet: runs it `number` times and returns the total seconds
elapsed = timeit.timeit("sum(range(1000))", number=10000)
print(f"{elapsed:.3f}s total")

# Compare two implementations
t1 = timeit.timeit("sum([x*x for x in range(100)])", number=10000)
t2 = timeit.timeit("sum(x*x for x in range(100))", number=10000)
print(f"list comp: {t1:.3f}s, gen exp: {t2:.3f}s")

$ python -m timeit -s 'data = list(range(10000))' 'sorted(data)'
1000 loops, best of 5: 327 usec per loop
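
When the code under test already lives in functions, it can be more convenient to pass callables to timeit instead of strings. The sketch below is one way to do that, assuming two hypothetical functions (sum_squares_list, sum_squares_gen) and a hypothetical helper best_time. It takes the minimum of several repeats, since the fastest run is usually the one least disturbed by other activity on the machine.

```python
import timeit


def sum_squares_list(n=100):
    """Sum of squares via a list comprehension (builds a temporary list)."""
    return sum([x * x for x in range(n)])


def sum_squares_gen(n=100):
    """Sum of squares via a generator expression (no temporary list)."""
    return sum(x * x for x in range(n))


def best_time(func, number=10_000, repeat=5):
    """Fastest total time in seconds over `repeat` runs of `number` calls each."""
    return min(timeit.repeat(func, number=number, repeat=repeat))


if __name__ == "__main__":
    for func in (sum_squares_list, sum_squares_gen):
        print(f"{func.__name__}: {best_time(func):.4f}s for 10,000 calls")
```

Absolute numbers depend on the machine and Python version, so only compare timings taken in the same session.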

cProfile — Where Time Is Spent

$ python -m cProfile -s cumulative app.py | head -20
         5083 function calls in 0.052 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.052    0.052 app.py:1(<module>)
        1    0.000    0.000    0.045    0.045 app.py:5(slow_function)
     5000    0.030    0.000    0.030    0.000 app.py:8(inner)

Inline Profiling

import cProfile

cProfile.run("slow_function()", sort="cumulative")
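
For more control than cProfile.run offers, the profiler can be driven directly and its results formatted with pstats. The following is a minimal sketch under stated assumptions: slow_function and inner are hypothetical stand-ins for the functions in the sample output above, and profile_report is a hypothetical helper, not part of the standard library.

```python
import cProfile
import io
import pstats


def inner(i):
    """Stand-in for a small function called many times."""
    return i * i


def slow_function(n=5000):
    """Stand-in for the hot path: calls inner() n times."""
    return sum(inner(i) for i in range(n))


def profile_report(func, top=5):
    """Profile one call to `func` and return the report text, sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(slow_function))
```

Returning the report as a string makes it easy to log it or write it to a file instead of printing.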

py-spy — Sampling Profiler for Production

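py-spy attaches to a running process without modifying or restarting it, which makes it well suited to profiling live services. Attaching to an existing process may require elevated privileges (for example, sudo).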

$ pip install py-spy
$ py-spy top --pid 12345          # live top-like view
$ py-spy record -o flame.svg --pid 12345    # produce a flame graph

Common Speedups

1. Use Built-ins and the Standard Library

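Built-ins are implemented in C. sum, min, max, any, all, and sorted are usually much faster than a Python loop doing the same thing. map and filter help mainly when you pass them a built-in or other C-implemented function; with a lambda, a comprehension is typically just as fast and easier to read.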
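
To check the claim on your own machine, a small comparison like the one below can help. It is only a sketch: total_with_loop and total_with_builtin are hypothetical names, and the list size and repeat counts are arbitrary choices.

```python
import timeit


def total_with_loop(values):
    """Add the values one by one in a Python-level loop."""
    total = 0
    for v in values:
        total += v
    return total


def total_with_builtin(values):
    """Let the C-implemented built-in do the iteration."""
    return sum(values)


if __name__ == "__main__":
    data = list(range(100_000))
    # Both versions must agree before their timings mean anything.
    assert total_with_loop(data) == total_with_builtin(data)

    for func in (total_with_loop, total_with_builtin):
        seconds = min(timeit.repeat(lambda: func(data), number=20, repeat=5))
        print(f"{func.__name__}: {seconds:.4f}s for 20 calls")
```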

2. Comprehensions over append loops

# Slow
result = []
for x in items:
    result.append(x * 2)

# Usually faster in CPython; the exact gain varies by version and workload
result = [x * 2 for x in items]

3. Local variable lookups beat global

# In a hot loop, bind globals and attribute lookups to locals
import math

def crunch(data):
    _sin = math.sin       # local binding: one lookup instead of one per item
    return [_sin(x) for x in data]

4. Use sets for membership tests

# O(n) per lookup: scans the list, slow for big lists
if item in ["a", "b", "c", "d", ...]:
    ...

# O(1) on average: hash lookup, preferred
ALLOWED = {"a", "b", "c", "d", ...}
if item in ALLOWED:
    ...
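
The difference only matters once the collection is large or the test runs many times. The sketch below measures it, assuming a hypothetical helper count_hits and arbitrary made-up data sizes.

```python
import timeit


def count_hits(candidates, allowed):
    """Count how many candidates appear in `allowed` (works for a list or a set)."""
    return sum(1 for c in candidates if c in allowed)


if __name__ == "__main__":
    allowed_list = [str(i) for i in range(1_000)]
    allowed_set = set(allowed_list)
    candidates = [str(i) for i in range(0, 2_000, 2)]

    # Same answer either way; only the lookup cost differs.
    assert count_hits(candidates, allowed_list) == count_hits(candidates, allowed_set)

    for name, allowed in (("list", allowed_list), ("set", allowed_set)):
        seconds = min(
            timeit.repeat(lambda: count_hits(candidates, allowed), number=3, repeat=3)
        )
        print(f"{name}: {seconds:.4f}s for 3 passes")
```

Building the set has its own one-time cost, so the conversion pays off when the same collection is tested repeatedly.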

5. Concatenate strings via join

# Worst case O(n²): each += may copy the whole string built so far
s = ""
for chunk in chunks:
    s += chunk

# O(n): join sizes the result once and copies each chunk once
s = "".join(chunks)

NumPy for Numeric Work

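For large arrays of numbers, NumPy is often 10-100× faster than pure Python because each operation runs as a loop in compiled C code instead of in the interpreter. On tiny arrays the call overhead can outweigh the gain.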

import numpy as np

xs = [1.0, 2.0, 3.0]
ys = [4.0, 5.0, 6.0]

# Pure Python: one interpreter-level multiply per pair
result = [a * b for a, b in zip(xs, ys)]

# NumPy: vectorized; the speedup shows up on big arrays
xs_arr = np.array(xs)
ys_arr = np.array(ys)
result = xs_arr * ys_arr          # element-wise

Avoid Python loops over NumPy arrays

If you write for x in arr, you've lost most of the speedup. Use vectorized operations: arr + 1, np.where, arr.sum(axis=0), etc.
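
As an illustration of the vectorized operations named above, here is a small sketch that assumes NumPy is installed. The function names (clip_negatives_loop, clip_negatives_vectorized, column_totals) are hypothetical and chosen only for this example.

```python
import numpy as np


def clip_negatives_loop(arr):
    """Python-level loop over a 1-D array: one interpreter step per element."""
    return np.array([x if x > 0 else 0.0 for x in arr])


def clip_negatives_vectorized(arr):
    """Same result from a single vectorized call; works for any shape."""
    return np.where(arr > 0, arr, 0.0)


def column_totals(matrix):
    """Sum down the rows of a 2-D array, giving one total per column."""
    return matrix.sum(axis=0)


if __name__ == "__main__":
    readings = np.array([1.5, -2.0, 3.0, -0.5])
    print(clip_negatives_loop(readings))
    print(clip_negatives_vectorized(readings))
    print(readings + 1)  # adds 1 to every element without a loop
    print(column_totals(np.array([[1, 2], [3, 4]])))
```

Both clipping functions return the same values; the vectorized one keeps the per-element work inside NumPy's compiled code.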

Caching and Compiling Hot Code

functools.cache for Memoization

from functools import cache    # Python 3.9+; use lru_cache(maxsize=None) on older versions

@cache
def fib(n):
    return n if n < 2 else fib(n-1) + fib(n-2)

fib(100)    # instant — without cache, this would take forever
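
To see why the cache helps so much, it can be useful to count how often the function body actually runs. The sketch below does that with a hypothetical helper, count_calls, which wraps the same recursive definition with or without functools.cache. Note that cache only works when the arguments are hashable.

```python
from functools import cache


def count_calls(n, memoize):
    """Return (fib(n), number of times the function body executed)."""
    calls = 0

    def fib(k):
        nonlocal calls
        calls += 1
        return k if k < 2 else fib(k - 1) + fib(k - 2)

    if memoize:
        # Rebinding the name routes the recursive calls through the cache too.
        fib = cache(fib)

    result = fib(n)
    return result, calls


if __name__ == "__main__":
    for memoize in (False, True):
        result, calls = count_calls(20, memoize)
        print(f"memoize={memoize}: fib(20)={result}, body ran {calls} times")
```

Without the cache the number of calls grows exponentially with n; with it, each distinct argument is computed once.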

Cython / mypyc / Nuitka — Compile to C

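For CPU-bound hot paths, compiling Python to C extensions can give large speedups, in favorable cases 10-100×. The biggest gains usually require static type information (Cython type declarations, or type annotations for mypyc); compiling untyped code as-is tends to help much less. Cython is the most widely used of the three.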

PyPy — Drop-in JIT

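PyPy is an alternative Python implementation with a JIT compiler. Long-running, CPU-bound, pure-Python programs often run several times faster without code changes. Caveats: startup is slower and the JIT needs time to warm up, so short scripts may see no benefit, and some C extensions run slowly or don't work at all.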

🎯 Practice Exercises

Exercise 1: Profile a script

Run cProfile on a script you wrote. Identify which of your functions has the highest cumulative time.

Exercise 2: timeit comparison

Compare a list comprehension, a generator expression, and an explicit loop for summing the squares of 10,000 integers.

Exercise 3: NumPy speedup

Implement matrix multiplication (small matrices) in pure Python vs NumPy. Measure both.

Exercise 4: Memoize

Take a slow recursive function. Add @functools.cache. Compare timings.