Measure First, Optimize Second
Why Performance Matters
The Problem: Naive Python can run tens of times slower than necessary, yet optimizing the wrong thing wastes weeks and complicates code for no gain.
The Solution: Profile first (cProfile, py-spy), then optimize the hot path: built-ins over loops, NumPy for arrays, caching for pure functions, multiprocessing for CPU-bound work, asyncio for I/O-bound work.
Real Impact: Measured optimization turns slow scripts into responsive services without rewriting them in C.
Real-World Analogy
Think of performance work as tuning a race car:
- Profiler = the lap timer that says which corner you're losing seconds in
- Hot loop = the corner where 80% of your time goes — fix this one first
- NumPy = swapping bicycle wheels for racing slicks when the work is numeric
- Cache = memorizing the racing line so you don't recompute it every lap
- PyPy / Cython = swapping the engine for a turbocharged one when the chassis is right
Donald Knuth's maxim: "Premature optimization is the root of all evil." Always profile before changing code, because humans are terrible at guessing where time is spent.
Quick Timing with timeit
import timeit
# Time a snippet — runs many iterations, returns total seconds
timeit.timeit("sum(range(1000))", number=10000)
# Compare two implementations
t1 = timeit.timeit("sum([x*x for x in range(100)])", number=10000)
t2 = timeit.timeit("sum(x*x for x in range(100))", number=10000)
print(f"list comp: {t1:.3f}s, gen exp: {t2:.3f}s")
$ python -m timeit -s 'data = list(range(10000))' 'sorted(data)'
10000 loops, best of 5: 327 usec per loop
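A single timeit.timeit call can be skewed by whatever else the machine is doing. One common way to reduce that noise, sketched below, is to repeat the measurement and keep the minimum, which is also what the command-line "best of 5" figure reports. The helper name best_time and its default values are illustrative assumptions, not part of the timeit module.

```python
import timeit


def best_time(stmt, setup="pass", repeat=5, number=1000):
    """Return the fastest per-call time in seconds across several runs.

    Hypothetical helper: timeit.repeat returns one total per run, and the
    minimum is the run least disturbed by other activity on the machine.
    """
    totals = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    return min(totals) / number


list_comp = best_time("sum([x * x for x in range(100)])")
gen_exp = best_time("sum(x * x for x in range(100))")
print(f"list comp: {list_comp * 1e6:.2f} usec, gen exp: {gen_exp * 1e6:.2f} usec")
```

The absolute numbers vary by machine and Python version, so treat them as a relative comparison only.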
cProfile — Where Time Is Spent
$ python -m cProfile -s cumulative app.py | head -20
         5083 function calls in 0.052 seconds
   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1    0.000    0.000    0.052    0.052  app.py:1(<module>)
        1    0.000    0.000    0.045    0.045  app.py:5(slow_function)
     5000    0.030    0.000    0.030    0.000  app.py:8(inner)
Inline Profiling
import cProfile
cProfile.run("slow_function()", sort="cumulative")
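When only part of a program should be profiled, or the report should be trimmed, the Profile object can be driven directly and the results passed to pstats. The sketch below assumes stand-in definitions of slow_function and inner that mirror the names in the sample output above; the real functions are not shown in the source.

```python
import cProfile
import io
import pstats


def inner(i):
    # Stand-in for the hot inner function (assumed body).
    return i * i


def slow_function():
    # Stand-in for the function being investigated (assumed body).
    return sum(inner(i) for i in range(5000))


profiler = cProfile.Profile()
profiler.enable()            # start collecting
total = slow_function()
profiler.disable()           # stop collecting

# Write the report into a string so it can be printed, logged, or saved.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)   # keep only the top 5 rows
report = buffer.getvalue()
print(report)
```

Sorting by "cumulative" surfaces the call trees that cost the most overall; sorting by "tottime" instead highlights the individual functions where time is spent directly.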
py-spy — Sampling Profiler for Production
py-spy attaches to a running process without changing it — perfect for profiling live services.
$ pip install py-spy
$ py-spy top --pid 12345 # live top-like view
$ py-spy record -o flame.svg --pid 12345 # produce a flame graph
Common Speedups
1. Use Built-ins and the Standard Library
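Built-ins are implemented in C. sum, min, max, any, all, and sorted are usually much faster than a Python loop doing the same thing. map and filter help most when they are given a built-in or other C-implemented function rather than a lambda.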
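As a rough illustration of the point above, the following sketch times a hand-written accumulation loop against the built-in sum on the same data. The function name manual_sum and the data size are assumptions chosen for the example.

```python
import timeit

values = list(range(10_000))


def manual_sum(numbers):
    """Hypothetical hand-written equivalent of sum()."""
    total = 0
    for n in numbers:
        total += n
    return total


# Both approaches must agree before their speed is worth comparing.
loop_result = manual_sum(values)
builtin_result = sum(values)

loop_time = timeit.timeit(lambda: manual_sum(values), number=200)
builtin_time = timeit.timeit(lambda: sum(values), number=200)
print(f"loop: {loop_time:.4f}s, built-in: {builtin_time:.4f}s")
```

Checking that the results match first is a useful habit: a faster version that returns a different answer is not an optimization.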
2. Comprehensions over append loops
# Slow
result = []
for x in items:
    result.append(x * 2)
# Faster: often around 25% in CPython, but measure on your version
result = [x * 2 for x in items]
3. Local variable lookups beat global
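import math

# In a hot loop, bind globals to locals.
# The gain is usually small, so measure before keeping it.
def crunch(data):
    _sin = math.sin  # local binding avoids a global and attribute lookup per call
    return [_sin(x) for x in data]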
4. Use sets for membership tests
# O(n): scans the list item by item, slow for big lists
if item in ["a", "b", "c", "d", ...]:
    ...
# O(1) on average: hash lookup, preferred
ALLOWED = {"a", "b", "c", "d", ...}
if item in ALLOWED:
    ...
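To see the difference in practice, the sketch below probes for the last element of a 10,000-item collection, which is the worst case for a list scan. The names and sizes are made up for illustration.

```python
import timeit

names = [f"user{i}" for i in range(10_000)]
name_list = names
name_set = set(names)      # one-time O(n) cost to build the set
probe = "user9999"         # last element: worst case for the list scan

list_time = timeit.timeit(lambda: probe in name_list, number=200)
set_time = timeit.timeit(lambda: probe in name_set, number=200)
print(f"list: {list_time:.5f}s, set: {set_time:.5f}s")
```

Building the set is itself a linear pass, so the conversion pays off when the same collection is tested many times, not for a single lookup.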
5. Concatenate strings via join
# Up to O(n²): repeated concatenation can copy the growing string each time
s = ""
for chunk in chunks:
    s += chunk
# O(n): join all the pieces in a single pass
s = "".join(chunks)
NumPy for Numeric Work
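For large arrays of numbers, NumPy is often 10-100× faster than a pure-Python loop because operations are vectorized in C. For very small arrays, the overhead of calling into NumPy can cancel out the gain.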
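import numpy as np
# Sample data (assumed for illustration)
xs = [float(i) for i in range(100_000)]
ys = [float(i) for i in range(100_000)]
# Pure Python: slow
result = [a * b for a, b in zip(xs, ys)]
# NumPy: vectorized, roughly 50x faster for big arrays
xs_arr = np.array(xs)
ys_arr = np.array(ys)
result = xs_arr * ys_arr  # element-wise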
Avoid Python loops over NumPy arrays
If you write for x in arr, you've lost most of the speedup. Use vectorized operations: arr + 1, np.where, arr.sum(axis=0), etc.
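The vectorized operations named above can be sketched on a tiny array so the results are easy to check by hand. The array values are arbitrary sample data.

```python
import numpy as np

arr = np.array([[1.0, -2.0, 3.0],
                [-4.0, 5.0, -6.0]])

shifted = arr + 1                       # element-wise add, no Python loop
clipped = np.where(arr < 0, 0.0, arr)   # replace negative entries with 0
column_sums = arr.sum(axis=0)           # sum down each column

print(shifted.tolist())      # [[2.0, -1.0, 4.0], [-3.0, 6.0, -5.0]]
print(clipped.tolist())      # [[1.0, 0.0, 3.0], [0.0, 5.0, 0.0]]
print(column_sums.tolist())  # [-3.0, 3.0, -3.0]
```

Each line replaces what would otherwise be a nested Python loop, and the loop itself runs inside NumPy's compiled code.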
Caching and Compiling Hot Code
functools.cache for Memoization
from functools import cache  # Python 3.9+

@cache
def fib(n):
    return n if n < 2 else fib(n-1) + fib(n-2)

fib(100)  # returns instantly; without the cache, the naive recursion would not finish in practical time
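To confirm that a cache is actually being hit, the wrapper returned by functools.cache exposes cache_info() and cache_clear(). The sketch below uses a smaller input than the example above so the counts are easy to verify.

```python
from functools import cache


@cache
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


result = fib(30)
info = fib.cache_info()
print(result)                   # 832040
print(info.hits, info.misses)   # 28 31

# The cache grows without bound; clear it when the stored values are
# no longer needed or the underlying data may have changed.
fib.cache_clear()
cleared = fib.cache_info()
```

Each of the 31 distinct arguments (0 through 30) is computed once, which is why the miss count equals 31. Caching is only safe for pure functions, whose result depends solely on their arguments.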
Cython / mypyc / Nuitka — Compile to C
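For CPU-bound hot paths, compiling Python to C extensions can give large speedups. With Cython, 10-100× is achievable once the hot loops are given static types; untyped code usually gains much less. mypyc relies on type annotations, and Nuitka's gains are often more modest. Cython is the most popular.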
PyPy — Drop-in JIT
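PyPy is an alternative Python implementation with a JIT compiler. Long-running, CPU-bound, pure-Python programs can see 5-10× speedups without code changes, though the gain depends on the workload. Caveats: the JIT needs warm-up time, so short scripts may not benefit, and some C extensions run slower or don't work.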
🎯 Practice Exercises
Exercise 1: Profile a script
Run cProfile on a script you wrote. Identify the function with highest cumulative time.
Exercise 2: timeit comparison
Compare a list comprehension, a generator expression, and an explicit loop for summing the squares of 10k integers.
Exercise 3: NumPy speedup
Implement matrix multiplication (small matrices) in pure Python vs NumPy. Measure both.
Exercise 4: Memoize
Take a slow recursive function. Add @functools.cache. Compare timings.