String Basics
Why Text Handling Matters
The Problem: Most real-world data starts as text — log lines, API responses, user input, file paths — and getting it wrong cascades into encoding bugs and security holes.
The Solution: Python 3 makes Unicode the default, separates text (str) from bytes, and offers f-strings for ergonomic formatting and the re module for industrial-strength regex.
Real Impact: Knowing the str API and f-string format specs can shrink a text-processing task from dozens of lines of fiddly code to a handful of declarative ones.
Real-World Analogy
Think of strings as books in a library:
- str = the book itself — characters in order, immutable
- bytes = the raw scan of every page — encoding-specific
- encode/decode = translating between the book and its scan
- f-strings = filling in a Mad Libs template at runtime
- re = a librarian who finds every page matching a pattern
Python strings (str) are immutable sequences of Unicode code points. Single and double quotes are equivalent; triple quotes (''' or """) allow multi-line literals.
name = "Alice"
bio = '''Multi-line
doc string.'''
# Concatenation, repetition, indexing
greeting = "hello" + " " + name
line = "-" * 20
first = name[0] # 'A'
last = name[-1] # 'e'
# Slicing — start:stop:step
print(name[1:4]) # 'lic'
print(name[::-1]) # 'ecilA' — reverse trick
# Membership and iteration
if "li" in name:
    ...
for ch in name:
    print(ch)
Strings are immutable
name[0] = 'B' raises TypeError. Build a new string instead. For lots of incremental building, use a list and ''.join(parts) at the end — O(n) vs O(n²) for repeated concatenation.
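A minimal sketch of both points: the TypeError raised by item assignment, and the list-plus-join idiom for incremental building. The helper name build_csv_row is hypothetical, chosen only for illustration.

```python
name = "Alice"

# Item assignment on a str raises TypeError because strings are immutable.
try:
    name[0] = "B"
except TypeError as exc:
    error_message = str(exc)

# Build a new string instead of modifying the old one.
renamed = "B" + name[1:]


def build_csv_row(values):
    """Hypothetical helper: collect pieces in a list, join once at the end."""
    parts = []
    for value in values:
        parts.append(str(value))
    return ",".join(parts)


row = build_csv_row([1, 2.5, "x"])
print(renamed)  # Blice
print(row)      # 1,2.5,x
```

Appending to a list and joining once copies each character a single time, whereas repeated += on a string may copy the growing result on every step.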
f-strings (PEP 498)
Formatted string literals — the modern way to interpolate. Embed any expression inside {}.
name = "Alice"
age = 30
price = 1234.567
print(f"{name} is {age}")
print(f"{name.upper()} — {age * 12} months")
# Format specs after a colon
print(f"{price:.2f}") # '1234.57'
print(f"{price:,.2f}") # '1,234.57'
print(f"{42:_>10}") # '________42' — right-align, fill with _
print(f"{255:#x}") # '0xff'
# Debug syntax (Python 3.8+)
print(f"{name=}, {age=}") # name='Alice', age=30
Format Specifiers Quick Reference
| Spec | Meaning | Example |
|---|---|---|
| .2f | Fixed-point, 2 decimals | 3.14 |
| , | Thousands separator | 1,000,000 |
| .2e | Scientific notation, 2 decimals | 1.50e+03 |
| .0% | Percentage, 0 decimals | {0.25:.0%} → 25% |
| >10 | Right-align in 10 chars | '        hi' |
| ^10 | Center in 10 chars | '    hi    ' |
| 0>5d | Zero-pad to 5 | '00042' |
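The specs in the table can be checked directly in an interpreter. The sketch below applies each one to a sample value (the variable names are illustrative) and adds one nested-field example, where the width and precision themselves come from variables.

```python
label = "hi"

fixed = f"{3.14159:.2f}"       # '3.14'
grouped = f"{1000000:,}"       # '1,000,000'
scientific = f"{1500.0:.2e}"   # '1.50e+03'
percent = f"{0.25:.0%}"        # '25%'
right = f"{label:>10}"         # 8 spaces, then 'hi'
centered = f"{label:^10}"      # 4 spaces, 'hi', 4 spaces
padded = f"{42:0>5d}"          # '00042'

# Width and precision can be supplied by nested replacement fields.
width, places = 12, 1
nested = f"{1234.5:>{width},.{places}f}"  # '1,234.5' right-aligned in 12 chars

for result in (fixed, grouped, scientific, percent, right, centered, padded, nested):
    print(repr(result))
```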
Essential String Methods
s = "  Hello, World!  " # two spaces on each side
s.strip() # 'Hello, World!' (removes leading/trailing whitespace)
s.lower() # '  hello, world!  '
s.upper() # '  HELLO, WORLD!  '
s.title() # '  Hello, World!  '
# Search
s.find("World") # 9 (index) or -1 if not found
s.index("World") # like find but raises ValueError
s.count("l") # 3
"World" in s # True
s.startswith("  H") # True
s.endswith((".", "!", " ")) # True — accepts tuple of options
# Replace and split
s.replace("World", "Python")
"a,b,c".split(",") # ['a', 'b', 'c']
"a b\nc".split() # ['a', 'b', 'c'] — whitespace split
",".join(["a", "b", "c"]) # 'a,b,c'
# Predicates (return bool)
"abc123".isalnum() # True
"123".isdigit() # True
"hello".isascii() # True
Unicode and Encoding
Python 3 strings are Unicode by default. Bytes are a separate type for binary data.
s = "café"
len(s) # 4 — code points
# Encode str → bytes (for files, network, etc.)
data = s.encode("utf-8") # b'caf\xc3\xa9' — 5 bytes
len(data) # 5
# Decode bytes → str
data.decode("utf-8") # 'café'
# Get the code point of a character
ord("é") # 233
chr(233) # 'é'
⚠️ Always specify encoding when reading files
open(path, encoding="utf-8"): never rely on the platform default. On Windows the default is often a legacy code page such as cp1252 (it depends on the system locale), which will mangle non-ASCII text that was written as UTF-8.
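A small sketch of why the explicit encoding matters. It round-trips the sample string through a temporary file (the file name menu.txt is arbitrary) and then shows what decoding the same UTF-8 bytes as cp1252 produces.

```python
import os
import tempfile

text = "caf\u00e9"  # 'café' with a precomposed é (U+00E9)

# Round-trip through a file with an explicit encoding on both sides.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "menu.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_tripped = f.read()

# Decoding UTF-8 bytes with the wrong codec raises no error here;
# it silently yields mojibake instead.
mangled = text.encode("utf-8").decode("cp1252")

print(round_tripped)  # café
print(mangled)        # cafÃ©
```

The two UTF-8 bytes of é (0xC3 0xA9) are each a valid cp1252 character, which is why the wrong decode succeeds and the corruption goes unnoticed.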
Regular Expressions: the re module
import re
# Search anywhere in the string
m = re.search(r"(\d+)-(\d+)", "order 1234-5678 received")
if m:
    print(m.group(0)) # '1234-5678' (whole match)
    print(m.group(1)) # '1234' (first group)
    print(m.groups()) # ('1234', '5678')
# Match must start at beginning
re.match(r"hello", "hello world")
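# Find all matches (text can be any str; a sample value is used here)
text = "contact alice@example.com or bob@example.org"
re.findall(r"\w+@\w+\.\w+", text)
# ['alice@example.com', 'bob@example.org']
# Substitute
re.sub(r"\s+", " ", "too   much    space")
# 'too much space'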
# Compile when reusing
email_re = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")
lines = ["to: alice@example.com", "no address here"] # sample input
for line in lines:
    for email in email_re.findall(line):
        ...
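The difference between re.search and re.match is easy to miss, so here is a brief sketch on a sample string. It also uses re.fullmatch, which is not covered above but is part of the standard re module and is often handy for validating a whole string.

```python
import re

text = "say hello world"

found = re.search(r"hello", text)    # scans the whole string
anchored = re.match(r"hello", text)  # only tries position 0

print(found.start())  # 4
print(anchored)       # None

# fullmatch succeeds only if the entire string matches the pattern.
whole = re.fullmatch(r"\d{4}-\d{4}", "1234-5678")
partial = re.fullmatch(r"\d{4}-\d{4}", "order 1234-5678")

print(whole is not None)    # True
print(partial is not None)  # False
```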
Always use raw strings for patterns
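r"\d+" instead of "\d+": without the r prefix, Python processes backslash escapes before the regex engine sees the pattern. \b becomes a backspace character, and unrecognized escapes such as \d trigger a warning in recent Python versions (a SyntaxWarning as of 3.12).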
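To make the backspace pitfall concrete, this sketch compares the same pattern written with and without the r prefix. The sample sentence is arbitrary.

```python
import re

plain = "\bcat\b"  # each \b is a backspace character (U+0008) here
raw = r"\bcat\b"   # backslash and 'b' stay as two separate characters

print(len(plain))  # 5
print(len(raw))    # 7

sentence = "a cat sat"
raw_hit = re.search(raw, sentence)      # \b is a word boundary: matches 'cat'
plain_hit = re.search(plain, sentence)  # looks for literal backspaces: no match

print(raw_hit is not None)    # True
print(plain_hit is not None)  # False
```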
🎯 Practice Exercises
Exercise 1: Slugify
Convert "Hello, World! 2026" into "hello-world-2026" — lowercase, alphanumerics only, words joined by dashes.
Exercise 2: Word frequency
Count words in a paragraph. Lowercase, strip punctuation, split on whitespace, return a sorted list of (count, word).
Exercise 3: Email extractor
Use re.findall to extract all email addresses from a multi-line string.
Exercise 4: Format a table
Given a list of dicts with name and score, format them as an aligned ASCII table using f-string format specs.