Strings

Easy · 35 min read

String Basics

Why Text Handling Matters

The Problem: Most real-world data starts as text — log lines, API responses, user input, file paths — and getting it wrong cascades into encoding bugs and security holes.

The Solution: Python 3 makes Unicode the default, separates text (str) from bytes, and offers f-strings for ergonomic formatting and the re module for industrial-strength regex.

Real Impact: Knowing the str API and f-string format specs often lets you replace dozens of lines of fiddly text-processing code with a few declarative ones.

Real-World Analogy

Think of strings as books in a library:

  • str = the book itself — characters in order, immutable
  • bytes = the raw scan of every page — encoding-specific
  • encode/decode = translating between the book and its scan
  • f-strings = filling in a Mad Libs template at runtime
  • re = a librarian who finds every page matching a pattern

Python strings (str) are immutable sequences of Unicode code points. Single and double quotes are equivalent; triple quotes (''' or """) allow multi-line literals.

name = "Alice"
bio = '''Multi-line
doc string.'''

# Concatenation, repetition, indexing
greeting = "hello" + " " + name
line = "-" * 20
first = name[0]              # 'A'
last = name[-1]               # 'e'

# Slicing — start:stop:step
print(name[1:4])           # 'lic'
print(name[::-1])          # 'ecilA' — reverse trick

# Membership and iteration
if "li" in name:
    ...
for ch in name:
    print(ch)

Strings are immutable

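name[0] = 'B' raises TypeError. Build a new string instead, for example 'B' + name[1:]. When assembling a string from many pieces, collect them in a list and call ''.join(parts) once at the end: that is O(n) overall, whereas repeated concatenation can degrade to O(n²) because each + may copy everything built so far.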
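
The immutability rule and the join idiom can be sketched as follows. This is a minimal illustration; the helper name build_csv_row and the sample values are assumptions made for the example, not part of any library.

```python
def build_csv_row(values):
    """Join stringified values with commas using a single join call."""
    parts = []
    for value in values:
        parts.append(str(value))    # cheap list append, no string copying
    return ",".join(parts)          # one pass over all the pieces


name = "Alice"
renamed = "B" + name[1:]            # builds a new string; `name` is unchanged

# In-place assignment is rejected because str is immutable.
try:
    name[0] = "B"
except TypeError as exc:
    error_message = str(exc)

print(build_csv_row([1, "a", 2.5]))
print(renamed, name)
print(error_message)
```

For a handful of pieces, plain + or an f-string is just as good; the list-and-join pattern pays off mainly inside loops.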

f-strings (PEP 498)

Formatted string literals — the modern way to interpolate. Embed any expression inside {}.

name = "Alice"
age = 30
price = 1234.567

print(f"{name} is {age}")
print(f"{name.upper()} — {age * 12} months")

# Format specs after a colon
print(f"{price:.2f}")          # '1234.57'
print(f"{price:,.2f}")         # '1,234.57'
print(f"{42:_>10}")            # '________42' — right-align, fill with _
print(f"{255:#x}")              # '0xff'

# Debug syntax (Python 3.8+)
print(f"{name=}, {age=}")        # name='Alice', age=30

Format Specifiers Quick Reference

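Spec    Meaning                     Example
.2f     Fixed-point, 2 decimals     f"{3.14159:.2f}" → '3.14'
,       Thousands separator         f"{1000000:,}" → '1,000,000'
.2e     Scientific notation         f"{1500:.2e}" → '1.50e+03'
.0%     Percentage                  f"{0.25:.0%}" → '25%'
>10     Right-align in 10 chars     f"{'hi':>10}" → '        hi'
^10     Center in 10 chars          f"{'hi':^10}" → '    hi    '
0>5d    Zero-pad to 5               f"{42:0>5d}" → '00042'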
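
The specs in the quick reference can be checked directly. The sketch below also shows two related details: the built-in format() accepts the same spec strings, and a width can be supplied from a variable with nested braces. The variable names and the width of 10 are illustrative assumptions.

```python
price = 1234.567

# format() takes the same spec string that follows the colon in an f-string.
fixed = format(price, ".2f")
grouped = f"{price:,.2f}"
scientific = f"{1500:.2e}"
percent = f"{0.25:.0%}"

# Width can itself be an expression, written in nested braces.
width = 10                      # hypothetical column width
label = "hi"
right = f"{label:>{width}}"
centered = f"{label:^{width}}"
padded = f"{42:0>5d}"

for value in (fixed, grouped, scientific, percent, right, centered, padded):
    print(repr(value))
```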

Essential String Methods

s = "  Hello, World!  "

s.strip()              # 'Hello, World!' — removes leading/trailing whitespace
s.lower()              # '  hello, world!  '
s.upper()              # '  HELLO, WORLD!  '
s.title()              # '  Hello, World!  '

# Search
s.find("World")        # 9 (index) or -1 if not found
s.index("World")       # like find but raises ValueError
s.count("l")           # 3
"World" in s          # True
s.startswith("  H")    # True
s.endswith((".", "!", "  "))   # True — accepts tuple of options

# Replace and split
s.replace("World", "Python")   # '  Hello, Python!  ' (returns a new string)
"a,b,c".split(",")       # ['a', 'b', 'c']
"a b\nc".split()         # ['a', 'b', 'c'] — whitespace split
",".join(["a", "b", "c"])     # 'a,b,c'

# Predicates (return bool)
"abc123".isalnum()       # True
"123".isdigit()          # True
"hello".isascii()        # True

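The methods above compose well in small cleanup helpers. Here is a hedged sketch using only calls from this section; normalize_spaces and parse_version are hypothetical names chosen for the example, and the "v1.2.3" tag format is an assumption.

```python
def normalize_spaces(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


def parse_version(tag):
    """Turn a tag such as ' v1.2.3 ' into [1, 2, 3]; return None if malformed."""
    tag = tag.strip().lower()
    if tag.startswith("v"):
        tag = tag[1:]
    parts = tag.split(".")
    # isdigit() alone also accepts characters like '²', which int() rejects,
    # so require ASCII as well.
    if not all(part.isascii() and part.isdigit() for part in parts):
        return None
    return [int(part) for part in parts]


print(normalize_spaces("  Hello,   World!\n "))
print(parse_version(" V1.2.3 "))
print(parse_version("1.x"))
```

Checking isascii() alongside isdigit() is a defensive choice: the predicates are Unicode-aware, so they can accept input that later numeric conversion does not.
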
Unicode and Encoding

Python 3 strings are Unicode by default. Bytes are a separate type for binary data.

s = "café"
len(s)                       # 4 — code points

# Encode str → bytes (for files, network, etc.)
data = s.encode("utf-8")        # b'caf\xc3\xa9' — 5 bytes
len(data)                    # 5

# Decode bytes → str
data.decode("utf-8")         # 'café'

# Get the code point of a character
ord("é")                  # 233
chr(233)                   # 'é'

⚠️ Always specify encoding when reading files

Use open(path, encoding="utf-8") and never rely on the platform default. On Windows the default is typically a legacy code page such as cp1252, so UTF-8 files containing non-ASCII text come back mangled.
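
To see why a mismatched encoding matters, the following sketch decodes the same bytes with the right and the wrong codec, then round-trips text through a file with the encoding stated on both sides. The file name note.txt and the temporary directory are assumptions made for the example.

```python
import tempfile
from pathlib import Path

text = "café"
utf8_bytes = text.encode("utf-8")          # b'caf\xc3\xa9'

# Decoding with the wrong codec "succeeds" but produces mojibake.
mojibake = utf8_bytes.decode("cp1252")     # 'cafÃ©'

# The reverse mismatch fails loudly: a lone 0xE9 is not valid UTF-8.
try:
    text.encode("cp1252").decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
lossy = text.encode("cp1252").decode("utf-8", errors="replace")

# Round trip through a file, naming the encoding for both write and read.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "note.txt"          # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

print(repr(mojibake), strict_failed, repr(lossy), repr(round_trip))
```

The silent case (mojibake) is the more dangerous one, since nothing raises and the damage is only noticed later.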

Regular Expressions: the re module

import re

# Search anywhere in the string
m = re.search(r"(\d+)-(\d+)", "order 1234-5678 received")
if m:
    print(m.group(0))          # '1234-5678' (whole match)
    print(m.group(1))          # '1234' (first group)
    print(m.groups())          # ('1234', '5678')

# Match must start at beginning
re.match(r"hello", "hello world")

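# Find all matches (sample text for illustration)
text = "contact alice@example.com or bob@example.org"
re.findall(r"\w+@\w+\.\w+", text)    # ['alice@example.com', 'bob@example.org']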

# Substitute
re.sub(r"\s+", " ", "too    much    space")
# 'too much space'

# Compile when reusing
email_re = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")
lines = ["to: alice@example.com", "no address here"]   # sample input
for line in lines:
    for email in email_re.findall(line):
        print(email)               # alice@example.com
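
The difference between searching and matching is easy to get wrong, so here is a small sketch comparing search, match, and fullmatch on one compiled pattern. ORDER_RE and order_parts are hypothetical names; fullmatch is not covered above and is included only for contrast.

```python
import re

ORDER_RE = re.compile(r"(\d+)-(\d+)")


def order_parts(message):
    """Return the two number groups from the first order ID, or None."""
    m = ORDER_RE.search(message)
    return m.groups() if m else None


text = "order 1234-5678 received"

searched = ORDER_RE.search(text)            # scans the whole string
matched = ORDER_RE.match(text)              # None: text does not start with digits
full = ORDER_RE.fullmatch("1234-5678")      # whole string must fit the pattern

print(order_parts(text))
print(searched.span(), matched, full.group(0))
```

Checking the result for None before calling group() or groups() avoids an AttributeError when nothing matches.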

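Always use raw strings for patterns

Write r"\d+" instead of "\d+". In a normal string literal Python processes backslash escapes before the regex engine sees the pattern, so \b becomes a backspace character instead of a word boundary, and unrecognized escapes such as \d are deprecated and raise a SyntaxWarning from Python 3.12.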
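
A quick way to see the effect is to compare the two literals. This is a minimal sketch; the sample sentence is an assumption.

```python
import re

plain = "\bcat\b"      # each \b here is a single backspace character (U+0008)
raw = r"\bcat\b"       # each \b here is backslash + 'b', a regex word boundary

sentence = "a cat sat"

raw_hit = re.search(raw, sentence)          # matches the word 'cat'
plain_hit = re.search(plain, sentence)      # looks for backspace-cat-backspace

print(len(plain), len(raw))
print(raw_hit.group(0) if raw_hit else None, plain_hit)
```

The non-raw pattern is still a valid regex, which is why this mistake tends to show up as a silent non-match and not as an error.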

🎯 Practice Exercises

Exercise 1: Slugify

Convert "Hello, World! 2026" into "hello-world-2026" — lowercase, alphanumerics only, words joined by dashes.

Exercise 2: Word frequency

Count words in a paragraph. Lowercase, strip punctuation, split on whitespace, return a sorted list of (count, word).

Exercise 3: Email extractor

Use re.findall to extract all email addresses from a multi-line string.

Exercise 4: Format a table

Given a list of dicts with name and score, format them as an aligned ASCII table using f-string format specs.