Observability in Kubernetes
The Three Pillars of Observability
Production Kubernetes clusters need three complementary signals to understand system health: metrics, logs, and traces. Together they give you a complete picture of what is happening and why.
Metrics
Numerical time-series data about system performance: CPU, memory, request rates, error rates, and latency.
Logs
Detailed records of events: application logs, system logs, audit logs, and security events.
Traces
Request flow through services: distributed tracing, service dependencies, performance bottlenecks, and error propagation.
Monitoring Stack Architecture
The standard monitoring pipeline flows from applications through collection to visualization and alerting:
- Applications expose metrics endpoints
- Prometheus scrapes and stores time-series data
- Grafana visualizes metrics in dashboards
- Alertmanager routes notifications to the right teams
Logging Stack Architecture
Centralized logging follows a similar pattern:
- Nodes/Pods generate log output
- Fluentd or Fluent Bit collects and forwards logs
- Elasticsearch stores and indexes log data
- Kibana provides search and visualization
Stack Comparison
| Stack | Components | Use Case | Strengths |
|---|---|---|---|
| Prometheus Stack | Prometheus + Grafana + Alertmanager | Metrics & Monitoring | Native K8s support, powerful queries |
| ELK Stack | Elasticsearch + Logstash + Kibana | Log aggregation | Full-text search, rich visualizations |
| EFK Stack | Elasticsearch + Fluentd + Kibana | K8s logging | Cloud-native, lightweight |
| Loki Stack | Loki + Promtail + Grafana | Lightweight logging | Cost-effective, Prometheus-like |
| Jaeger | Jaeger + OpenTelemetry | Distributed tracing | End-to-end tracing, OpenTracing support |
Key Concepts
- SLI/SLO/SLA: Service Level Indicators/Objectives/Agreements
- Golden Signals: Latency, Traffic, Errors, Saturation
- RED Method: Rate, Errors, Duration (for services)
- USE Method: Utilization, Saturation, Errors (for resources)
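Converting an SLO availability target into an error budget is simple arithmetic, and seeing the numbers makes the vocabulary concrete. A small self-contained Go sketch (a 30-day month is assumed; the targets are illustrative):

```go
package main

import "fmt"

// errorBudgetMinutes returns how many minutes per 30-day month a service
// may be unavailable while still meeting the given availability target.
func errorBudgetMinutes(target float64) float64 {
	const monthMinutes = 30 * 24 * 60 // 43200 minutes in a 30-day month
	return (1 - target) * monthMinutes
}

func main() {
	for _, slo := range []float64{0.99, 0.999, 0.9999} {
		fmt.Printf("SLO %.2f%% -> error budget %.1f min/month\n",
			slo*100, errorBudgetMinutes(slo))
	}
}
```

Note how each extra "nine" divides the budget by ten: 99.9% leaves about 43 minutes of downtime a month, 99.99% only about 4.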
Prometheus Monitoring
Why Prometheus?
Prometheus is the de facto standard for Kubernetes monitoring. It uses a pull-based model to scrape metrics from HTTP endpoints, stores them as time-series data, and provides PromQL for powerful querying.
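Under the pull model, an application only has to serve plain text in the Prometheus exposition format; the server does the scraping on its own schedule. A stdlib-only sketch of that format (the metric name and value are illustrative; real applications use the official client library, as shown later in this section):

```go
package main

import "fmt"

// expositionText renders one counter in the Prometheus text exposition
// format: HELP and TYPE comments followed by the sample itself.
func expositionText(requests int) string {
	return fmt.Sprintf(
		"# HELP demo_requests_total Total requests handled.\n"+
			"# TYPE demo_requests_total counter\n"+
			"demo_requests_total{method=\"GET\"} %d\n", requests)
}

func main() {
	// In a real app this text would be served at /metrics for Prometheus to pull
	fmt.Print(expositionText(1027))
}
```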
Installing Prometheus
# Add Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (includes Prometheus, Grafana, Alertmanager)
# Note: the Grafana password below is for demos only - use a Secret in production
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=admin123 \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=10Gi

# Verify installation
kubectl get pods -n monitoring
kubectl get svc -n monitoring

# Port-forward to access UIs
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093
ServiceMonitor Configuration
A ServiceMonitor tells the Prometheus Operator which Services to scrape, on which port, and how often. It selects Services (not Pods) by label, so the application needs a Service exposing the metrics port:
# Application with Prometheus metrics endpoint
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
      - name: app
        image: myapp:latest
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 9090
          name: metrics
---
# Service exposing the metrics port (the ServiceMonitor matches this Service's labels)
apiVersion: v1
kind: Service
metadata:
  name: sample-app
  labels:
    app: sample-app
spec:
  selector:
    app: sample-app
  ports:
  - name: metrics
    port: 9090
    targetPort: metrics
---
# ServiceMonitor to scrape metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app-monitor
  labels:
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: sample-app
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    honorLabels: true
Custom Metrics in Applications
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter - only goes up (e.g., total requests)
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	// Gauge - can go up or down (e.g., active connections)
	activeConnections = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)

	// Histogram - distribution of values (e.g., latency buckets)
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latencies in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(requestsTotal)
	prometheus.MustRegister(activeConnections)
	prometheus.MustRegister(requestDuration)
}

// handler shows the metrics in use: track active connections, count the
// request, and observe its duration
func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	activeConnections.Inc()
	defer activeConnections.Dec()

	w.WriteHeader(http.StatusOK)

	requestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
	requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(time.Since(start).Seconds())
}

func main() {
	// Application traffic on :8080 (the "http" container port)
	appMux := http.NewServeMux()
	appMux.HandleFunc("/", handler)
	go func() { log.Fatal(http.ListenAndServe(":8080", appMux)) }()

	// Metrics on :9090 (the "metrics" container port the ServiceMonitor scrapes)
	metricsMux := http.NewServeMux()
	metricsMux.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", metricsMux))
}
PromQL Queries
CPU Usage
rate(container_cpu_usage_seconds_total[5m]) * 100
CPU usage over the last 5 minutes, as a percentage of one core
Memory Usage
container_memory_working_set_bytes
/ container_spec_memory_limit_bytes * 100
Memory usage as a percentage of the container limit
Request Rate
sum(rate(http_requests_total[5m])) by (service)
Requests per second, by service
Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
5xx errors as a percentage of all requests
P95 Latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
95th percentile latency
Pod Restarts
increase(kube_pod_container_status_restarts_total[1h])
Container restarts in the last hour
Alert Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  groups:
  - name: app.rules
    interval: 30s
    rules:
    # High CPU Usage
    - alert: HighCPUUsage
      expr: |
        (sum(rate(container_cpu_usage_seconds_total[5m])) by (pod, namespace) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on pod {{ $labels.pod }}"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} CPU above 80%"
    # High Memory Usage
    - alert: HighMemoryUsage
      expr: |
        (container_memory_working_set_bytes / container_spec_memory_limit_bytes) * 100 > 90
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on pod {{ $labels.pod }}"
    # Pod Crash Looping
    - alert: PodCrashLooping
      expr: |
        increase(kube_pod_container_status_restarts_total[15m]) > 3
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
    # High Error Rate
    - alert: HighErrorRate
      expr: |
        (sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
        / sum(rate(http_requests_total[5m])) by (service)) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate for {{ $labels.service }}"
Prometheus Best Practices
- Use appropriate metric types (Counter, Gauge, Histogram, Summary)
- Keep cardinality low - avoid high-cardinality labels
- Use recording rules for frequently-used complex queries
- Set appropriate retention periods based on storage capacity
- Use federation for multi-cluster monitoring
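One of the bullets above recommends recording rules for frequently-used complex queries. A minimal PrometheusRule sketch (rule and metric names are illustrative) that precomputes a per-service request rate, so dashboards and alerts can query the cheap recorded series instead of re-evaluating the rate() each time:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
spec:
  groups:
  - name: recording.rules
    interval: 1m
    rules:
    # Recorded series follow the level:metric:operations naming convention
    - record: service:http_requests:rate5m
      expr: sum(rate(http_requests_total[5m])) by (service)
```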
Grafana Dashboards
Dashboards as Code
Store your Grafana dashboards as JSON in ConfigMaps so they are version-controlled and automatically provisioned when the cluster is rebuilt.
Dashboard Configuration
{
  "dashboard": {
    "title": "Kubernetes Application Dashboard",
    "panels": [
      {
        "id": 1,
        "title": "CPU Usage",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$namespace\"}[5m])) by (pod) * 100",
          "legendFormat": "{{pod}}"
        }]
      },
      {
        "id": 2,
        "title": "Memory Usage",
        "type": "graph",
        "targets": [{
          "expr": "sum(container_memory_working_set_bytes{namespace=\"$namespace\"}) by (pod) / 1024 / 1024",
          "legendFormat": "{{pod}}"
        }]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "singlestat",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }],
        "thresholds": "1,5",
        "colors": ["green", "yellow", "red"]
      }
    ],
    "templating": {
      "list": [{
        "name": "namespace",
        "type": "query",
        "query": "label_values(kube_pod_info, namespace)"
      }]
    },
    "refresh": "30s"
  }
}
ConfigMap for Dashboard Provisioning
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  k8s-cluster-dashboard.json: |
    { /* Dashboard JSON here */ }
---
# Data sources configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://prometheus-kube-prometheus-prometheus:9090
      isDefault: true
    - name: Loki
      type: loki
      url: http://loki:3100
    - name: Jaeger
      type: jaeger
      url: http://jaeger-query:16686
Grafana Tips
- Use variables for dynamic dashboards
- Import community dashboards from grafana.com
- Set up alert notifications (Slack, PagerDuty, etc.)
- Use annotations to mark deployments and incidents
Centralized Logging
EFK Stack Setup
# Elasticsearch StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch  # assumes a matching headless Service
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
      - name: init-sysctl
        image: busybox
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        securityContext:
          privileged: true
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
        ports:
        - containerPort: 9200
          name: rest
        - containerPort: 9300
          name: transport
        env:
        - name: cluster.name
          value: k8s-logs
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # Multi-node ES 7.x needs discovery settings to form a cluster
        - name: discovery.seed_hosts
          value: "elasticsearch"
        - name: cluster.initial_master_nodes
          value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
        - name: ES_JAVA_OPTS
          value: "-Xms1g -Xmx1g"
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 30Gi
---
# Fluentd DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: containers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: containers
        hostPath:
          path: /var/lib/docker/containers
---
# Kibana Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:7.15.0
        ports:
        - containerPort: 5601
        env:
        - name: ELASTICSEARCH_HOSTS
          value: "http://elasticsearch:9200"
Fluentd Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    # Tail container logs
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    # Enrich with Kubernetes metadata
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>

    # Output to Elasticsearch
    <match **>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      logstash_format true
      logstash_prefix k8s
      <buffer>
        @type memory
        flush_interval 5s
        chunk_limit_size 2M
      </buffer>
    </match>
Loki Stack (Lightweight Alternative)
# Install Loki Stack with Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki-stack \
  --namespace logging \
  --create-namespace \
  --set grafana.enabled=false \
  --set prometheus.enabled=false \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi \
  --set promtail.enabled=true

# Verify installation
kubectl get pods -n logging

# Add Loki data source in Grafana
# URL: http://loki.logging.svc.cluster.local:3100
Logging Considerations
- Always configure log rotation and retention policies
- Never log sensitive information (passwords, tokens, PII)
- Use structured logging (JSON) for easier parsing and searching
- Implement log sampling for high-volume applications
- Consider storage costs for long-term retention
Distributed Tracing
Why Distributed Tracing?
In microservices architectures, a single user request may touch dozens of services. Distributed tracing follows a request end-to-end, showing exactly where time is spent and where errors occur.
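Tracing hinges on propagating a trace context across service hops, usually as the W3C traceparent HTTP header: every service forwards the same trace ID while minting a new span ID for its own work. A stdlib-only sketch of the header format (the IDs below are the examples from the W3C spec; real services use an OpenTelemetry propagator rather than hand-rolling this):

```go
package main

import (
	"fmt"
	"strings"
)

// buildTraceparent assembles a W3C traceparent value: version-traceID-spanID-flags.
func buildTraceparent(traceID, spanID string) string {
	return fmt.Sprintf("00-%s-%s-01", traceID, spanID)
}

// parseTraceparent extracts the trace and span IDs from an incoming header,
// so a downstream service can join the same trace.
func parseTraceparent(header string) (traceID, spanID string, ok bool) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false
	}
	return parts[1], parts[2], true
}

func main() {
	h := buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
	traceID, spanID, ok := parseTraceparent(h)
	fmt.Println(h, traceID, spanID, ok)
}
```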
Jaeger Installation
# Install Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.37.0/jaeger-operator.yaml -n observability
---
# Jaeger Instance
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: jaeger
namespace: observability
spec:
strategy: production
storage:
type: elasticsearch
options:
es:
server-urls: http://elasticsearch:9200
agent:
strategy: DaemonSet
collector:
replicas: 2
autoscale: true
maxReplicas: 5
query:
replicas: 2
OpenTelemetry Integration
package main

import (
	"context"
	"log"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

func initTracer() func() {
	// Create Jaeger exporter
	exp, err := jaeger.New(jaeger.WithCollectorEndpoint(
		jaeger.WithEndpoint("http://jaeger-collector:14268/api/traces"),
	))
	if err != nil {
		log.Fatal(err)
	}

	// Create and register trace provider
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp)

	return func() { _ = tp.Shutdown(context.Background()) }
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(r.Context(), "handleRequest",
		trace.WithAttributes(
			attribute.String("http.method", r.Method),
			attribute.String("http.url", r.URL.String()),
		),
	)
	defer span.End()

	// Child span for database call
	_, dbSpan := tracer.Start(ctx, "database.query")
	// ... do DB work ...
	dbSpan.End()

	w.WriteHeader(http.StatusOK)
}

func main() {
	cleanup := initTracer()
	defer cleanup()

	http.HandleFunc("/", handleRequest)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Tracing Best Practices
- Use sampling to reduce overhead (1-10% in production)
- Add meaningful span attributes and events
- Implement context propagation across service boundaries
- Correlate traces with logs and metrics for full observability
- Set up trace-based alerting for SLO compliance
Practice Problems
Medium: Deploy a Complete Observability Stack
Deploy a full observability stack for a microservices application with Prometheus for metrics, EFK for logging, and Jaeger for tracing.
Start by creating separate namespaces (monitoring, logging, tracing). Use Helm for Prometheus and Loki. Deploy Elasticsearch as a StatefulSet. Wire up ServiceMonitors for metric scraping.
# 1. Create namespaces
kubectl create namespace monitoring
kubectl create namespace logging
kubectl create namespace tracing

# 2. Prometheus Stack (includes Grafana)
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

# 3. EFK Stack
# Deploy Elasticsearch, Fluentd DaemonSet, and Kibana
# (use manifests from the Centralized Logging section above)

# 4. Jaeger
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.37.0/jaeger-operator.yaml -n tracing

# 5. Wire up data sources in Grafana
# Prometheus: http://prometheus-kube-prometheus-prometheus:9090
# Elasticsearch: http://elasticsearch.logging:9200
# Jaeger: http://jaeger-query.tracing:16686
Medium: Implement SLO Monitoring
Define SLIs for a web service (availability, latency, error rate), create SLO targets (99.9% availability), and implement error budget tracking with multi-window alerts.
Use Prometheus recording rules to calculate SLIs over different windows (5m, 30m, 1h). The error budget is: 1 - ((1 - actual_availability) / (1 - slo_target)). Use multi-window alerting so fast burns trigger faster alerts.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-rules
  namespace: monitoring
spec:
  groups:
  - name: slo.rules
    rules:
    # Availability SLI (short window)
    - record: sli:availability:ratio_rate5m
      expr: |
        sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
        / sum(rate(http_requests_total[5m])) by (service)
    # Availability SLI (long window, needed by the multi-window alert below)
    - record: sli:availability:ratio_rate1h
      expr: |
        sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
        / sum(rate(http_requests_total[1h])) by (service)
    # Error budget remaining
    - record: error_budget:remaining
      expr: |
        1 - ((1 - sli:availability:ratio_rate5m) / (1 - 0.999))
    # Multi-window alert
    - alert: SLOAvailabilityBreach
      expr: |
        sli:availability:ratio_rate5m < 0.999
        and sli:availability:ratio_rate1h < 0.999
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "SLO breach for {{ $labels.service }}"
Hard: Multi-Cluster Observability
Design and implement observability for a multi-cluster Kubernetes deployment with Prometheus federation, centralized logging, cross-cluster tracing, and Thanos for long-term metric storage.
Use Thanos sidecar on each cluster's Prometheus to upload blocks to object storage. A central Thanos Querier aggregates data from all clusters. For logging, ship logs from each cluster's Fluentd to a centralized Elasticsearch cluster. Use Jaeger with shared storage for cross-cluster trace correlation.
# Thanos sidecar for each cluster's Prometheus
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  thanos:
    image: quay.io/thanos/thanos:v0.30.2
    version: v0.30.2
    objectStorageConfig:
      name: thanos-objstore-config
      key: thanos.yaml
---
# Central Thanos Querier
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-querier
  template:
    metadata:
      labels:
        app: thanos-querier
    spec:
      containers:
      - name: thanos-query
        image: quay.io/thanos/thanos:v0.30.2
        args:
        - query
        - --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc
        - --store=cluster-a-prometheus:10901
        - --store=cluster-b-prometheus:10901
Quick Reference
Essential Commands
| Task | Command |
|---|---|
| Check Prometheus targets | kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090 -n monitoring |
| View pod logs | kubectl logs -f deployment/app --all-containers |
| Check cluster events | kubectl get events --sort-by='.lastTimestamp' |
| Resource usage | kubectl top pods --sort-by=cpu |
| List ServiceMonitors | kubectl get servicemonitors -n monitoring |
The Four Golden Signals
Latency
Time to service a request. Track both successful and failed request latencies separately.
Traffic
Demand on the system: HTTP requests/sec, transactions/sec, or sessions.
Errors
Rate of failed requests: explicit (5xx), implicit (wrong content), or policy-based (> 1s latency).
Saturation
How "full" the service is: CPU, memory, I/O, queue depth. Signals capacity limits before failure.
Remember
Good observability is not about collecting everything - it is about collecting the right signals and making them actionable. Start with the Golden Signals, add custom metrics for your business logic, and build dashboards that help you answer questions during incidents.