Observability Fundamentals
What is Observability?
Observability is the ability to infer a system's internal state from its external outputs. For AI systems, this means tracking model performance, data quality, and overall system health.
Key Metrics for AI Systems
Essential metrics to track for AI/ML systems in production.
- 🎯 Model Metrics: Accuracy, Precision, Recall, F1-Score
- ⚡ Performance: Latency, Throughput, Response Time
- 💻 System: CPU, Memory, GPU Utilization
- 📈 Business: User Engagement, Conversion Rate
- ⚠️ Drift: Data Drift, Concept Drift, Model Decay
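The model metrics listed above all derive from confusion-matrix counts; a minimal sketch in plain Python (the counts are illustrative):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute core model metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Example window: 80 TP, 10 FP, 20 FN, 90 TN
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
```

Emitting these periodically over a sliding window of labeled predictions is what turns offline evaluation metrics into production monitoring signals.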
Observable Architecture
Design patterns for building observable AI systems from the ground up.
Essential Tools
Core tools and platforms for observability in AI systems.
- 📊 Metrics: Prometheus, Grafana, DataDog
- 📝 Logging: ELK Stack, Splunk, CloudWatch
- 🔍 Tracing: Jaeger, Zipkin, X-Ray
- 🤖 ML-Specific: MLflow, Weights & Biases, Neptune
- 🚨 Alerting: PagerDuty, Opsgenie, VictorOps
Best Practices
Guidelines for implementing effective observability in production AI systems.
Getting Started
Step-by-step guide to implementing observability in your AI system.
- Define key metrics and SLIs for your system
- Implement structured logging with correlation IDs
- Set up distributed tracing for request flows
- Create dashboards for real-time monitoring
- Configure alerts for critical metrics
- Establish runbooks for incident response
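Step 1 above can be made concrete. A minimal availability SLI with an illustrative SLO target (the 99.9% figure and request counts are assumptions, not a recommendation):

```python
def availability_sli(success_count, total_count):
    """SLI: fraction of requests served successfully in a window."""
    return success_count / total_count if total_count else 1.0

# Illustrative SLO: 99.9% of requests succeed over the window.
SLO_TARGET = 0.999
sli = availability_sli(success_count=99_950, total_count=100_000)

# Fraction of the error budget already burned this window.
error_budget_used = (1 - sli) / (1 - SLO_TARGET)
```

Tracking error-budget burn rather than the raw SLI tends to make alerting decisions (step 5) less noisy, since it measures distance from the target you actually committed to.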
System Health Dashboard
Monitoring & Metrics
Metrics Collection
Implement comprehensive metrics collection for AI systems.
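To show the mechanics, here is a tiny in-process metrics store that renders the Prometheus text exposition format. This is a sketch for illustration; in production you would use a client library such as prometheus_client rather than hand-rolling this:

```python
from collections import defaultdict

class MetricsRegistry:
    """Minimal in-process metrics store (illustrative, not for production)."""

    def __init__(self):
        self._counters = defaultdict(float)
        self._gauges = {}

    def inc(self, name, labels=(), amount=1.0):
        # labels is a tuple of (key, value) pairs
        self._counters[(name, tuple(sorted(labels)))] += amount

    def set_gauge(self, name, value):
        self._gauges[name] = value

    def expose(self):
        """Render metrics in the Prometheus text exposition format."""
        lines = []
        for (name, labels), value in sorted(self._counters.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}" if label_str
                         else f"{name} {value}")
        for name, value in sorted(self._gauges.items()):
            lines.append(f"{name} {value}")
        return "\n".join(lines)

reg = MetricsRegistry()
reg.inc("predictions_total", labels=(("model", "fraud-v2"),))
reg.inc("predictions_total", labels=(("model", "fraud-v2"),))
reg.set_gauge("model_loaded", 1)
output = reg.expose()
```

A scraper like Prometheus would fetch this text from an HTTP endpoint (conventionally `/metrics`) on each scrape interval.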
Custom Metrics
Define and track custom metrics specific to your AI use case.
Data Drift Detection
Monitor and detect data drift in production ML systems.
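One widely used drift statistic is the Population Stability Index (PSI), which compares a feature's binned distribution in production against the training baseline. A stdlib-only sketch, with illustrative bin proportions:

```python
import math

def psi(expected, actual):
    """Population Stability Index between two binned distributions.

    Inputs are lists of bin proportions that each sum to 1.
    A common rule of thumb: PSI > 0.2 suggests significant drift.
    """
    eps = 1e-6  # avoid log(0) on empty bins
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

baseline = [0.25, 0.25, 0.25, 0.25]   # feature distribution at training time
current = [0.10, 0.20, 0.30, 0.40]    # feature distribution in production
score = psi(baseline, current)
```

The example shift yields a PSI of roughly 0.23, past the conventional 0.2 alarm threshold; identical distributions give 0. Computing this per feature on a schedule is a simple first line of drift monitoring.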
Performance Monitoring
Track system performance and resource utilization.
Dashboard Creation
Build effective dashboards for monitoring AI systems.
Alerting Rules
Configure intelligent alerts for AI system issues.
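As one concrete form, Prometheus alerting rules express thresholds declaratively. The metric names (`prediction_latency_seconds`, `data_drift_psi`) and thresholds below are illustrative assumptions, not standard names:

```yaml
groups:
  - name: ml-service-alerts
    rules:
      - alert: HighPredictionLatency
        # p99 latency over the last 5 minutes exceeds 500 ms
        expr: histogram_quantile(0.99, sum by (le) (rate(prediction_latency_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p99 prediction latency above 500 ms for 10 minutes"
      - alert: FeatureDriftDetected
        # assumes the service exports a per-feature PSI gauge
        expr: data_drift_psi > 0.2
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "PSI above 0.2 on feature {{ $labels.feature }}"
```

The `for` clause requires the condition to hold continuously before firing, which suppresses transient spikes.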
Logging & Events
Structured Logging
Implement structured logging for better searchability and analysis.
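A minimal sketch using only the standard library: a JSON formatter plus a correlation ID so every log line for one request can be joined later. The field names are illustrative:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # set via the `extra` argument when logging
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("ml-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One correlation ID per request, attached to every log line it produces.
corr_id = str(uuid.uuid4())
logger.info("prediction served", extra={"correlation_id": corr_id})
```

Because each line is valid JSON, aggregators such as the ELK Stack can index every field without fragile regex parsing.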
Log Aggregation
Centralize logs from distributed AI systems for analysis.
Event Streaming
Stream and process events in real-time for immediate insights.
Audit Logging
Maintain comprehensive audit trails for compliance and debugging.
Log Analysis
Analyze logs to extract insights and detect anomalies.
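One simple anomaly-detection technique over aggregated logs is a z-score on windowed error counts; windows far from the mean get flagged. A stdlib sketch with an illustrative threshold:

```python
import statistics

def anomalous_windows(error_counts, threshold=2.0):
    """Flag time windows whose error count deviates from the mean
    by more than `threshold` standard deviations (z-score)."""
    mean = statistics.mean(error_counts)
    stdev = statistics.stdev(error_counts)
    if stdev == 0:
        return []  # perfectly flat series: nothing stands out
    return [
        i for i, count in enumerate(error_counts)
        if abs(count - mean) / stdev > threshold
    ]

# Hourly error counts parsed from aggregated logs; hour 5 spikes.
counts = [12, 9, 11, 10, 13, 95, 10, 12]
flagged = anomalous_windows(counts)
```

The z-score approach assumes roughly stationary traffic; for strongly seasonal workloads you would compare against the same hour in prior days instead.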
Distributed Tracing
Trace Implementation
Implement distributed tracing for request flow visibility.
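The core object of tracing is the span: a named, timed unit of work with a parent link. A stdlib stand-in for a real tracer (in practice you would use OpenTelemetry, which this sketch only imitates):

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # collected spans; a real tracer would export these

@contextmanager
def span(name, parent_id=None):
    """Record one timed span; a toy stand-in for an OpenTelemetry tracer."""
    span_id = uuid.uuid4().hex[:16]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        SPANS.append({
            "name": name,
            "span_id": span_id,
            "parent_id": parent_id,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

# Nest spans to reflect the call structure of one request.
with span("handle_request") as req_id:
    with span("load_features", parent_id=req_id):
        time.sleep(0.01)
    with span("model_inference", parent_id=req_id):
        time.sleep(0.02)
```

Child spans close before their parent, so they are recorded first; the parent's duration covers everything nested inside it.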
Context Propagation
Propagate trace context across service boundaries.
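The standard wire format for this is the W3C Trace Context `traceparent` HTTP header: `version-trace_id-parent_span_id-flags`. A stdlib sketch of building and parsing it:

```python
import re
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a W3C traceparent header value."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    m = re.fullmatch(
        r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})",
        header,
    )
    if not m:
        raise ValueError("malformed traceparent")
    _version, trace_id, span_id, flags = m.groups()
    return {"trace_id": trace_id, "parent_span_id": span_id,
            "sampled": flags == "01"}

# Upstream service creates the header; downstream continues the same
# trace with a fresh span ID but the inherited trace ID.
header = make_traceparent()
ctx = parse_traceparent(header)
child_header = make_traceparent(trace_id=ctx["trace_id"])
```

Tracing libraries do this injection and extraction automatically at HTTP and messaging boundaries; the point here is only that the trace ID must survive the hop while the span ID changes.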
Trace Analysis
Analyze traces to identify bottlenecks and optimize performance.
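A useful bottleneck heuristic is self time: a span's duration minus the time spent in its children. The span with the largest self time is where the request actually burns its latency. A sketch over hand-written spans with illustrative durations:

```python
def self_times(spans):
    """Self time per span: own duration minus time spent in children.
    The span with the largest self time is the likeliest bottleneck."""
    child_total = {}
    for s in spans:
        if s["parent"] is not None:
            child_total[s["parent"]] = (
                child_total.get(s["parent"], 0) + s["duration_ms"]
            )
    return {s["name"]: s["duration_ms"] - child_total.get(s["name"], 0)
            for s in spans}

trace = [
    {"name": "handle_request", "parent": None, "duration_ms": 120},
    {"name": "load_features", "parent": "handle_request", "duration_ms": 15},
    {"name": "model_inference", "parent": "handle_request", "duration_ms": 95},
]
times = self_times(trace)
bottleneck = max(times, key=times.get)
```

Here `handle_request` looks slow (120 ms) but has only 10 ms of self time; the real cost sits in `model_inference`. This assumes sequential children; overlapping child spans would need interval arithmetic instead of a plain sum.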
Tools & Platforms
Prometheus + Grafana
Open-source monitoring and visualization stack.
ELK Stack
Elasticsearch, Logstash, and Kibana for log management.
Jaeger
Distributed tracing platform for microservices.
ML-Specific Tools
Specialized tools for ML observability.
- MLflow: Experiment tracking and model registry
- Weights & Biases: ML experiment tracking
- Neptune.ai: Metadata store for ML
- Evidently: ML monitoring and testing
- WhyLabs: ML observability platform
Cloud Solutions
Managed observability services from cloud providers and SaaS vendors.
- AWS: CloudWatch, X-Ray, OpenSearch
- GCP: Cloud Monitoring, Cloud Logging, Cloud Trace
- Azure: Monitor, Application Insights, Log Analytics
- DataDog: Full-stack observability platform
- New Relic: Application performance monitoring
Setup Guide
Quick setup for a complete observability stack.
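For local experimentation, the Prometheus + Grafana + Jaeger stack above can be brought up with Docker Compose. A minimal sketch assuming the standard public images and their default ports; the referenced `prometheus.yml` must point its scrape config at your service's metrics endpoint:

```yaml
services:
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]   # UI; default login admin/admin
  jaeger:
    image: jaegertracing/all-in-one
    ports:
      - "16686:16686"      # Jaeger UI
      - "4318:4318"        # OTLP/HTTP trace ingest
```

After `docker compose up`, point Grafana at Prometheus as a data source and send traces to the OTLP endpoint to see all three pillars working together.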
Practice & Exercises
Exercise 1: Implement Metrics
Add comprehensive metrics to an ML service.
Exercise 2: Add Tracing
Implement distributed tracing for an ML pipeline.
Exercise 3: Create Dashboard
Build a monitoring dashboard for your AI system.