Lifecycle Phases
1. Problem Definition
Identify business objectives, define success metrics, and determine if ML is the right solution.
2. Data Collection & Preparation
Gather, clean, and organize data. Implement feature engineering and data validation pipelines.
3. Model Development
Select algorithms, train models, and optimize hyperparameters using cross-validation.
4. Model Evaluation
Assess model performance using multiple metrics and validate against business requirements.
5. Model Deployment
Deploy model to production environment with proper versioning and rollback capabilities.
6. Monitoring & Maintenance
Track model performance, detect drift, and implement retraining pipelines.
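The development and evaluation phases above can be sketched as a single loop. Below is a minimal pure-Python illustration of k-fold cross-validation; the `train_fn` and `score_fn` callables are placeholders for a real model's fit and metric functions, not any specific library's API:

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(train_fn, score_fn, X, y, k=5):
    """Train on k-1 folds, score on the held-out fold, return the mean score."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        held_out = set(folds[i])
        train_idx = [j for j in range(len(X)) if j not in held_out]
        model = train_fn([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score_fn(model,
                               [X[j] for j in folds[i]],
                               [y[j] for j in folds[i]]))
    return sum(scores) / k
```

In practice a library such as scikit-learn provides this machinery, but the fold/train/score structure is the same.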
MLOps & DevOps
Version Control
Track code, data, and model versions for reproducibility and collaboration.
CI/CD Pipelines
Automate testing, validation, and deployment of ML models.
Containerization
Package models with dependencies for consistent deployment across environments.
Experiment Tracking
Log experiments, compare results, and manage model registry.
Security & Compliance
Implement security best practices and ensure regulatory compliance.
Infrastructure as Code
Define and manage ML infrastructure using code for scalability.
Best Practices
- Automate everything: training, testing, deployment
- Version control code, data, and models
- Monitor model performance continuously
- Implement gradual rollout strategies
- Maintain reproducibility across environments
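The experiment-tracking practice above can be made concrete with a toy in-memory tracker. This illustrates the pattern behind tools like MLflow or Weights & Biases; it is not any real library's API:

```python
import time

class ExperimentTracker:
    """Minimal experiment log: parameters, metrics, and a model-version tag.
    A toy sketch of the log-compare-register workflow, not a real API."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, model_version):
        run = {"time": time.time(), "params": params,
               "metrics": metrics, "model_version": model_version}
        self.runs.append(run)
        return run

    def best_run(self, metric, higher_is_better=True):
        """Return the run with the best value for the given metric."""
        key = lambda r: r["metrics"][metric]
        return max(self.runs, key=key) if higher_is_better else min(self.runs, key=key)
```

Real trackers add persistence, UI comparison, and artifact storage, but the core record is the same triple of params, metrics, and version.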
Tools & Platforms
Cloud Platforms
Comprehensive ML services from major cloud providers.
- AWS SageMaker: End-to-end ML platform
- Google Vertex AI: Unified ML platform
- Azure ML: Enterprise ML service
- IBM Watson: AI and ML tools
Experiment Tracking
Tools for tracking experiments and managing models.
- MLflow: Open-source platform
- Weights & Biases: Experiment tracking
- Neptune.ai: Metadata store
- Comet ML: Model management
Orchestration
Workflow orchestration and pipeline management tools.
- Apache Airflow: Workflow management
- Kubeflow: K8s ML workflows
- Prefect: Modern dataflow automation
- Dagster: Data orchestrator
Data Management
Tools for data versioning and feature management.
- DVC: Data version control
- Feast: Feature store
- Tecton: Feature platform
- Great Expectations: Data validation
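A validation step of the kind Great Expectations provides can be sketched in plain Python. The schema format below is invented for illustration and is not a real library interface:

```python
def validate_rows(rows, schema):
    """Check each record against simple column expectations.

    schema maps column name -> (expected_type, allow_null).
    Returns a list of human-readable error strings (empty if all pass).
    """
    errors = []
    for i, row in enumerate(rows):
        for col, (typ, allow_null) in schema.items():
            value = row.get(col)
            if value is None:
                if not allow_null:
                    errors.append(f"row {i}: {col} is null")
            elif not isinstance(value, typ):
                errors.append(f"row {i}: {col} expected {typ.__name__}")
    return errors
```

Running checks like this at the pipeline boundary catches schema breaks before they silently degrade a model.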
Model Serving
Platforms for deploying and serving ML models.
- TensorFlow Serving: TF model serving
- TorchServe: PyTorch model serving
- Seldon Core: K8s ML deployment
- BentoML: Model packaging
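At their core, the serving platforms above all wrap the same JSON-in/JSON-out contract. A minimal handler sketch with a stand-in linear model (the model, weights, and version string are all hypothetical):

```python
import json

# Hypothetical model: coefficients for a linear score (illustration only).
MODEL = {"version": "1.2.0", "weights": [0.4, -0.2], "bias": 0.1}

def predict(features):
    """Dot product plus bias; stands in for a real model's forward pass."""
    return sum(w * x for w, x in zip(MODEL["weights"], features)) + MODEL["bias"]

def handle_request(body: str) -> str:
    """JSON-in/JSON-out handler, the core of any serving endpoint.
    Echoes the model version so clients can audit which model answered."""
    try:
        features = json.loads(body)["features"]
        return json.dumps({"prediction": predict(features),
                           "model_version": MODEL["version"]})
    except (KeyError, ValueError, TypeError) as exc:
        return json.dumps({"error": str(exc)})
```

Returning the model version with every prediction is what makes rollback and A/B analysis traceable downstream.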
Monitoring
Tools for monitoring model performance and drift.
- Evidently AI: ML monitoring
- WhyLabs: Model observability
- Arize: ML observability platform
- Prometheus: Metrics monitoring
Deployment Strategies
Blue-Green Deployment
Deploy new version alongside the old, then switch traffic instantly.
Canary Deployment
Gradually roll out new model to a small percentage of users.
Rolling Deployment
Update instances one at a time with zero downtime.
Serverless Deployment
Deploy models as serverless functions for automatic scaling.
Edge Deployment
Deploy models to edge devices for low-latency inference.
A/B Testing
Compare model versions to determine the best performer.
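Canary rollouts and A/B tests both rely on deterministic traffic splitting. One common approach, sketched here, hashes a stable user identifier into a bucket so the same user always sees the same version:

```python
import hashlib

def route_model(user_id: str, canary_fraction: float) -> str:
    """Deterministic canary routing: hash the user id to [0, 1) and send
    that fraction of users to the new model. A given user always gets the
    same version, which keeps A/B comparisons clean."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"
```

Raising `canary_fraction` gradually (e.g. 0.01 → 0.1 → 0.5 → 1.0) while watching the monitoring dashboards implements the gradual-rollout practice listed above.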
Deployment Checklist
- Model versioning and rollback plan ready
- API documentation and client libraries
- Load testing and performance benchmarks
- Monitoring and alerting configured
- Security audit and compliance check
- Disaster recovery plan in place
Monitoring & Maintenance
Performance Monitoring
Track model accuracy, latency, and resource utilization.
Data Drift Detection
Monitor input data distribution changes over time.
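One widely used drift statistic is the Population Stability Index (PSI), which compares a live feature distribution against the training baseline. A self-contained sketch, with the conventional rule-of-thumb thresholds noted in the docstring:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample.
    Rule-of-thumb thresholds: PSI < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 major drift (conventions vary by team)."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # Smooth zero bins so the log-ratio stays finite.
        return [(c + 0.5) / (len(xs) + 0.5 * bins) for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Tools like Evidently AI compute PSI (among other statistics) per feature and alert when a threshold is crossed.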
Model Explainability
Understand and explain model predictions for transparency.
Alerting System
Set up alerts for model degradation and system issues.
Automated Retraining
Implement pipelines for automatic model retraining.
Audit Logging
Maintain comprehensive logs for debugging and compliance.
Practice Exercises
Exercise 1: Build a Pipeline
Create an end-to-end ML pipeline for a classification problem.
Exercise 2: Implement Monitoring
Add monitoring capabilities to track model performance.
Exercise 3: Deploy with Docker
Containerize and deploy your ML model using Docker.