Best Practices

Resource Management

Resources are the foundation of stable Kubernetes clusters

Every container should declare what it needs (requests) and what it can use at most (limits). This lets the scheduler make smart placement decisions and prevents noisy neighbors.

Resource Requests and Limits

CPU and Memory Configuration

deployment-resources.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:1.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"

Quality of Service Classes

qos-classes.yaml
# Guaranteed QoS (highest priority)
resources:
  requests:
    memory: "1Gi"
    cpu: "1"
  limits:
    memory: "1Gi"
    cpu: "1"

# Burstable QoS (medium priority)
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "1"

# BestEffort QoS (lowest priority - avoid in production)
# No resources specified

Resource Quotas

resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "10"
    services.loadbalancers: "2"
    services.nodeports: "0"  # Disable NodePort in production

---
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
  namespace: production
spec:
  limits:
  - default:
      memory: 512Mi
      cpu: 500m
    defaultRequest:
      memory: 256Mi
      cpu: 250m
    max:
      memory: 2Gi
      cpu: 2
    min:
      memory: 128Mi
      cpu: 100m
    type: Container

Resource Management Pitfalls

  • Never run containers without resource limits in production
  • Avoid BestEffort QoS for critical workloads
  • Don't set CPU limits too low; overly tight limits cause throttling
  • Don't set memory limits too low; the kernel OOM-kills the container
  • Always test resource settings under load

Health Checks

Three probes, three purposes

Kubernetes uses probes to know if your container is alive (liveness), ready for traffic (readiness), or still starting up (startup). Getting these right is critical for zero-downtime deployments.

Probe Configuration

Liveness Probe

liveness.yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
    - name: Custom-Header
      value: Awesome
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 3
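
httpGet is only one of the probe handler types; the same timing fields apply to tcpSocket and exec handlers. A sketch (the ports, paths, and commands below are illustrative):

```yaml
# TCP probe: succeeds when the port accepts a connection
livenessProbe:
  tcpSocket:
    port: 6379
  periodSeconds: 10
  failureThreshold: 3

# Exec probe: succeeds while the command exits 0
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]
  periodSeconds: 10
  failureThreshold: 3
```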

Readiness Probe

readiness.yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  successThreshold: 1
  failureThreshold: 3

Startup Probe

startup.yaml
# For slow-starting containers
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10
  timeoutSeconds: 3
  successThreshold: 1
  failureThreshold: 30  # 30 * 10 = 300s max startup

Complete Health Check Example

health-checks-full.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: web
        image: webapp:2.0
        ports:
        - containerPort: 8080
          name: http

        # Startup probe for slow initialization
        startupProbe:
          httpGet:
            path: /startup
            port: http
          periodSeconds: 10
          failureThreshold: 30

        # Liveness probe to restart unhealthy containers
        livenessProbe:
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 0
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

        # Readiness probe to control traffic
        readinessProbe:
          httpGet:
            path: /ready
            port: http
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3

        # Graceful shutdown
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]
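
The preStop sleep has to fit inside the pod's termination grace period (30 seconds by default), together with whatever time the application itself needs to drain connections. If you lengthen the sleep, raise the grace period too. A sketch:

```yaml
spec:
  # Must cover the preStop hook plus the app's own shutdown time
  terminationGracePeriodSeconds: 45
  containers:
  - name: web
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 15"]
```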

Auto-Scaling

Horizontal Pod Autoscaler (HPA)

hpa.yaml
# HPA based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 4
        periodSeconds: 60
      selectPolicy: Max
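
CPU and memory are not the only signals the HPA understands. With a metrics adapter installed (for example prometheus-adapter), it can scale on application-level metrics as well; the metric name below is an assumption, not something exposed by default:

```yaml
# Scale on per-pod request rate
# (requires a custom metrics adapter; metric name is illustrative)
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
```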

Vertical Pod Autoscaler (VPA)

vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-deployment
  updatePolicy:
    updateMode: "Auto"  # or "Off" for recommendation only
  resourcePolicy:
    containerPolicies:
    - containerName: web
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
      controlledResources: ["cpu", "memory"]

Cluster Autoscaler

cluster-autoscaler.yaml
# Cluster autoscaler deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.21.0  # k8s.gcr.io is frozen
        name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --skip-nodes-with-system-pods=false

Reliability Patterns

Pod Disruption Budgets

pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  # OR use maxUnavailable
  # maxUnavailable: 1
  selector:
    matchLabels:
      app: web

  # Eviction policy for running-but-unhealthy pods (Kubernetes 1.27+)
  unhealthyPodEvictionPolicy: IfHealthyBudget

Anti-Affinity Rules

Pod Anti-Affinity

pod-anti-affinity.yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - web
      topologyKey: kubernetes.io/hostname
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        topologyKey: topology.kubernetes.io/zone  # failure-domain.beta.* is deprecated
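
A newer alternative for spreading replicas is topologySpreadConstraints, which expresses "keep the replica count balanced across zones" more directly than a list of anti-affinity terms. A sketch for the same `app: web` pods:

```yaml
topologySpreadConstraints:
- maxSkew: 1                            # zones may differ by at most one replica
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway     # soft; use DoNotSchedule for a hard constraint
  labelSelector:
    matchLabels:
      app: web
```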

Node Affinity

node-affinity.yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-type
          operator: In
          values:
          - production
        - key: kubernetes.io/arch
          operator: In
          values:
          - amd64
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      preference:
        matchExpressions:
        - key: disk-type
          operator: In
          values:
          - ssd

Priority Classes

priority-class.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
globalDefault: false
description: "Critical production workloads"

---
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: critical:1.0
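
By default, a pending high-priority pod may preempt (evict) lower-priority pods to make room. If you want the scheduling-queue ordering without evictions, set preemptionPolicy on the class; a sketch:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000
preemptionPolicy: Never    # queue ahead of lower priorities, but never evict
globalDefault: false
description: "High priority without preemption"
```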

Security Best Practices

Pod Security Standards

pod-security.yaml
# Pod Security Standards, enforced via namespace labels
# (PodSecurityPolicy is deprecated and removed in Kubernetes 1.25)
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

---
# Security Context
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:1.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    - name: cache
      mountPath: /app/cache
  volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}

Network Policies

network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: production
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
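
Network policies are additive: a pod is only restricted once some policy selects it. A common baseline is a default-deny policy for the whole namespace, with per-app policies like the one above whitelisting the traffic they need:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}     # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
```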

Security Checklist

  • Enable RBAC and use least privilege
  • Use Pod Security Standards (restricted)
  • Implement Network Policies
  • Scan images for vulnerabilities
  • Use private image registries
  • Enable audit logging
  • Rotate secrets regularly
  • Use service mesh for mTLS
  • Implement admission controllers
  • Regular security updates
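
For the RBAC item, least privilege usually means namespace-scoped Roles bound to dedicated service accounts. A sketch granting read-only access to pods (the names are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: production
subjects:
- kind: ServiceAccount
  name: app-sa
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
```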

Observability

Logging Best Practices

fluent-bit-config.yaml
# Application logging configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: logging-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Daemon       Off
        Log_Level    info

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [OUTPUT]
        Name   elasticsearch
        Match  *
        Host   elasticsearch.logging
        Port   9200
        Index  kubernetes
        Type   _doc

Metrics and Monitoring

service-monitor.yaml
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: production
spec:
  selector:
    matchLabels:
      app: web
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_label_version]
      targetLabel: version
    - sourceLabels: [__meta_kubernetes_pod_label_environment]
      targetLabel: environment
    metricRelabelings:
    - sourceLabels: [__name__]
      regex: '(request_duration_seconds.*|request_total)'
      action: keep

Distributed Tracing

otel-collector.yaml
# OpenTelemetry configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-config
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      resource:
        attributes:
        - key: cluster.name
          value: production
          action: insert

    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, resource]
          exporters: [jaeger]

CI/CD Best Practices

GitOps with ArgoCD

argocd-app.yaml
# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-config
    targetRevision: HEAD
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
    - Validate=true
    - CreateNamespace=false
    - PrunePropagationPolicy=foreground
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

Progressive Delivery

canary.yaml
# Flagger Canary Deployment
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-canary
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  progressDeadlineSeconds: 60
  service:
    port: 80
    targetPort: 8080
    gateways:
    - public-gateway.istio-system.svc.cluster.local
    hosts:
    - app.example.com
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    webhooks:
    - name: load-test
      url: http://loadtester/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://web-canary.production/"

CI/CD Best Practices Summary

  • Use GitOps for declarative deployments
  • Implement progressive delivery (canary, blue-green)
  • Automate testing in CI pipeline
  • Version everything (images, configs, charts)
  • Use semantic versioning
  • Implement rollback strategies
  • Separate config from code
  • Use Helm or Kustomize for templating
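
For the Kustomize route, the ArgoCD Application above points at an overlay directory; a minimal sketch of what that overlay might contain (paths and names are illustrative):

```yaml
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
- ../../base
patches:
- path: replica-patch.yaml
images:
- name: webapp
  newTag: "2.0.1"    # pin an exact tag, never latest
```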

Troubleshooting Guide

Common Issues and Solutions

CrashLoopBackOff
  Symptoms:  Pod repeatedly crashes and restarts
  Diagnosis: kubectl logs pod-name --previous
  Solution:  Fix application errors, check resources

ImagePullBackOff
  Symptoms:  Cannot pull container image
  Diagnosis: kubectl describe pod pod-name
  Solution:  Check image name, registry credentials

Pending Pods
  Symptoms:  Pods stuck in Pending state
  Diagnosis: kubectl describe pod pod-name
  Solution:  Check resources, node capacity, PVCs

OOMKilled
  Symptoms:  Container killed due to memory
  Diagnosis: kubectl describe pod pod-name
  Solution:  Increase memory limits, optimize app

Service not accessible
  Symptoms:  Cannot reach service endpoints
  Diagnosis: kubectl get endpoints service-name
  Solution:  Check selectors, readiness probes

Node NotReady
  Symptoms:  Node in NotReady state
  Diagnosis: kubectl describe node node-name
  Solution:  Check kubelet, disk pressure, memory

Debugging Commands

debug-commands.sh
# Pod debugging
kubectl logs pod-name -c container-name --previous
kubectl exec -it pod-name -- /bin/bash
kubectl debug pod-name -it --image=busybox --share-processes --copy-to=debug-pod

# Service debugging
kubectl port-forward service/my-service 8080:80
kubectl run curl --image=curlimages/curl -it --rm -- sh
# then, inside the pod:
nslookup my-service.default.svc.cluster.local

# Node debugging
kubectl get nodes -o wide
kubectl top nodes
kubectl describe node node-name
kubectl get events --sort-by='.lastTimestamp'

# Cluster debugging
kubectl cluster-info dump --output-directory=/tmp/cluster-dump
kubectl get componentstatuses  # deprecated since v1.19
kubectl get all --all-namespaces
kubectl api-resources --verbs=list --namespaced -o name | \
  xargs -n 1 kubectl get --show-kind --ignore-not-found -n namespace

Common Anti-patterns to Avoid

  • Using latest tag: Always use specific version tags
  • No resource limits: Set appropriate requests and limits
  • Single replica: Run at least 2 replicas for HA
  • No health checks: Always configure probes
  • Hardcoded configs: Use ConfigMaps and Secrets
  • Root containers: Run as non-root user
  • No PodDisruptionBudget: Define PDBs for critical apps
  • Ignoring labels: Use consistent labeling strategy
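
A consistent labeling strategy usually starts from the upstream app.kubernetes.io conventions, which tools such as Helm and kubectl understand (values here are illustrative):

```yaml
metadata:
  labels:
    app.kubernetes.io/name: web
    app.kubernetes.io/version: "1.2.3"
    app.kubernetes.io/component: frontend
    app.kubernetes.io/part-of: shop
    app.kubernetes.io/managed-by: helm
```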

Practice Exercises

Easy: Configure Resource Requests

Write a Deployment spec for a Node.js app that requests 128Mi memory and 100m CPU, with limits of 256Mi memory and 200m CPU.

Define the resources block under the container spec with both requests and limits sub-keys.

containers:
- name: node-app
  image: node-app:1.0
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      memory: "256Mi"
      cpu: "200m"

Medium: Add Health Probes

Add liveness, readiness, and startup probes to a web application container that serves HTTP on port 3000 and takes up to 60 seconds to start.

Use a startup probe with failureThreshold * periodSeconds >= 60 so Kubernetes waits long enough. Then add liveness and readiness probes that kick in after startup succeeds.

startupProbe:
  httpGet:
    path: /healthz
    port: 3000
  periodSeconds: 5
  failureThreshold: 12   # 5 * 12 = 60s
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  periodSeconds: 5
  failureThreshold: 3

Medium: Write a Network Policy

Create a NetworkPolicy that allows a backend pod (label app: api) to receive traffic only from frontend pods and only on port 8080, and allows egress only to a postgres pod on port 5432.

Set policyTypes to both Ingress and Egress. Use podSelector under from and to to target specific pods. Don't forget to allow DNS egress to kube-system.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-policy
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: [Ingress, Egress]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  - to:   # Allow DNS
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53

Hard: Production-Ready Deployment

Write a complete production-ready Deployment with: resource limits, all three probes, pod anti-affinity across nodes, a PDB, graceful shutdown, security context running as non-root, and a rolling update strategy.

Combine all the patterns from this tutorial. Use podAntiAffinity with topologyKey: kubernetes.io/hostname, set runAsNonRoot: true, and add a preStop lifecycle hook. Create a separate PDB resource.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: web:1.2.3
        resources:
          requests: { cpu: "250m", memory: "256Mi" }
          limits:   { cpu: "500m", memory: "512Mi" }
        startupProbe:
          httpGet: { path: /healthz, port: 8080 }
          failureThreshold: 30
          periodSeconds: 10
        livenessProbe:
          httpGet: { path: /healthz, port: 8080 }
          periodSeconds: 10
        readinessProbe:
          httpGet: { path: /ready, port: 8080 }
          periodSeconds: 5
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "15"]
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web

Production Readiness Quick Reference

Before deploying to production, verify these four pillars:

Application Readiness

  • Health checks configured (liveness, readiness, startup)
  • Graceful shutdown handling
  • Resource requests and limits set
  • HPA configured
  • PDB defined
  • Anti-affinity rules for HA
  • Rolling update strategy configured
  • Secrets managed securely

Observability Readiness

  • Structured logging implemented
  • Metrics exposed and collected
  • Distributed tracing configured
  • Alerts and dashboards created
  • SLIs and SLOs defined
  • Error budgets established
  • Runbooks documented

Security Readiness

  • Pod Security Standards enforced
  • Network Policies implemented
  • RBAC with least privilege
  • Image scanning in CI/CD
  • Secrets encryption at rest
  • Audit logging enabled
  • Service mesh with mTLS
  • Regular security updates

Operational Readiness

  • GitOps workflow established
  • Backup and restore tested
  • Disaster recovery plan documented
  • Monitoring and alerting configured
  • On-call rotation established
  • Documentation up to date
  • Load testing completed
  • Chaos engineering practices