Adopt
Platforms & Operations
Since Jul 2024

Kubernetes remains our standard container orchestration platform. Recent improvements include migration to Cilium for eBPF-based networking, implementation of Crossplane for infrastructure-as-code within K8s, and adoption of Kyverno for policy enforcement. Our platform team now supports 50+ production services across all teams, with observability powered by OpenTelemetry and Dash0. We are actively consolidating AWS Lambda workloads onto the platform.

Platform Overview

We run Kubernetes on AWS EKS across three environments (dev, staging, production) with the following topology:

  • Production: 3 node groups (general, compute-optimized, GPU) across 3 availability zones
  • Staging: mirrors production topology at reduced scale
  • Dev: single node group with spot instances for cost efficiency

Each team operates in dedicated namespaces with resource quotas and network policies enforced by Kyverno. The platform team maintains shared infrastructure (ingress, cert-manager, external-dns, monitoring stack) via Helm charts in a GitOps repository.
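As a sketch of the per-team namespace setup (the namespace name and quota figures below are illustrative, not our actual values), each namespace carries a ResourceQuota along these lines:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a        # illustrative team namespace
spec:
  hard:
    requests.cpu: "20"     # illustrative figures, not our real quotas
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
```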

Key Infrastructure Decisions

Networking: Cilium

We migrated from AWS VPC CNI to Cilium in Q1 2025. The primary drivers were:

  • eBPF-based observability — Hubble provides network flow visibility without sidecar overhead
  • Network policies at L7 — HTTP-aware policies replace our previous Istio service mesh for east-west traffic control
  • Performance — eBPF dataplane reduced pod-to-pod latency by ~15% compared to iptables-based routing
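The L7 policies mentioned above take roughly this shape; the service names, port, and path here are hypothetical:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-api   # hypothetical policy name
spec:
  endpointSelector:
    matchLabels:
      app: api                  # hypothetical backend service
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend       # hypothetical caller
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: GET     # HTTP-aware rule: only GETs on /v1/*
                path: "/v1/.*"
```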

Policy Enforcement: Kyverno

Kyverno replaced OPA Gatekeeper for policy enforcement:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-limits
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"

Our policy library enforces: resource limits, image provenance (only our ECR registries), label standards, PodDisruptionBudget presence for production workloads, and prohibition of the latest image tag.
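For example, the latest-tag prohibition can be expressed as a validation rule of the same kind as the policy above (a sketch, not our exact rule):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag    # illustrative policy name
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-tag
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must use a pinned tag, not 'latest'."
        pattern:
          spec:
            containers:
              # Kyverno negation pattern: image must not end in :latest
              - image: "!*:latest"
```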

GitOps: Flux

All cluster state is managed via Flux CD pointing at our infrastructure monorepo. The reconciliation loop provides drift detection and correction: any manual kubectl apply is reverted within about 5 minutes.
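Reconciliation is driven by Flux Kustomization objects along these lines (the repository name and path are illustrative):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-infrastructure
  namespace: flux-system
spec:
  interval: 5m               # matches the ~5-minute drift-revert window
  sourceRef:
    kind: GitRepository
    name: infrastructure     # hypothetical monorepo name
  path: ./clusters/production
  prune: true                # delete resources removed from Git
```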

Scaling Strategy

Dimension        Approach                               Tool
Horizontal Pod   Request-based + custom metrics         KEDA
Vertical Pod     Right-sizing recommendations           VPA (recommend mode)
Cluster Nodes    Bin-packing aware provisioning         Karpenter
Cost             Spot instances for non-critical work   Karpenter + spot interruption handling

Karpenter replaced Cluster Autoscaler in Q2 2025, reducing node provisioning time from ~90s to ~30s and improving bin-packing efficiency by 25%.
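A Karpenter NodePool for the spot-tolerant tier might look like this sketch (the pool name, limits, and node class are illustrative):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-spot              # hypothetical pool name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]        # non-critical workloads only
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default             # hypothetical node class
  limits:
    cpu: "200"                    # illustrative cap on pool size
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```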

Observability Integration

Every pod gets automatic instrumentation via OpenTelemetry Collector running as a DaemonSet:

  • Metrics → Dash0 (via OTLP)
  • Traces → Dash0 (via OTLP)
  • Logs → Dash0 (via FluentBit → OTLP)

Kubernetes events and audit logs feed into our security monitoring pipeline for anomaly detection.
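Conceptually, the Collector's pipeline ships everything over OTLP like the sketch below; the Dash0 endpoint and the token variable are placeholders, not our real configuration:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp:
    endpoint: ingress.example.dash0.com:4317          # placeholder endpoint
    headers:
      Authorization: "Bearer ${env:DASH0_AUTH_TOKEN}" # placeholder token
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      exporters: [otlp]
```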

What's Next

We are evaluating the Gateway API to replace our Ingress-based routing (more expressive matching, better multi-team ownership of routes), and exploring vCluster for ephemeral preview environments to replace our current branch-based namespace approach.
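Under the Gateway API, the per-team routing we currently express as Ingress objects would become HTTPRoutes attached to a platform-owned Gateway, roughly like this sketch (all names are hypothetical):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-route             # hypothetical route name
  namespace: team-a           # team-owned route
spec:
  parentRefs:
    - name: shared-gateway    # hypothetical platform-owned Gateway
      namespace: infra
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: api           # hypothetical backend Service
          port: 8080
```

The split between a platform-managed Gateway and team-managed HTTPRoutes is what "better multi-team support" refers to: teams own their routes without touching shared listener or TLS configuration.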

History

Jan 2025 · Adopt (taycan)
Kubernetes adoption is now organization-wide. All four teams run production workloads on our multi-c…

Jul 2024 · Trial → Adopt (cayenne)
Kubernetes has moved to adopt. All production workloads now run on managed EKS clusters with Istio s…

Jan 2024 · Trial (911, macan)
Kubernetes is being trialed as our container orchestration platform. The 911 team has migrated two m…