Platforms & Operations
Technologies and tools for software and infrastructure operations, including platforms and services for managing and scaling applications.

Dash0 (changed)

Dash0 is now adopted by all four teams, with the Cayenne team completing onboarding this quarter. Its OpenTelemetry-native pipeline and cost-transparent pricing model have proven ideal for our multi-team setup. Dash0 serves as the observability backend for all services running on Kubernetes: custom dashboards surface real-time business KPIs alongside technical metrics, and automatic instrumentation keeps onboarding effort minimal.
Why Dash0
We evaluated Dash0 alongside Datadog, Grafana Cloud, and New Relic. The decision came down to three factors:
OTel-native architecture — Dash0 ingests OTLP directly without requiring proprietary agents. This aligns with our OpenTelemetry strategy and avoids vendor lock-in at the instrumentation layer.
Transparent pricing — per-GB ingestion pricing with no hidden per-host or per-container fees. For our Kubernetes environment with high pod churn, this saves ~40% compared to per-host models.
Automatic instrumentation — Dash0's Kubernetes operator automatically instruments workloads without code changes, which was critical for onboarding the 50+ services already running.
Dashboard Strategy
Each team maintains two dashboard categories:
Technical dashboards — latency percentiles, error rates, pod resource utilization, deployment frequency
Business dashboards — conversion funnels, configurator session duration, API consumer usage patterns
A shared "platform health" dashboard aggregates cross-team SLOs, giving management a single pane for service reliability. Alerts route to PagerDuty with team-specific escalation policies.
Integration Architecture
```
Application Pods
└─ OTel SDK / auto-instrumentation
   └─ OTel Collector (DaemonSet)
      └─ OTLP export
         └─ Dash0 SaaS
            ├─ Metrics store
            ├─ Trace store
            └─ Log store
```
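The OTLP export hop can be sketched as a minimal Collector configuration. This is illustrative, not our production config: the Dash0 endpoint and the auth token variable are placeholders.

```yaml
# Minimal OTel Collector sketch: receive OTLP from pods, batch, forward to Dash0.
# Endpoint and token are placeholders; metrics/logs pipelines follow the same shape.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlp/dash0:
    endpoint: ingress.dash0.example:4317        # placeholder endpoint
    headers:
      Authorization: "Bearer ${DASH0_AUTH_TOKEN}"  # placeholder secret reference

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/dash0]
```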
FluentBit collects container logs and forwards them via OTLP, unifying all three signals in a single backend. Correlation between traces, metrics, and logs uses the W3C trace context propagated through service calls.

GitHub Actions (changed)

GitHub Actions remains our standard CI/CD platform. The Cayenne team has migrated back to a self-hosted GitLab CI instance for compliance reasons specific to their deployment targets, while the Macan team has onboarded and adopted our reusable workflow library. The Taycan and 911 teams continue to expand the library, which now covers 95% of our pipeline patterns.
Reusable Workflow Library
Our shared workflow library (@porsche-digital/gha-workflows) provides composable, tested CI/CD building blocks:
| Workflow | Purpose | Used by |
| --- | --- | --- |
| build-and-test.yml | Lint, typecheck, test, build | All repos |
| deploy-preview.yml | Deploy PR preview environments | Frontend repos |
| publish-npm.yml | Semantic release to npm | Library repos |
| docker-build.yml | Multi-arch image build + push to ECR | Backend repos |
| security-scan.yml | Trivy + CodeQL + dependency audit | All repos |
Teams configure pipelines by composing these workflows rather than writing custom YAML. A typical repository's .github/workflows/ci.yml is under 30 lines.
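As a sketch, a caller pipeline composed from the library might look like this. The GitHub org path and `@v1` ref are illustrative assumptions; inputs and secrets are omitted.

```yaml
# .github/workflows/ci.yml — composes reusable workflows instead of custom YAML
name: CI
on:
  push:
    branches: [main]
  pull_request:

permissions:
  contents: read   # explicit permissions block is mandatory in our setup

jobs:
  build-and-test:
    uses: porsche-digital/gha-workflows/.github/workflows/build-and-test.yml@v1
  security-scan:
    uses: porsche-digital/gha-workflows/.github/workflows/security-scan.yml@v1
    permissions:
      contents: read
      security-events: write   # assumed: CodeQL results upload needs this
```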
Self-Hosted Runners
We operate a fleet of self-hosted runners on Kubernetes for workloads that exceed GitHub-hosted runner capabilities:
Large builds — monorepo builds that need 16+ GB RAM
Hardware access — runners with USB passthrough for embedded device testing
Network access — runners in our VPC for integration tests against internal services
The runners auto-scale via actions-runner-controller, spinning up pods on demand and scaling to zero during off-hours.
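One way to express such a pool with actions-runner-controller's RunnerDeployment and HorizontalRunnerAutoscaler CRDs; the organization name, labels, and replica bounds below are placeholders, not our actual values.

```yaml
# Sketch: a scale-to-zero runner pool for large builds (all names illustrative)
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: large-build-runners
spec:
  template:
    spec:
      organization: porsche-digital          # placeholder org
      labels: ["self-hosted", "large-build"] # targeted via runs-on
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: large-build-runners
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: large-build-runners
  minReplicas: 0        # scale to zero during off-hours
  maxReplicas: 8
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {} # spin up a pod per queued workflow job
      duration: "30m"
```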
Security Posture
All workflows enforce:
Least-privilege tokens — permissions block is mandatory; workflows without explicit permissions fail PR checks
Pinned actions — third-party actions are pinned to commit SHA, not tags
Secret scanning — GitHub's secret scanning blocks pushes containing credentials
OIDC for cloud access — no long-lived AWS credentials; workflows authenticate to AWS via GitHub's OIDC provider
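In workflow terms, the posture above looks roughly like the following sketch. The commit SHA and role ARN are obvious placeholders, not real values.

```yaml
# Sketch of the enforced security posture in a single job
permissions:
  id-token: write   # required for the OIDC token exchange
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      # third-party action pinned to a commit SHA (placeholder), never a tag
      - uses: aws-actions/configure-aws-credentials@0123456789abcdef0123456789abcdef01234567
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-deploy  # placeholder role
          aws-region: eu-central-1
```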
Kubernetes (changed)

Kubernetes remains our standard container orchestration platform. Recent improvements include migrating to Cilium for eBPF-based networking, implementing Crossplane for infrastructure-as-code within K8s, and adopting Kyverno for policy enforcement. Our platform team now supports 50+ production services across all teams, with observability powered by OpenTelemetry and Dash0. We are actively consolidating AWS Lambda workloads onto the platform.
Platform Overview
We run Kubernetes on AWS EKS across three environments (dev, staging, production) with the following topology:
Production: 3 node groups (general, compute-optimized, GPU) across 3 availability zones
Staging: mirrors production topology at reduced scale
Dev: single node group with spot instances for cost efficiency
Each team operates in dedicated namespaces with resource quotas and network policies enforced by Kyverno. The platform team maintains shared infrastructure (ingress, cert-manager, external-dns, monitoring stack) via Helm charts in a GitOps repository.
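The per-namespace guardrails can be sketched as a standard ResourceQuota; the team namespace and the numbers below are illustrative, not our actual limits.

```yaml
# Illustrative per-team namespace quota (namespace and values are placeholders)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-taycan
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "200"
```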
Key Infrastructure Decisions
Networking: Cilium
We migrated from AWS VPC CNI to Cilium in Q1 2025. The primary drivers were:
eBPF-based observability — Hubble provides network flow visibility without sidecar overhead
Network policies at L7 — HTTP-aware policies replace our previous Istio service mesh for east-west traffic control
Performance — eBPF dataplane reduced pod-to-pod latency by ~15% compared to iptables-based routing
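An L7 policy of the kind that replaced our Istio east-west rules might look like this; the service names are illustrative.

```yaml
# Sketch: HTTP-aware east-west policy — only the frontend may GET the config API
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-config-api-read
spec:
  endpointSelector:
    matchLabels:
      app: vehicle-config-api   # illustrative service label
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: web-frontend   # illustrative caller
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: GET
                path: "/api/.*"  # L7 rule: read-only access
```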
Policy Enforcement: Kyverno
Kyverno replaced OPA Gatekeeper for policy enforcement:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-limits
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"
```
Our policy library enforces resource limits, image provenance (only our ECR registries), label standards, PodDisruptionBudget presence for production workloads, and a ban on mutable latest image tags.
GitOps: Flux
All cluster state is managed via Flux CD pointing at our infrastructure monorepo. The reconciliation loop ensures drift detection — any manual kubectl apply is automatically reverted within 5 minutes.
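A Flux Kustomization of this shape drives the reconciliation; the repository name and path are illustrative, and the 5-minute interval matches the drift-reversion window.

```yaml
# Sketch: Flux reconciles cluster state from the infrastructure monorepo
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 5m                 # drift is detected and reverted on this cadence
  sourceRef:
    kind: GitRepository
    name: infrastructure       # illustrative repo name
  path: ./clusters/production  # illustrative path
  prune: true                  # delete resources removed from Git
```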
Scaling Strategy
| Dimension | Approach | Tool |
| --- | --- | --- |
| Horizontal Pod | Request-based + custom metrics | KEDA |
| Vertical Pod | Right-sizing recommendations | VPA (recommend mode) |
| Cluster Nodes | Bin-packing aware provisioning | Karpenter |
| Cost | Spot instances for non-critical workloads | Karpenter + spot interruption handling |
Karpenter replaced Cluster Autoscaler in Q2 2025, reducing node provisioning time from ~90s to ~30s and improving bin-packing efficiency by 25%.
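A Karpenter NodePool for the spot-eligible tier could be sketched as follows; the pool name, limits, and node class are illustrative assumptions.

```yaml
# Sketch: spot-first node pool with consolidation for bin-packing efficiency
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # prefer spot, fall back to on-demand
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                    # illustrative node class
  limits:
    cpu: "1000"                          # illustrative cluster-wide cap
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```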
Observability Integration
Every pod gets automatic instrumentation via OpenTelemetry Collector running as a DaemonSet:
Metrics → Dash0 (via OTLP)
Traces → Dash0 (via OTLP)
Logs → Dash0 (via FluentBit → OTLP)
Kubernetes events and audit logs feed into our security monitoring pipeline for anomaly detection.
What's Next
We are evaluating Gateway API to replace our Ingress-based routing (more expressive, better multi-team support), and exploring vCluster for ephemeral preview environments to replace our current branch-based namespace approach.

Backstage (changed)

Backstage continues in assess, with the Taycan team joining the evaluation. Our custom plugins for service scaffolding and the API catalog are functional but require significant maintenance. We are weighing whether the developer-portal benefits justify the ongoing investment compared to lighter-weight alternatives.
What We Built
Our Backstage instance includes four custom plugins:
| Plugin | Purpose | Maintenance effort |
| --- | --- | --- |
| Service Scaffolder | Generates new services from team-specific templates | Medium — templates need updating when conventions change |
| API Catalog | Aggregates OpenAPI specs from all services into a searchable catalog | Low — mostly automated via CI |
| Tech Radar | Embedded tech radar visualization (ironically, this very radar) | Low — reads from published data |
| Runbook Viewer | Surfaces operational runbooks linked to service entities | High — keeping runbooks current is a cultural challenge |
The Maintenance Question
Backstage's plugin architecture is powerful but demands ongoing investment:
Upgrade cadence — Backstage releases weekly. We fell behind by 3 months and the catch-up migration took a full sprint.
Plugin API changes — the new frontend system (v2 plugin API) requires rewriting all our custom plugins
Build time — the Backstage app takes 4+ minutes to build; local development requires running a separate backend process
The 911 team estimates they spend ~15% of one engineer's time maintaining the Backstage instance.
Alternatives Under Consideration
Port (getport.io) — managed developer portal with lower maintenance overhead
Custom lightweight portal — a thin Next.js app consuming the same data sources, without the Backstage framework
Enhanced README standards — investing in service READMEs and GitHub Actions-powered automation instead of a centralized portal
Decision Criteria
We will decide by Q1 2026 based on:
Does the new frontend system reduce our maintenance burden?
Can we justify the engineering cost for 4 teams, or is this only valuable at larger scale?
Do developers actually use the portal daily, or just for onboarding?
Usage analytics show the API Catalog gets 50+ daily views, while the scaffolder is used roughly 3 times per month. This asymmetry suggests we might be better served by keeping the catalog and dropping the full Backstage instance.

OpenTelemetry (new)

OpenTelemetry is being assessed as a vendor-neutral observability framework. We already use Dash0 (which is OTel-native) for monitoring, and adopting OTel SDK instrumentation directly would reduce vendor lock-in and give us portable telemetry data across our Kubernetes infrastructure. The 911 and Macan teams are prototyping OTel SDK integration in two services to evaluate the migration path away from Datadog-native instrumentation.
Assessment Goals
Our evaluation focuses on three questions:
Migration effort — how much work to replace Datadog's dd-trace with OTel SDK instrumentation in a typical Node.js service?
Feature parity — do we lose any Datadog-specific features (profiling, runtime metrics, APM traces) that our teams depend on?
Operational overhead — does running the OTel Collector as a sidecar or DaemonSet add meaningful resource consumption or failure modes?
Pilot Services
| Service | Team | Stack | Status |
| --- | --- | --- | --- |
| Vehicle Config API | 911 | Node.js + Express | OTel SDK integrated, comparing metrics with Datadog |
| Telemetry Ingestion | Macan | Rust + Tokio | OTel Rust SDK integrated, testing trace export |
Both pilots run OTel and Datadog instrumentation in parallel for 4 weeks, comparing data quality, latency impact, and operational stability side-by-side.
Collector Architecture
We are evaluating a DaemonSet deployment (one Collector per node) rather than a sidecar pattern:
```
Pod (application)
└─ OTLP export (gRPC, port 4317)
   └─ OTel Collector (DaemonSet, same node)
      ├─ Processors: batch, memory limiter, resource detection
      └─ Exporter: OTLP → Dash0
```
The DaemonSet approach reduces per-pod overhead and simplifies configuration — applications only need to know the Collector's local endpoint, not the backend destination.
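On the application side, pointing the SDK at the node-local Collector is a small pod-spec fragment. The environment variable is the standard OTLP exporter setting; the downward-API pattern below is one common way to resolve the node IP, sketched here as an assumption about our setup.

```yaml
# Pod-spec fragment: route SDK telemetry to the Collector on the same node
env:
  - name: HOST_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP        # node IP via the downward API
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://$(HOST_IP):4317"     # gRPC OTLP port of the DaemonSet Collector
```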
Expected Outcome
If the assessment succeeds, we plan to standardize on OTel SDK instrumentation across all services, removing all proprietary agent dependencies. Dash0 remains the backend; OTel gives us the flexibility to change backends in the future without re-instrumenting application code.

AWS Lambda (changed)

AWS Lambda is moving to hold. While it served event-driven workloads well, the operational complexity of managing hundreds of Lambda functions has become a burden. We are consolidating serverless workloads into containerized services on Kubernetes, which gives us a more consistent operational model across our infrastructure.
Why Hold
Lambda was attractive early on for its zero-ops promise. In practice, at our scale, the operational reality diverged:
Function sprawl — we accumulated 180+ Lambda functions across three AWS accounts, many with unclear ownership
Cold starts — latency-sensitive APIs suffered p99 spikes of 3-5s during cold starts, requiring provisioned concurrency that eroded the cost advantage
Observability gaps — distributed traces across Lambda invocations required custom instrumentation; our Dash0 + OpenTelemetry stack works more naturally with long-running containers
Deployment complexity — each function had its own deployment pipeline, IAM role, and VPC configuration. Managing this at scale consumed more ops time than running containers on Kubernetes
Migration Plan
We are migrating Lambda functions to Kubernetes in three waves:
API handlers (completed) — REST APIs moved to Express.js containers on EKS
Event processors (in progress) — SQS/SNS consumers moving to KEDA-scaled pods on Kubernetes
Scheduled tasks (planned) — cron-triggered Lambdas moving to Kubernetes CronJobs
Functions that remain are truly event-driven, low-volume workloads where Lambda's scale-to-zero makes economic sense (e.g., S3 event processing, CloudFormation custom resources).
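For the SQS/SNS consumer wave, the KEDA-scaled replacement can be sketched as follows; the queue URL and threshold are placeholders, and trigger authentication is omitted for brevity.

```yaml
# Sketch: scale an event-processor Deployment on SQS queue depth, down to zero
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: event-processor
spec:
  scaleTargetRef:
    name: event-processor     # illustrative Deployment name
  minReplicaCount: 0          # preserves Lambda-like scale-to-zero
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.eu-central-1.amazonaws.com/123456789012/events  # placeholder
        queueLength: "50"     # target messages per replica
        awsRegion: eu-central-1
```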
Lessons Learned
Lambda is excellent for isolated, event-driven workloads at low-to-moderate scale. For organizations running a platform team with Kubernetes, the incremental cost of running another container is lower than the cognitive overhead of maintaining a parallel serverless ecosystem.