Platforms & Operations
Technologies and tools for software and infrastructure operations, including platforms and services for managing and scaling applications.

Dash0 (changed)

Dash0 is now adopted by all four teams, with the Cayenne team completing onboarding this quarter. Its OpenTelemetry-native pipeline and cost-transparent pricing model have proven ideal for our multi-team setup. Dash0 serves as the observability backend for all services running on Kubernetes: custom dashboards surface real-time business KPIs alongside technical metrics, and automatic instrumentation keeps onboarding effort minimal.
Why Dash0
We evaluated Dash0 alongside Datadog, Grafana Cloud, and New Relic. The decision came down to three factors:
OTel-native architecture — Dash0 ingests OTLP directly without requiring proprietary agents. This aligns with our OpenTelemetry strategy and avoids vendor lock-in at the instrumentation layer.
Transparent pricing — per-GB ingestion pricing with no hidden per-host or per-container fees. For our Kubernetes environment with high pod churn, this saves ~40% compared to per-host models.
Automatic instrumentation — Dash0's Kubernetes operator automatically instruments workloads without code changes, which was critical for onboarding the 50+ services already running.
Dashboard Strategy
Each team maintains two dashboard categories:
Technical dashboards — latency percentiles, error rates, pod resource utilization, deployment frequency
Business dashboards — conversion funnels, configurator session duration, API consumer usage patterns
A shared "platform health" dashboard aggregates cross-team SLOs, giving management a single pane for service reliability. Alerts route to PagerDuty with team-specific escalation policies.
Integration Architecture
```
Application Pods
└─ OTel SDK / auto-instrumentation
   └─ OTel Collector (DaemonSet)
      └─ OTLP export
         └─ Dash0 SaaS
            ├─ Metrics store
            ├─ Trace store
            └─ Log store
```
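The OTLP export hop can be sketched as a minimal Collector configuration. This is illustrative, not our production config: the Dash0 endpoint and the auth token variable are placeholders.

```yaml
# Minimal OTel Collector sketch: receive OTLP from pods, batch, forward to Dash0.
# Endpoint and token are placeholders; metrics/logs pipelines follow the same shape.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlp/dash0:
    endpoint: ingress.dash0.example:4317        # placeholder endpoint
    headers:
      Authorization: "Bearer ${DASH0_AUTH_TOKEN}"  # placeholder secret reference

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/dash0]
```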
FluentBit collects container logs and forwards them via OTLP, unifying all three signals in a single backend. Correlation between traces, metrics, and logs uses the W3C trace context propagated through service calls.

GitHub Actions (changed)

GitHub Actions remains our standard CI/CD platform. The Cayenne team has migrated back to a self-hosted GitLab CI instance for compliance reasons specific to their deployment targets, while the Macan team has onboarded and adopted our reusable workflow library. The Taycan and 911 teams continue to expand the library, which now covers 95% of our pipeline patterns.
Reusable Workflow Library
Our shared workflow library (@porsche-digital/gha-workflows) provides composable, tested CI/CD building blocks:
| Workflow | Purpose | Used by |
| --- | --- | --- |
| build-and-test.yml | Lint, typecheck, test, build | All repos |
| deploy-preview.yml | Deploy PR preview environments | Frontend repos |
| publish-npm.yml | Semantic release to npm | Library repos |
| docker-build.yml | Multi-arch image build + push to ECR | Backend repos |
| security-scan.yml | Trivy + CodeQL + dependency audit | All repos |
Teams configure pipelines by composing these workflows rather than writing custom YAML. A typical repository's .github/workflows/ci.yml is under 30 lines.
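As a sketch, a caller pipeline composed from the library might look like this. The GitHub org path and `@v1` ref are illustrative assumptions; inputs and secrets are omitted.

```yaml
# .github/workflows/ci.yml — composes reusable workflows instead of custom YAML
name: CI
on:
  push:
    branches: [main]
  pull_request:

permissions:
  contents: read   # explicit permissions block is mandatory in our setup

jobs:
  build-and-test:
    uses: porsche-digital/gha-workflows/.github/workflows/build-and-test.yml@v1
  security-scan:
    uses: porsche-digital/gha-workflows/.github/workflows/security-scan.yml@v1
    permissions:
      contents: read
      security-events: write   # assumed: CodeQL results upload needs this
```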
Self-Hosted Runners
We operate a fleet of self-hosted runners on Kubernetes for workloads that exceed GitHub-hosted runner capabilities:
Large builds — monorepo builds that need 16+ GB RAM
Hardware access — runners with USB passthrough for embedded device testing
Network access — runners in our VPC for integration tests against internal services
The runners auto-scale via actions-runner-controller, spinning up pods on demand and scaling to zero during off-hours.
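One way to express such a pool with actions-runner-controller's RunnerDeployment and HorizontalRunnerAutoscaler CRDs; the organization name, labels, and replica bounds below are placeholders, not our actual values.

```yaml
# Sketch: a scale-to-zero runner pool for large builds (all names illustrative)
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: large-build-runners
spec:
  template:
    spec:
      organization: porsche-digital          # placeholder org
      labels: ["self-hosted", "large-build"] # targeted via runs-on
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: large-build-runners
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: large-build-runners
  minReplicas: 0        # scale to zero during off-hours
  maxReplicas: 8
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {} # spin up a pod per queued workflow job
      duration: "30m"
```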
Security Posture
All workflows enforce:
Least-privilege tokens — permissions block is mandatory; workflows without explicit permissions fail PR checks
Pinned actions — third-party actions are pinned to commit SHA, not tags
Secret scanning — GitHub's secret scanning blocks pushes containing credentials
OIDC for cloud access — no long-lived AWS credentials; workflows authenticate to AWS via GitHub's OIDC provider
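In workflow terms, the posture above looks roughly like the following sketch. The commit SHA and role ARN are obvious placeholders, not real values.

```yaml
# Sketch of the enforced security posture in a single job
permissions:
  id-token: write   # required for the OIDC token exchange
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      # third-party action pinned to a commit SHA (placeholder), never a tag
      - uses: aws-actions/configure-aws-credentials@0123456789abcdef0123456789abcdef01234567
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-deploy  # placeholder role
          aws-region: eu-central-1
```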
Kubernetes (changed)

Kubernetes remains our standard container orchestration platform. Recent improvements include migrating to Cilium for eBPF-based networking, implementing Crossplane for infrastructure-as-code within K8s, and adopting Kyverno for policy enforcement. Our platform team now supports 50+ production services across all teams, with observability powered by OpenTelemetry and Dash0. We are actively consolidating AWS Lambda workloads onto the platform.
Platform Overview
We run Kubernetes on AWS EKS across three environments (dev, staging, production) with the following topology:
Production: 3 node groups (general, compute-optimized, GPU) across 3 availability zones
Staging: mirrors production topology at reduced scale
Dev: single node group with spot instances for cost efficiency
Each team operates in dedicated namespaces with resource quotas and network policies enforced by Kyverno. The platform team maintains shared infrastructure (ingress, cert-manager, external-dns, monitoring stack) via Helm charts in a GitOps repository.
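The per-namespace guardrails can be sketched as a standard ResourceQuota; the team namespace and the numbers below are illustrative, not our actual limits.

```yaml
# Illustrative per-team namespace quota (namespace and values are placeholders)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-taycan
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "200"
```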
Key Infrastructure Decisions
Networking: Cilium
We migrated from AWS VPC CNI to Cilium in Q1 2025. The primary drivers were:
eBPF-based observability — Hubble provides network flow visibility without sidecar overhead
Network policies at L7 — HTTP-aware policies replace our previous Istio service mesh for east-west traffic control
Performance — eBPF dataplane reduced pod-to-pod latency by ~15% compared to iptables-based routing
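An L7 policy of the kind that replaced our Istio east-west rules might look like this; the service names are illustrative.

```yaml
# Sketch: HTTP-aware east-west policy — only the frontend may GET the config API
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-config-api-read
spec:
  endpointSelector:
    matchLabels:
      app: vehicle-config-api   # illustrative service label
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: web-frontend   # illustrative caller
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: GET
                path: "/api/.*"  # L7 rule: read-only access
```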
Policy Enforcement: Kyverno
Kyverno replaced OPA Gatekeeper for policy enforcement:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-limits
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"
```
Our policy library enforces resource limits, image provenance (only our ECR registries), label standards, PodDisruptionBudget presence for production workloads, and a ban on mutable latest image tags.
GitOps: Flux
All cluster state is managed via Flux CD pointing at our infrastructure monorepo. The reconciliation loop ensures drift detection — any manual kubectl apply is automatically reverted within 5 minutes.
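A Flux Kustomization of this shape drives the reconciliation; the repository name and path are illustrative, and the 5-minute interval matches the drift-reversion window.

```yaml
# Sketch: Flux reconciles cluster state from the infrastructure monorepo
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 5m                 # drift is detected and reverted on this cadence
  sourceRef:
    kind: GitRepository
    name: infrastructure       # illustrative repo name
  path: ./clusters/production  # illustrative path
  prune: true                  # delete resources removed from Git
```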
Scaling Strategy
| Dimension | Approach | Tool |
| --- | --- | --- |
| Horizontal Pod | Request-based + custom metrics | KEDA |
| Vertical Pod | Right-sizing recommendations | VPA (recommend mode) |
| Cluster Nodes | Bin-packing aware provisioning | Karpenter |
| Cost | Spot instances for non-critical workloads | Karpenter + spot interruption handling |
Karpenter replaced Cluster Autoscaler in Q2 2025, reducing node provisioning time from ~90s to ~30s and improving bin-packing efficiency by 25%.
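A Karpenter NodePool for the spot-eligible tier could be sketched as follows; the pool name, limits, and node class are illustrative assumptions.

```yaml
# Sketch: spot-first node pool with consolidation for bin-packing efficiency
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # prefer spot, fall back to on-demand
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                    # illustrative node class
  limits:
    cpu: "1000"                          # illustrative cluster-wide cap
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```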
Observability Integration
Every pod gets automatic instrumentation via OpenTelemetry Collector running as a DaemonSet:
Metrics → Dash0 (via OTLP)
Traces → Dash0 (via OTLP)
Logs → Dash0 (via FluentBit → OTLP)
Kubernetes events and audit logs feed into our security monitoring pipeline for anomaly detection.
What's Next
We are evaluating Gateway API to replace our Ingress-based routing (more expressive, better multi-team support), and exploring vCluster for ephemeral preview environments to replace our current branch-based namespace approach.

Backstage (changed)

Backstage continues in assess, with the Taycan team joining the evaluation. Our custom plugins for service scaffolding and the API catalog are functional but require significant maintenance. We are weighing whether the developer-portal benefits justify the ongoing investment compared to lighter-weight alternatives.
What We Built
Our Backstage instance includes four custom plugins:
| Plugin | Purpose | Maintenance effort |
| --- | --- | --- |
| Service Scaffolder | Generates new services from team-specific templates | Medium — templates need updating when conventions change |
| API Catalog | Aggregates OpenAPI specs from all services into a searchable catalog | Low — mostly automated via CI |
| Tech Radar | Embedded tech radar visualization (ironically, this very radar) | Low — reads from published data |
| Runbook Viewer | Surfaces operational runbooks linked to service entities | High — keeping runbooks current is a cultural challenge |
The Maintenance Question
Backstage's plugin architecture is powerful but demands ongoing investment:
Upgrade cadence — Backstage releases weekly. We fell behind by 3 months and the catch-up migration took a full sprint.
Plugin API changes — the new frontend system (v2 plugin API) requires rewriting all our custom plugins
Build time — the Backstage app takes 4+ minutes to build; local development requires running a separate backend process
The 911 team estimates they spend ~15% of one engineer's time maintaining the Backstage instance.
Alternatives Under Consideration
Port (getport.io) — managed developer portal with lower maintenance overhead
Custom lightweight portal — a thin Next.js app consuming the same data sources, without the Backstage framework
Enhanced README standards — investing in service READMEs and GitHub Actions-powered automation instead of a centralized portal
Decision Criteria
We will decide by Q1 2026 based on:
Does the new frontend system reduce our maintenance burden?
Can we justify the engineering cost for 4 teams, or is this only valuable at larger scale?
Do developers actually use the portal daily, or just for onboarding?
Usage analytics show the API Catalog gets 50+ daily views, while the scaffolder is used roughly 3 times per month. This asymmetry suggests we might be better served by keeping the catalog and dropping the full Backstage instance.

OpenTelemetry (new)

OpenTelemetry is being assessed as a vendor-neutral observability framework. We already use Dash0 (which is OTel-native) for monitoring, and adopting OTel SDK instrumentation directly would reduce vendor lock-in and give us portable telemetry data across our Kubernetes infrastructure. The 911 and Macan teams are prototyping OTel SDK integration in two services to evaluate the migration path away from Datadog-native instrumentation.
Assessment Goals
Our evaluation focuses on three questions:
Migration effort — how much work to replace Datadog's dd-trace with OTel SDK instrumentation in a typical Node.js service?
Feature parity — do we lose any Datadog-specific features (profiling, runtime metrics, APM traces) that our teams depend on?
Operational overhead — does running the OTel Collector as a sidecar or DaemonSet add meaningful resource consumption or failure modes?
Pilot Services
| Service | Team | Stack | Status |
| --- | --- | --- | --- |
| Vehicle Config API | 911 | Node.js + Express | OTel SDK integrated, comparing metrics with Datadog |
| Telemetry Ingestion | Macan | Rust + Tokio | OTel Rust SDK integrated, testing trace export |
Both pilots run OTel and Datadog instrumentation in parallel for 4 weeks, comparing data quality, latency impact, and operational stability side-by-side.
Collector Architecture
We are evaluating a DaemonSet deployment (one Collector per node) rather than a sidecar pattern:
```
Pod (application)
└─ OTLP export (gRPC, port 4317)
   └─ OTel Collector (DaemonSet, same node)
      ├─ Processors: batch, memory limiter, resource detection
      └─ Exporter: OTLP → Dash0
```
The DaemonSet approach reduces per-pod overhead and simplifies configuration — applications only need to know the Collector's local endpoint, not the backend destination.
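On the application side, pointing the SDK at the node-local Collector is a small pod-spec fragment. The environment variable is the standard OTLP exporter setting; the downward-API pattern below is one common way to resolve the node IP, sketched here as an assumption about our setup.

```yaml
# Pod-spec fragment: route SDK telemetry to the Collector on the same node
env:
  - name: HOST_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP        # node IP via the downward API
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://$(HOST_IP):4317"     # gRPC OTLP port of the DaemonSet Collector
```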
Expected Outcome
If the assessment succeeds, we plan to standardize on OTel SDK instrumentation across all services, removing all proprietary agent dependencies. Dash0 remains the backend; OTel gives us the flexibility to change backends in the future without re-instrumenting application code.

AWS Lambda (changed)

AWS Lambda is moving to hold. While it served event-driven workloads well, the operational complexity of managing hundreds of Lambda functions has become a burden. We are consolidating serverless workloads into containerized services on Kubernetes, which gives us a more consistent operational model across our infrastructure.
Why Hold
Lambda was attractive early on for its zero-ops promise. In practice, at our scale, the operational reality diverged:
Function sprawl — we accumulated 180+ Lambda functions across three AWS accounts, many with unclear ownership
Cold starts — latency-sensitive APIs suffered p99 spikes of 3-5s during cold starts, requiring provisioned concurrency that eroded the cost advantage
Observability gaps — distributed traces across Lambda invocations required custom instrumentation; our Dash0 + OpenTelemetry stack works more naturally with long-running containers
Deployment complexity — each function had its own deployment pipeline, IAM role, and VPC configuration. Managing this at scale consumed more ops time than running containers on Kubernetes
Migration Plan
We are migrating Lambda functions to Kubernetes in three waves:
API handlers (completed) — REST APIs moved to Express.js containers on EKS
Event processors (in progress) — SQS/SNS consumers moving to KEDA-scaled pods on Kubernetes
Scheduled tasks (planned) — cron-triggered Lambdas moving to Kubernetes CronJobs
Functions that remain are truly event-driven, low-volume workloads where Lambda's scale-to-zero makes economic sense (e.g., S3 event processing, CloudFormation custom resources).
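For the SQS/SNS consumer wave, the KEDA-scaled replacement can be sketched as follows; the queue URL and threshold are placeholders, and trigger authentication is omitted for brevity.

```yaml
# Sketch: scale an event-processor Deployment on SQS queue depth, down to zero
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: event-processor
spec:
  scaleTargetRef:
    name: event-processor     # illustrative Deployment name
  minReplicaCount: 0          # preserves Lambda-like scale-to-zero
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.eu-central-1.amazonaws.com/123456789012/events  # placeholder
        queueLength: "50"     # target messages per replica
        awsRegion: eu-central-1
```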
Lessons Learned
Lambda is excellent for isolated, event-driven workloads at low-to-moderate scale. For organizations running a platform team with Kubernetes, the incremental cost of running another container is lower than the cognitive overhead of maintaining a parallel serverless ecosystem.