Ephemeral Kubernetes Preview Environments: How We Run 200 On-Demand Clusters for $11 Each

Most engineering teams pick one of two bad options for preview environments. They share a staging environment and fight over it. Or they give every engineer a dedicated environment that runs 24 hours a day, 7 days a week, and costs $400 a month per person.

We built a third option: ephemeral environments that spin up when a PR opens, tear down when it merges, and cost about $11 each. At 200 PR cycles a month, that is $2,200 in compute. The same 40-engineer team with always-on environments pays $16,000.

The cost difference comes from three specific architectural choices that most teams get wrong when they try this themselves.

Three Environment Models and Why Two Are Broken

Shared staging is the default for teams under 10 engineers. One environment, everyone shares it. It works until it doesn’t. Queue contention starts when 3-4 engineers need to test simultaneously. Each engineer is blocked for 30-60 minutes waiting for the environment to be free and stable. At 8 engineers and 4 deploys each per day, you are losing 2-4 hours of merge throughput daily to environment contention. Flaky tests accumulate because state left behind by the previous engineer’s test run bleeds into the next one.

Per-engineer always-on environments solve the contention problem and introduce a cost problem. A minimal EKS setup per engineer: a t3.medium node group (3 nodes), an Application Load Balancer, and an RDS PostgreSQL instance. That is $180/month in compute, $25/month in ALB, and $80/month in RDS, or $285/month at minimum. With networking and data transfer, $300-500/month per engineer is realistic. Forty engineers at a $400 average: $16,000/month for environments that sit idle 16 hours a day.

Ephemeral per-PR environments match environment lifetime to PR lifetime. A PR lives for 2-4 hours on average before merge or abandonment. The environment lives exactly as long as the PR. No idle hours. No contention. No shared state.

Model	Cost/month (40 engineers)	Queue wait	State isolation	Idle compute
Shared staging	~$400	30-60 min/engineer/day	None	Low
Per-engineer always-on	~$16,000	None	Full	16 hr/day per env
Ephemeral per-PR	~$2,200	None	Full	Near zero
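The two larger figures in the table are simple multiplications, and it helps to keep them in a form you can rerun with your own numbers. A minimal sketch, using only the headcount, PR volume, and unit costs quoted above; the function and constant names are illustrative:

```python
def monthly_cost(units: int, cost_per_unit: float) -> float:
    """Monthly spend for a model billed per unit (engineer or PR cycle)."""
    return units * cost_per_unit


# Per-engineer always-on: 40 engineers at a $400/month average.
ALWAYS_ON = monthly_cost(40, 400.0)

# Ephemeral per-PR: 200 PR cycles a month at $11 each.
EPHEMERAL = monthly_cost(200, 11.0)

SAVINGS = ALWAYS_ON - EPHEMERAL

print(f"always-on: ${ALWAYS_ON:,.0f}/month")
print(f"ephemeral: ${EPHEMERAL:,.0f}/month")
print(f"difference: ${SAVINGS:,.0f}/month")
```

Swap in your own PR volume before trusting the comparison: the ephemeral model scales with PR count, not with headcount.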

The Architecture: vcluster, Karpenter, and Spot

Three components make ephemeral environments fast enough and affordable enough to be usable.

vcluster creates a virtual Kubernetes cluster inside a namespace of a host cluster. Each virtual cluster gets its own API server, its own backing store (etcd or, depending on configuration, an embedded database), and full Kubernetes resource isolation. From inside the virtual cluster, it looks like a real cluster. From the host cluster’s perspective, it is a namespace with some pods. The host cluster’s actual nodes are shared across all virtual clusters. This is the key: you pay for one node pool, not one per environment.

Karpenter provisions new nodes in 60-90 seconds, choosing EC2 Spot instance types by price and available capacity. Without Karpenter, EKS managed node groups take 4-6 minutes to scale out. That wait makes ephemeral environments unusable: engineers will not wait 4 minutes or more for an environment to spin up on every PR. Karpenter brings that down to about 90 seconds, which is acceptable.

Spot instances cut node cost by 60-80% vs on-demand. The interruption risk is acceptable for preview environments: AWS gives a two-minute notice before reclaiming a Spot instance, and the worst outcome for a CI test environment is a failed pipeline run, not a production incident. Karpenter handles Spot interruptions by replacing the node automatically.
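To make the Karpenter and Spot pairing concrete, here is a sketch of a Spot-only NodePool built as a Python dictionary and printed as JSON. It assumes Karpenter's v1 NodePool API on AWS, so check the field names against the version you run. The pool name "preview", the EC2NodeClass name "default", and the 30-second consolidation window are illustrative assumptions, not our production values:

```python
import json


def spot_node_pool(name: str = "preview", node_class: str = "default",
                   consolidate_after: str = "30s") -> dict:
    """Build a Karpenter NodePool manifest restricted to Spot capacity.

    Assumes the karpenter.sh/v1 API; all names here are hypothetical.
    """
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Points at an EC2NodeClass assumed to exist already.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    # Only Spot capacity is allowed for preview workloads.
                    "requirements": [
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["spot"],
                        }
                    ],
                }
            },
            # Remove nodes shortly after they become empty or underused.
            "disruption": {
                "consolidationPolicy": "WhenEmptyOrUnderutilized",
                "consolidateAfter": consolidate_after,
            },
        },
    }


print(json.dumps(spot_node_pool(), indent=2))
```

The printed JSON can be applied with kubectl, which accepts JSON manifests as well as YAML. Instance-type and zone requirements are left out on purpose so Karpenter has the widest set of Spot pools to choose from.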

The Cost Breakdown: What $11 Buys and What Blows the Budget

The $11 breaks down across four components:

Cost component	Amount	Notes
Spot node compute (3 hr avg)	$6.40	t3.medium spot at $0.014/hr, 3 nodes, 3 hours
vcluster pod overhead	$0.80	API server + syncer pods on host cluster
Ingress controller share	$0.60	ALB cost amortized across 200 envs/month
Network egress	$1.20	5 GB avg per PR test run at $0.09/GB
Total	$9.00-$13.00	Varies with spot price and test duration
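A quick check of how the table relates to the headline number. The four itemized amounts sum to the low end of the range, and $11 is the midpoint of the $9.00-$13.00 range. A minimal sketch using only the amounts in the table; the dictionary keys are illustrative names:

```python
from decimal import Decimal

# Itemized amounts copied from the table above.
COMPONENTS = {
    "spot_node_compute": Decimal("6.40"),
    "vcluster_pod_overhead": Decimal("0.80"),
    "ingress_controller_share": Decimal("0.60"),
    "network_egress": Decimal("1.20"),
}

ITEMIZED_TOTAL = sum(COMPONENTS.values(), Decimal("0"))

# Quoted range for the total, and its midpoint.
LOW, HIGH = Decimal("9.00"), Decimal("13.00")
MIDPOINT = (LOW + HIGH) / 2

# Monthly compute at 200 environments priced at the midpoint.
MONTHLY = MIDPOINT * 200

print(f"itemized total: ${ITEMIZED_TOTAL}")
print(f"range midpoint: ${MIDPOINT}")
print(f"monthly at 200 environments: ${MONTHLY}")
```

Decimal is used instead of float so the cents add up exactly.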

The cost cliff hits when teams include a real database per environment. A dedicated RDS db.t3.micro instance costs $0.034/hour, so a 3-hour environment looks like $0.10 of database. But RDS has a minimum billing window of 1 hour, and provisioning takes 8-12 minutes. At 200 environments per month and 12 minutes each, provisioning latency alone adds up to 40 hours of billed time across the fleet. The real cost is $6.80 per environment just for the database, pushing total cost toward $20. Switch to one shared RDS instance with per-environment schema isolation (one schema per vcluster, created on spin-up) and database cost drops to under $0.50.

The rule: shared infrastructure with per-environment logical isolation. Never one physical resource per environment.
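For the database, logical isolation can be as small as one schema per PR on the shared instance. The sketch below only generates the SQL statements. The pr_<number> naming scheme and the function names are assumptions for illustration, and running the statements against PostgreSQL is left to whatever client you already use:

```python
def schema_name(pr_number: int) -> str:
    """Map a PR number to a schema name (hypothetical naming scheme)."""
    if pr_number <= 0:
        raise ValueError("PR number must be a positive integer")
    return f"pr_{pr_number}"


def create_schema_sql(pr_number: int) -> str:
    """Statement to run on environment spin-up; safe to run repeatedly."""
    return f'CREATE SCHEMA IF NOT EXISTS "{schema_name(pr_number)}";'


def drop_schema_sql(pr_number: int) -> str:
    """Statement to run on teardown; CASCADE drops the schema's tables too."""
    return f'DROP SCHEMA IF EXISTS "{schema_name(pr_number)}" CASCADE;'


print(create_schema_sql(42))
print(drop_schema_sql(42))
```

Because the name is derived only from a validated integer, there is no user-controlled text in the identifier. The application then needs its search_path, or its schema-qualified table names, pointed at that schema.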

CI/CD Integration: The PR Lifecycle Trigger

The trigger model uses three activity types of the GitHub Actions pull_request event: opened, synchronize, and closed.

opened and synchronize: create or update the vcluster. Deploy the application stack into the virtual cluster. Register the preview URL with the ingress controller. Post the preview URL as a PR comment.

closed: delete the vcluster and its namespace. The spot nodes that were serving only that vcluster get deprovisioned by Karpenter within 30 seconds if no other workloads need them.
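The mapping from PR activity to environment work is small enough to write down as a lookup. This is a sketch of the dispatch logic only: the step names are hypothetical labels for the actions described above, not real commands, and any activity type outside the three we subscribe to does nothing:

```python
from typing import List

# Steps for a new or updated PR, in the order described above.
CREATE_OR_UPDATE = [
    "create_or_update_vcluster",
    "deploy_application_stack",
    "register_preview_url",
    "post_preview_url_comment",
]

# Steps for a closed PR (merged or not).
TEARDOWN = [
    "delete_vcluster",
    "delete_namespace",
]


def steps_for(action: str) -> List[str]:
    """Return the ordered steps for a pull_request activity type."""
    if action in ("opened", "synchronize"):
        return list(CREATE_OR_UPDATE)
    if action == "closed":
        return list(TEARDOWN)
    return []  # Other activity types are ignored in this sketch.


for action in ("opened", "synchronize", "closed", "labeled"):
    print(action, "->", steps_for(action))
```

Treating opened and synchronize identically keeps the create path idempotent: a push to an existing PR reruns the same steps against the same vcluster.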

The abandoned PR case needs a separate cron. GitHub does not fire a closed event for PRs that authors stop updating without closing. A cleanup cron that runs every 4 hours deletes any vcluster whose associated PR has had no commits or comments in the last 8 hours. This eliminates zombie environments that outlive the useful life of their PR.
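The staleness rule is the part worth getting exactly right, so here it is in isolation. A minimal sketch, assuming the cron job has already fetched the last commit and last comment timestamps for each PR with a live environment; the PreviewEnv type and its field names are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

STALE_AFTER = timedelta(hours=8)


@dataclass(frozen=True)
class PreviewEnv:
    pr_number: int
    last_commit_at: datetime
    last_comment_at: datetime


def is_stale(env: PreviewEnv, now: datetime) -> bool:
    """True when the PR has had no commits or comments for 8 hours."""
    last_activity = max(env.last_commit_at, env.last_comment_at)
    return now - last_activity >= STALE_AFTER


def stale_prs(envs: Iterable[PreviewEnv], now: datetime) -> List[int]:
    """PR numbers whose environments the cleanup cron should delete."""
    return [env.pr_number for env in envs if is_stale(env, now)]


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
envs = [
    # Old commit but a recent comment: still active.
    PreviewEnv(101, now - timedelta(hours=9), now - timedelta(hours=1)),
    # No activity of either kind for 9 hours: stale.
    PreviewEnv(102, now - timedelta(hours=9), now - timedelta(hours=9)),
]
print(stale_prs(envs, now))
```

Passing now in as an argument, rather than reading the clock inside the function, keeps the rule easy to test.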

Three Failure Modes at Scale

Teams that attempt ephemeral environments at 50+ PRs per month reliably hit three specific failure modes.

Persistent Volumes. The moment a team adds a PersistentVolumeClaim to the environment spec, spin-up time jumps from 90 seconds to 4-6 minutes. AWS EBS volume provisioning averages 45-90 seconds per volume. With 3 services each mounting a PV, that is 3-4 minutes of storage provisioning before the first pod is ready. The fix: use emptyDir volumes in ephemeral environments. An emptyDir survives container restarts but is deleted along with its pod, so any data that must outlive a pod within the same environment should live in the shared RDS schema or a separate in-memory store, not block storage.
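One way to enforce the emptyDir rule is to rewrite pod specs before they reach the preview cluster. The sketch below works on a pod spec already loaded as a Python dictionary. The function name is hypothetical, and it does not handle StatefulSet volumeClaimTemplates, which would need separate treatment:

```python
import copy


def use_empty_dir(pod_spec: dict) -> dict:
    """Return a copy of a pod spec with PVC-backed volumes made emptyDir.

    Volume names are kept, so existing volumeMounts still resolve.
    """
    spec = copy.deepcopy(pod_spec)
    for volume in spec.get("volumes", []):
        if "persistentVolumeClaim" in volume:
            del volume["persistentVolumeClaim"]
            volume["emptyDir"] = {}
    return spec


original = {
    "containers": [{"name": "api", "image": "example/api:tag"}],
    "volumes": [
        {"name": "data", "persistentVolumeClaim": {"claimName": "data-pvc"}},
        {"name": "config", "configMap": {"name": "app-config"}},
    ],
}

rewritten = use_empty_dir(original)
print(rewritten["volumes"])
```

For a Deployment, the pod spec is the dictionary at spec.template.spec. Only PVC-backed volumes change; configMap, secret, and other volume types pass through untouched.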

DNS Wildcard with per-vcluster CoreDNS. Each vcluster runs its own CoreDNS instance for in-cluster service discovery. External DNS wildcard routing (*.preview.yourdomain.com) requires the host cluster’s ingress controller, not the vcluster’s CoreDNS. Teams that configure DNS at the vcluster level find that the preview URL resolves inside the virtual cluster but not from the internet. The fix: all external DNS registration goes through the host cluster’s ExternalDNS deployment. The vcluster only handles in-cluster service-to-service DNS.

Database state seeding. Some services require 500 MB of seed data to be functional. If seeding runs on every spin-up, a 90-second environment becomes a 12-minute environment. The fix: pre-seed a base schema snapshot at the host cluster level and copy the snapshot into each per-PR schema on creation. Restoring a pg_dump of a 500 MB database takes 45-60 seconds, versus 10 minutes of application-level seeding.

The platform-engineering work of building golden paths for self-service infrastructure applies directly here: the ephemeral environment spec is a golden path. Engineers should not configure it. They should invoke it. The moment configuration leaks to the application team, the three failure modes above multiply. Keep the environment spec in the platform team’s Terraform module. Application teams pass one input: the container image tag.

Muskan Bandta
