Most engineering teams pick one of two bad options for preview environments. They share a staging environment and fight over it. Or they give every engineer a dedicated environment that runs 24 hours a day, 7 days a week, and costs $400 a month per person.
We built a third option. Ephemeral environments that spin up when a PR opens, tear down when it merges, and cost $11 each. At 200 PR cycles a month, that is $2,200 in compute. The same 40-engineer team with always-on environments pays $16,000.
The cost difference comes from three specific architectural choices that most teams get wrong when they try this themselves.
Three Environment Models and Why Two Are Broken
Shared staging is the default for teams under 10 engineers. One environment, everyone shares it. It works until it doesn’t. Queue contention starts when 3-4 engineers need to test simultaneously. Each engineer is blocked for 30-60 minutes a day waiting for the environment to be free and stable. At 8 engineers and 4 deploys each per day, you are losing 2-4 hours of merge throughput daily to environment contention. Flaky tests accumulate because state left behind by the previous engineer’s test run bleeds into the next one.
Per-engineer always-on environments solve the contention problem and introduce a cost problem. A minimal EKS setup per engineer: a t3.medium node group (3 nodes), an Application Load Balancer, and a shared RDS PostgreSQL instance. That is $180/month in compute, $25/month in ALB, and $80/month in RDS, at minimum. With networking and data transfer, $300-500/month per engineer is realistic. Forty engineers at $400 average: $16,000/month for environments that sit idle 16 hours a day.
Ephemeral per-PR environments match environment lifetime to PR lifetime. A PR lives for 2-4 hours on average before merge or abandonment. The environment lives exactly as long as the PR. No idle hours. No contention. No shared state.
| Model | Cost/month (40 engineers) | Queue wait | State isolation | Idle compute |
|---|---|---|---|---|
| Shared staging | ~$400 | 30-60 min/engineer/day | None | Low |
| Per-engineer always-on | ~$16,000 | None | Full | 16 hr/day per env |
| Ephemeral per-PR | ~$2,200 | None | Full | Near zero |
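
The two headline figures in the table follow from simple multiplication. As a minimal sketch using only the numbers quoted above (the class and function names are illustrative, not part of any tool):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamProfile:
    """Inputs taken from the article: team size and monthly PR volume."""
    engineers: int
    pr_cycles_per_month: int


def always_on_monthly_cost(team: TeamProfile, cost_per_engineer: int = 400) -> int:
    # Always-on environments are billed per engineer, regardless of usage.
    return team.engineers * cost_per_engineer


def ephemeral_monthly_cost(team: TeamProfile, cost_per_env: int = 11) -> int:
    # Ephemeral environments are billed per PR cycle, not per engineer.
    return team.pr_cycles_per_month * cost_per_env


team = TeamProfile(engineers=40, pr_cycles_per_month=200)
print(always_on_monthly_cost(team))   # 16000
print(ephemeral_monthly_cost(team))   # 2200
```

The useful property is which variable each model scales with: headcount for always-on, PR volume for ephemeral.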
The Architecture: vcluster, Karpenter, and Spot
Three components make ephemeral environments fast enough and affordable enough to be usable.
vcluster creates a virtual Kubernetes cluster inside a namespace of a host cluster. Each virtual cluster gets its own API server, its own etcd, and full Kubernetes resource isolation. From inside the virtual cluster, it looks like a real cluster. From the host cluster’s perspective, it is a namespace with some pods. The host cluster’s actual nodes are shared across all virtual clusters. This is the key: you pay for one node pool, not one per environment.
Karpenter provisions new nodes in 60-90 seconds using EC2 spot instances selected by the best available price in the current availability zone. Without Karpenter, EKS managed node groups take 4-6 minutes to scale out. That wait makes ephemeral environments unusable: engineers will not wait 4 minutes or more for an environment to spin up on every PR. Karpenter brings that down to 90 seconds or less, which is acceptable.
Spot instances cut node cost by 60-80% vs on-demand. The spot interruption risk is acceptable for preview environments: EC2 gives a 2-minute warning before reclaiming a spot instance, and losing a CI test environment that way is a failed pipeline run, not a production incident. Karpenter handles spot interruption with automatic node replacement.
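For the vcluster piece, the per-PR lifecycle comes down to two CLI invocations against the host cluster. The sketch below only builds the argument lists; it does not run anything. The `pr-<number>` and `preview-pr-<number>` naming scheme is an assumption made for illustration, and the flags should be checked against the vcluster version you run.

```python
import shlex


def vcluster_names(pr_number: int) -> tuple[str, str]:
    """Return (vcluster name, host namespace) for a PR. Naming is hypothetical."""
    if pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    name = f"pr-{pr_number}"
    return name, f"preview-{name}"


def create_command(pr_number: int) -> list[str]:
    name, namespace = vcluster_names(pr_number)
    # --connect=false keeps the CLI from switching the caller's kube context,
    # which is usually what a CI job wants.
    return ["vcluster", "create", name, "--namespace", namespace, "--connect=false"]


def delete_command(pr_number: int) -> list[str]:
    name, namespace = vcluster_names(pr_number)
    return ["vcluster", "delete", name, "--namespace", namespace]


print(shlex.join(create_command(123)))
# vcluster create pr-123 --namespace preview-pr-123 --connect=false
```

Returning an argument list rather than a shell string keeps the PR number out of shell interpolation when the command is eventually passed to `subprocess.run`.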
The Cost Breakdown: What $11 Buys and What Blows the Budget
The $11 average sits in the middle of a $9-$13 range. The baseline breaks down across four components:
| Cost component | Amount | Notes |
|---|---|---|
| Spot node compute (3 hr avg) | $6.40 | t3.medium spot at $0.014/hr, 3 nodes, 3 hours |
| vcluster pod overhead | $0.80 | API server + syncer pods on host cluster |
| Ingress controller share | $0.60 | ALB cost amortized across 200 envs/month |
| Network egress | $1.20 | 5 GB avg per PR test run at $0.09/GB |
| Total | $9.00-$13.00 | Varies with spot price and test duration |
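
As a quick check on the table, the four line items can be summed and projected to a month of PR volume. This sketch uses only the amounts listed above, held as integer cents to avoid floating-point rounding; the names are illustrative.

```python
# Per-environment line items from the table, in cents.
COMPONENTS_CENTS = {
    "spot_node_compute": 640,
    "vcluster_pod_overhead": 80,
    "ingress_controller_share": 60,
    "network_egress": 120,
}

baseline_cents = sum(COMPONENTS_CENTS.values())


def monthly_range_dollars(envs_per_month: int,
                          low_cents: int = 900,
                          high_cents: int = 1300) -> tuple[int, int]:
    """Project the per-environment range from the Total row to a monthly bill."""
    return (envs_per_month * low_cents // 100, envs_per_month * high_cents // 100)


print(f"${baseline_cents / 100:.2f}")   # $9.00
print(monthly_range_dollars(200))       # (1800, 2600)
```

The listed components land at the low end of the range, and the $2,200 monthly figure sits between the two projected bounds.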
The cost cliff hits when teams include a real database per environment. A dedicated RDS db.t3.micro instance costs $0.034/hour. For a 3-hour environment, that is $0.10 on paper. But RDS has a minimum billing window of 1 hour, and provisioning takes 8-12 minutes. With 200 environments per month, provisioning latency alone adds up to 40 hours of billable instance time across the fleet (200 × 12 minutes). The real cost is $6.80 per environment just for the database, pushing total cost past $20. Switch to one shared RDS instance with per-environment schema isolation (one schema per vcluster, created on spin-up) and database cost drops to under $0.50.
The rule: shared infrastructure with per-environment logical isolation. Never one physical resource per environment.
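For the database, "logical isolation" can be as small as one PostgreSQL schema per PR on the shared instance. A minimal sketch, assuming a hypothetical `pr_<number>` schema naming scheme; it only generates the SQL and leaves execution to whatever client the platform already uses.

```python
def schema_name(pr_number: int) -> str:
    """Build the per-PR schema name. Accepting only positive ints keeps the
    result a safe SQL identifier without any quoting."""
    if not isinstance(pr_number, int) or isinstance(pr_number, bool) or pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    return f"pr_{pr_number}"


def create_schema_sql(pr_number: int) -> str:
    # Idempotent, so a re-run on a later push to the same PR is harmless.
    return f"CREATE SCHEMA IF NOT EXISTS {schema_name(pr_number)};"


def drop_schema_sql(pr_number: int) -> str:
    # CASCADE removes every table the environment created inside its schema.
    return f"DROP SCHEMA IF EXISTS {schema_name(pr_number)} CASCADE;"


print(create_schema_sql(123))  # CREATE SCHEMA IF NOT EXISTS pr_123;
print(drop_schema_sql(123))    # DROP SCHEMA IF EXISTS pr_123 CASCADE;
```

Both statements are idempotent, which matters because the create path runs on every push and the drop path may run from both the close handler and a cleanup job.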
CI/CD Integration: The PR Lifecycle Trigger
The trigger model uses three activity types of the GitHub Actions pull_request event: opened, synchronize, and closed.
opened and synchronize: create or update the vcluster. Deploy the application stack into the virtual cluster. Register the preview URL with the ingress controller. Post the preview URL as a PR comment.
closed: delete the vcluster and its namespace. The spot nodes that were serving only that vcluster get deprovisioned by Karpenter within 30 seconds if no other workloads need them.
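The lifecycle above amounts to a small dispatch on the event's `action` field. The sketch below returns the ordered steps as plain strings rather than executing them; the step wording and the `pr-<number>` environment name are illustrative assumptions.

```python
def plan_for_event(action: str, pr_number: int) -> list[str]:
    """Map a pull_request activity type to the ordered steps for that PR."""
    env = f"pr-{pr_number}"  # hypothetical naming scheme
    if action in ("opened", "synchronize"):
        return [
            f"create or update vcluster {env}",
            f"deploy application stack into {env}",
            f"register preview URL for {env} with the ingress controller",
            f"post preview URL as a comment on PR #{pr_number}",
        ]
    if action == "closed":
        return [
            f"delete vcluster {env}",
            f"delete namespace of {env}",
        ]
    # Any other activity type is ignored in this sketch.
    return []


for step in plan_for_event("opened", 123):
    print(step)
```

Treating opened and synchronize identically only works if the create step is idempotent, so a second push updates the existing environment instead of failing.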
The abandoned PR case needs a separate cron. GitHub does not fire a closed event for PRs that authors stop updating without closing. A cleanup cron that runs every 4 hours deletes any vcluster whose associated PR has had no commits or comments in the last 8 hours. This eliminates zombie environments that drift past the PR lifetime.
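The selection logic inside that cleanup job is a timestamp comparison. A minimal sketch, assuming the job has already collected the most recent commit or comment time for each environment's PR (how it gets them, for example from the GitHub API, is left out); the function and variable names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=8)  # inactivity threshold from the text


def stale_environments(last_activity: dict[str, datetime], now: datetime) -> list[str]:
    """Return environment names whose PR has been inactive for 8 hours or more."""
    return sorted(
        name for name, seen in last_activity.items()
        if now - seen >= STALE_AFTER
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
activity = {
    "pr-101": now - timedelta(hours=9),   # abandoned
    "pr-102": now - timedelta(hours=2),   # still active
    "pr-103": now - timedelta(hours=8),   # exactly at the threshold
}
print(stale_environments(activity, now))  # ['pr-101', 'pr-103']
```

Passing `now` in as an argument instead of calling `datetime.now()` inside the function keeps the rule easy to test.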
Three Failure Modes at Scale
Teams that attempt ephemeral environments at 50+ PRs per month reliably hit three specific failure modes.
Persistent Volumes. The moment a team adds a PersistentVolumeClaim to the environment spec, spin-up time jumps from 90 seconds to 4-6 minutes. AWS EBS volume provisioning averages 45-90 seconds per volume. With 3 services each mounting a PV, that is 3-4 minutes of storage provisioning before the first pod is ready. The fix: use emptyDir volumes in ephemeral environments. Any data that needs to persist across pod restarts within the same environment should use in-memory stores or the shared RDS schema, not block storage.
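One way to keep PersistentVolumeClaims out of the environment spec is a lint step that runs before deploy. The sketch below inspects a manifest that has already been parsed into a dict (for example by a YAML loader) and reports any PVC-backed volumes; the function name and the idea of running it as a gate are assumptions for illustration.

```python
def pvc_volume_names(manifest: dict) -> list[str]:
    """Return names of volumes in a workload manifest that are backed by a PVC."""
    spec = manifest.get("spec", {})
    # A bare Pod keeps volumes at spec.volumes; Deployments, StatefulSets and
    # similar workloads keep them at spec.template.spec.volumes.
    if manifest.get("kind") == "Pod":
        pod_spec = spec
    else:
        pod_spec = spec.get("template", {}).get("spec", {})
    names = [
        volume["name"]
        for volume in pod_spec.get("volumes", [])
        if "persistentVolumeClaim" in volume
    ]
    # StatefulSets can also request storage through volumeClaimTemplates.
    names += [
        template.get("metadata", {}).get("name", "")
        for template in spec.get("volumeClaimTemplates", [])
    ]
    return names


deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"volumes": [
        {"name": "scratch", "emptyDir": {}},
        {"name": "data", "persistentVolumeClaim": {"claimName": "data-pvc"}},
    ]}}},
}
print(pvc_volume_names(deployment))  # ['data']
```

A non-empty result would fail the pipeline, which pushes the team toward emptyDir before the slow spin-up ever reaches an engineer.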
DNS Wildcard with per-vcluster CoreDNS. Each vcluster runs its own CoreDNS instance for in-cluster service discovery. External DNS wildcard routing (*.preview.yourdomain.com) requires the host cluster’s ingress controller, not the vcluster’s CoreDNS. Teams that configure DNS at the vcluster level find that the preview URL resolves inside the virtual cluster but not from the internet. The fix: all external DNS registration goes through the host cluster’s ExternalDNS deployment. The vcluster only handles in-cluster service-to-service DNS.
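On the host cluster side, each preview URL is one label under the wildcard. A small sketch of how a hostname might be derived and checked against the wildcard; the `pr-<number>` label is a hypothetical scheme, and the domain is the placeholder used in the text.

```python
PREVIEW_DOMAIN = "preview.yourdomain.com"  # placeholder domain from the text


def preview_hostname(pr_number: int, domain: str = PREVIEW_DOMAIN) -> str:
    """Build the external hostname that the host cluster's ingress will serve."""
    if pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    return f"pr-{pr_number}.{domain}"


def matches_wildcard(hostname: str, domain: str = PREVIEW_DOMAIN) -> bool:
    """True if hostname is covered by *.domain, i.e. exactly one extra label."""
    suffix = "." + domain
    if not hostname.endswith(suffix):
        return False
    label = hostname[: -len(suffix)]
    return bool(label) and "." not in label


host = preview_hostname(123)
print(host)                    # pr-123.preview.yourdomain.com
print(matches_wildcard(host))  # True
```

Keeping the hostname to a single label matters because a wildcard record covers only one level, so nested names like `api.pr-123.preview.yourdomain.com` would not resolve.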
Database state seeding. Some services require 500 MB of seed data to be functional. If seeding runs on every spin-up, a 90-second environment becomes a 12-minute environment. The fix: pre-seed a base schema snapshot at the host cluster level and copy the snapshot into each per-PR schema on creation. A pg_dump snapshot of a 500 MB database restores in 45-60 seconds, compared with 10 minutes or more of application-level seeding.
The platform-engineering work of building golden paths for self-service infrastructure applies directly here: the ephemeral environment spec is a golden path. Engineers should not configure it. They should invoke it. The moment configuration leaks to the application team, the three failure modes above multiply. Keep the environment spec in the platform team’s Terraform module. Application teams pass one input: the container image tag.