ZopDay · Kubernetes View · LIVE

Every cluster, every namespace, one view.

A live cross-account topology of your Kubernetes estate — EKS, GKE, AKS, and self-managed. Node health, pod state, workload cost, drift, autoscaler signal — reconciled against the live cluster API every 60 seconds. The same view the SRE on-call uses at 2 a.m.

60s reconcile interval
3 clouds (EKS · GKE · AKS)
450+ drift & cost rules
0 mutations without policy

01 · overview

The same picture the on-call SRE has.

Most cost dashboards stop at the cloud bill. Most observability stacks stop at logs and traces. Neither tells you what your cluster is actually doing right now. Kubernetes View sits in the middle — it reads the cluster state directly via the Kubernetes API, joins it to billing data, and renders it as a single live topology.

Use it to answer questions like:

  • Which namespace is burning the most CPU this hour, and which deployment inside it is responsible?
  • Which pods have been pending for more than 15 minutes, and why?
  • Where are HPAs flapping? Where are VPA and HPA fighting over the same workload?
  • Which clusters have drifted from their target node pool size since the last deploy?
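As a rough illustration of the pending-pod question above, a check might look like the sketch below. It assumes pod records have already been fetched into plain dictionaries. The field names (`phase`, `created`, `reason`) and the `long_pending` helper are hypothetical, not Kubernetes View's API, and creation time is used as a stand-in for the moment the pod entered Pending.

```python
from datetime import datetime, timedelta, timezone

def long_pending(pods, now, threshold=timedelta(minutes=15)):
    """Return (namespace, name, reason, age) for pods Pending longer than threshold."""
    stuck = []
    for pod in pods:
        if pod["phase"] != "Pending":
            continue
        age = now - pod["created"]  # assumption: creation time approximates pending start
        if age > threshold:
            stuck.append((pod["namespace"], pod["name"], pod.get("reason", "unknown"), age))
    # Oldest pending pods first
    return sorted(stuck, key=lambda row: row[3], reverse=True)

# Hypothetical sample data
now = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
pods = [
    {"namespace": "billing-prod", "name": "api-7f9c", "phase": "Pending",
     "created": now - timedelta(minutes=42), "reason": "Insufficient memory"},
    {"namespace": "billing-prod", "name": "api-2b1d", "phase": "Running",
     "created": now - timedelta(hours=3)},
    {"namespace": "search-dev", "name": "indexer-0", "phase": "Pending",
     "created": now - timedelta(minutes=5), "reason": "Insufficient cpu"},
]
stuck_pods = long_pending(pods, now)
```

Only `api-7f9c` is returned here: the other Pending pod is still under the 15-minute threshold.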

Read-only by default

Connect a cluster with read scopes and you get the full view. Mutating actions (cordon, drain, scale, schedule) require an explicit policy grant and are admin-gated.

02 · topology

One canvas, every cluster.

The topology panel is the entry point. It groups your estate by provider → account → region → cluster → namespace → workload. Every node in the graph (a provider, a cluster, a workload, and so on) carries live status: Kubernetes node count, pod count, CPU/memory pressure, cost per hour, and drift count.

| Level | What it shows |
|---|---|
| Provider | AWS / GCP / Azure / self-managed totals: clusters, monthly cost, health. |
| Account / Project | Per-account cluster count, region spread, top spenders. |
| Cluster | Node groups, pool autoscaler state, control-plane version, last reconcile timestamp. |
| Namespace | Cost, pod count, restart rate, OOMKilled count, pending pods. |
| Workload | Deployment / StatefulSet / DaemonSet detail: replicas, HPA target, VPA recommendation, last-deploy SHA. |

Click any node and the side panel opens with the live state, the 24-hour trend, and the open audit findings. No tab switching.
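One way to picture the provider → account → region → cluster → namespace → workload grouping is as a roll-up over workload records, where each level's totals are the sum of everything beneath it. The sketch below is illustrative only: the record fields, the `roll_up` helper, and the sample identifiers are assumptions, not the product's data model.

```python
from collections import defaultdict

LEVELS = ("provider", "account", "region", "cluster", "namespace", "workload")

def roll_up(workloads):
    """Aggregate pod count and hourly cost at every level of the hierarchy."""
    totals = defaultdict(lambda: {"pods": 0, "cost_per_hour": 0.0})
    for w in workloads:
        path = tuple(w[level] for level in LEVELS)
        # Add this workload to each ancestor: ("aws",), ("aws", "111"), ...
        for depth in range(1, len(path) + 1):
            node = totals[path[:depth]]
            node["pods"] += w["pods"]
            node["cost_per_hour"] += w["cost_per_hour"]
    return dict(totals)

# Hypothetical sample data
workloads = [
    {"provider": "aws", "account": "111", "region": "us-east-1",
     "cluster": "prod-us-east", "namespace": "billing-prod",
     "workload": "api", "pods": 3, "cost_per_hour": 0.5},
    {"provider": "aws", "account": "111", "region": "us-east-1",
     "cluster": "prod-us-east", "namespace": "billing-prod",
     "workload": "worker", "pods": 2, "cost_per_hour": 0.25},
    {"provider": "gcp", "account": "proj-a", "region": "europe-west1",
     "cluster": "staging", "namespace": "search-dev",
     "workload": "indexer", "pods": 1, "cost_per_hour": 0.125},
]
totals = roll_up(workloads)
```

Looking up `totals[("aws",)]` gives the provider-level figures, while the full six-element key gives a single workload.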

03 · workload cost

Cost down to the pod.

The cluster bill is rarely the question. The question is usually: "which team's workloads moved the number last week?"

Kubernetes View allocates spend three ways:

By owner

Joins live pod metadata to your Auto Tagging dictionary (team, env, service). Surfaces unowned pods so they don’t silently roll up into "shared infrastructure".
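A minimal sketch of owner-based allocation, assuming each pod record already carries its labels and an hourly cost. The `cost_by_owner` helper, the `team` label key, and the `UNOWNED` bucket name are illustrative assumptions.

```python
from collections import defaultdict

def cost_by_owner(pods, owner_label="team"):
    """Sum hourly cost per owner and list pods that have no owner label."""
    totals = defaultdict(float)
    unowned = []
    for pod in pods:
        owner = pod["labels"].get(owner_label)
        if owner is None:
            # Keep unowned spend visible instead of folding it into a shared bucket
            unowned.append(pod["name"])
            owner = "UNOWNED"
        totals[owner] += pod["cost_per_hour"]
    return dict(totals), unowned

# Hypothetical sample data
pods = [
    {"name": "api-7f9c", "labels": {"team": "payments"}, "cost_per_hour": 0.5},
    {"name": "api-2b1d", "labels": {"team": "payments"}, "cost_per_hour": 0.25},
    {"name": "debug-shell", "labels": {}, "cost_per_hour": 0.125},
]
owner_totals, unowned_pods = cost_by_owner(pods)
```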

By workload type

Splits batch jobs, long-running services, sidecars, and DaemonSets. A noisy sidecar costing $14K/month shouldn’t hide inside the parent service.

By scheduler decision

Tracks how much of your hourly bill comes from Spot vs On-Demand vs Reserved capacity, and how many pods are evicted per hour. If your Spot mix is silently degrading, this is where you see it first.
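The Spot versus On-Demand split can be thought of as each capacity type's share of hourly spend. The sketch below assumes node records with a hypothetical `capacity_type` field and an hourly cost; how the product actually derives capacity type is not described here.

```python
def capacity_mix(nodes):
    """Return each capacity type's share of hourly spend as a fraction of the total."""
    spend = {}
    for node in nodes:
        kind = node["capacity_type"]
        spend[kind] = spend.get(kind, 0.0) + node["cost_per_hour"]
    total = sum(spend.values())
    if total == 0:
        return {}
    return {kind: cost / total for kind, cost in spend.items()}

# Hypothetical sample data
nodes = [
    {"name": "node-a", "capacity_type": "spot", "cost_per_hour": 1.0},
    {"name": "node-b", "capacity_type": "spot", "cost_per_hour": 0.5},
    {"name": "node-c", "capacity_type": "on-demand", "cost_per_hour": 1.0},
]
mix = capacity_mix(nodes)
```

Comparing this mix hour over hour is what would reveal a Spot share that is quietly shrinking.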

Why this matters

"Our Kubernetes bill went up 18% this month" is unactionable. "Three namespaces in the prod-us-east cluster moved from 60% Spot to 12% Spot after the Karpenter consolidation rule changed" is fixable in 20 minutes.

04 · drift detection

What changed since the last green deploy.

Kubernetes clusters drift constantly — HPAs adjust, autoscalers add nodes, operators rotate. Most of it is fine. Some of it costs you money or quietly breaks SLOs.

Kubernetes View runs continuous drift detection on:

  • Node pool size — current vs target, with explanation (autoscaler vs manual vs Karpenter consolidation).
  • HPA flapping — scale events crossing the same boundary more than 6 times in an hour.
  • VPA contention — pods where VPA wants to resize but HPA is also active.
  • Restart loops — pods with >3 restarts in the last hour, grouped by CrashLoopBackOff cause.
  • Stuck rollouts — Deployments where the new ReplicaSet has been Progressing for more than 30 minutes.
  • Untagged workloads — pods landing in a namespace without the required owner label.
  • Image bloat — new image >25% larger than the prior one, with a link to the diff.
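The HPA flapping rule in the list above can be sketched as a crossing counter. This is one possible reading of "crossing the same boundary", in which a boundary sits between b and b+1 replicas. The event shape and the `flapping_boundaries` helper are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def flapping_boundaries(events, now, window=timedelta(hours=1), max_crossings=6):
    """events: (timestamp, old_replicas, new_replicas) scale events.

    Boundary b sits between b and b+1 replicas. A scale event crosses it
    when min(old, new) <= b < max(old, new). Returns the boundaries crossed
    more than max_crossings times inside the window.
    """
    crossings = Counter()
    for ts, old, new in events:
        if now - ts > window:
            continue  # outside the look-back window
        for boundary in range(min(old, new), max(old, new)):
            crossings[boundary] += 1
    return {b: n for b, n in crossings.items() if n > max_crossings}

# Hypothetical sample: an HPA bouncing between 3 and 5 replicas every 5 minutes
now = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
events = [
    (now - timedelta(minutes=5 * (i + 1)), *((3, 5) if i % 2 == 0 else (5, 3)))
    for i in range(8)
]
events.append((now - timedelta(hours=2), 3, 5))  # too old, ignored
flapping = flapping_boundaries(events, now)
```

Eight scale events in 40 minutes cross both the 3/4 and 4/5 boundaries, so both are reported.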

Every drift finding carries severity, projected dollar impact, and a one-click jump to the workload, the audit rule, and the suggested remediation.
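Several of the rules listed in this section are simple threshold checks. The sketch below restates three of them as predicates using the thresholds quoted in the list. The function names and input shapes are hypothetical.

```python
from datetime import timedelta

def restart_loop(restarts_last_hour):
    """Restart loops: more than 3 restarts in the last hour."""
    return restarts_last_hour > 3

def stuck_rollout(progressing_for):
    """Stuck rollouts: new ReplicaSet Progressing for more than 30 minutes."""
    return progressing_for > timedelta(minutes=30)

def image_bloat(previous_bytes, new_bytes):
    """Image bloat: new image more than 25% larger than the prior one."""
    if previous_bytes <= 0:
        return False  # no baseline to compare against
    return (new_bytes - previous_bytes) / previous_bytes > 0.25

# Hypothetical usage
findings = []
if restart_loop(5):
    findings.append("restart-loop")
if stuck_rollout(timedelta(minutes=45)):
    findings.append("stuck-rollout")
if image_bloat(400_000_000, 520_000_000):  # a 30% increase
    findings.append("image-bloat")
```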

05 · scheduler signal

Karpenter, Cluster Autoscaler, KEDA. One panel.

The autoscaler is usually the single biggest cost lever in a cluster, and the single most opaque component. Kubernetes View surfaces the scheduler decision stream as a first-class panel:

| Signal | What you see |
|---|---|
| Karpenter provisioning events | Why a node was added, which pods triggered it, instance type, hourly cost. |
| Cluster Autoscaler scale-down | Eligible nodes, reasons nodes are blocked from scale-down (PDBs, system pods, kubelet config). |
| KEDA scaler events | External-metric scalers (SQS, Kafka, Cron) and their current trigger thresholds. |
| Spot interruptions | Last 24 hours of interruptions, per instance type, with workload impact. |
| Pending pod reasons | FailedScheduling events grouped by reason (insufficient memory, taints, affinity). |

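Grouping FailedScheduling events by reason could be approximated with keyword matching on the event message, as sketched below. The keyword table and helper names are assumptions. The sample messages only imitate the general shape of scheduler output, and a real message can list several reasons at once, which this first-match classifier does not handle.

```python
from collections import Counter

# Assumption: a small keyword table is enough for illustration
REASON_KEYWORDS = [
    ("insufficient memory", "insufficient memory"),
    ("insufficient cpu", "insufficient cpu"),
    ("taint", "taints"),
    ("affinity", "affinity"),
]

def classify(message):
    """Map a scheduler message to the first matching reason bucket."""
    lowered = message.lower()
    for keyword, reason in REASON_KEYWORDS:
        if keyword in lowered:
            return reason
    return "other"

def group_pending_reasons(events):
    """Count FailedScheduling events per reason bucket."""
    return Counter(
        classify(e["message"]) for e in events if e["reason"] == "FailedScheduling"
    )

# Hypothetical sample events
events = [
    {"reason": "FailedScheduling", "message": "0/5 nodes are available: 5 Insufficient memory."},
    {"reason": "FailedScheduling", "message": "0/5 nodes are available: 5 Insufficient memory."},
    {"reason": "FailedScheduling", "message": "0/3 nodes are available: 3 node(s) had untolerated taint."},
    {"reason": "FailedScheduling", "message": "0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector."},
    {"reason": "Scheduled", "message": "Successfully assigned default/api to node-a"},
]
reasons = group_pending_reasons(events)
```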

06 · safety model

Read-only is the default. Always.

Kubernetes View ships with three discrete permission tiers. Most customers run forever on Tier 1.

Tier 1 · Read-only

Kubernetes API verbs: get, list, watch. Billing read scopes on the cloud account. No mutations. Sufficient for the entire topology, cost, and drift surface.
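Before connecting a cluster, it may be useful to confirm that the role you grant really is limited to get, list, and watch. The sketch below checks rules shaped like Kubernetes RBAC rules (`apiGroups`, `resources`, `verbs`) held as plain dictionaries. The `mutating_verbs` helper is hypothetical and not something the product ships.

```python
READ_ONLY_VERBS = {"get", "list", "watch"}

def mutating_verbs(rules):
    """Return every verb in the rules that falls outside get/list/watch.

    A wildcard verb ("*") is reported too, since it grants write access.
    """
    found = set()
    for rule in rules:
        found |= set(rule["verbs"]) - READ_ONLY_VERBS
    return found

# Hypothetical sample rules
tier1_rules = [
    {"apiGroups": [""], "resources": ["pods", "nodes", "events"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["deployments", "statefulsets"],
     "verbs": ["get", "list", "watch"]},
]
too_broad = tier1_rules + [
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["get", "patch"]},
]
```

An empty result from `mutating_verbs(tier1_rules)` indicates the role is read-only in the Tier 1 sense.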

Tier 2 · Guided

Adds the ability to propose a remediation (scale, schedule, drain) into the policy console. Execution still requires explicit human approval. Useful when an on-call engineer wants the platform to draft the kubectl command for them.

Tier 3 · Policy-driven

The platform executes within an explicitly-scoped policy — e.g. "after 21:00 IST on non-prod namespaces, scale ReplicaSets matching label env=dev to zero, unless the namespace carries the label keep-on=true." Every action is admin-gated, scoped, and logged with actor + timestamp + diff.
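The example policy quoted above can be read as a small decision function. The sketch below is only an interpretation: it assumes "non-prod" means the namespace is not labelled `env=prod`, and that "after 21:00 IST" runs until local midnight, neither of which is stated in the policy text. The function name is hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))  # India Standard Time, UTC+05:30

def should_scale_to_zero(now, namespace_labels, workload_labels):
    """Decide whether the example after-hours policy applies to a workload.

    `now` must be a timezone-aware datetime.
    """
    if now.astimezone(IST).time() < time(21, 0):
        return False  # before 21:00 IST
    if namespace_labels.get("env") == "prod":
        return False  # assumption: this label marks a prod namespace
    if namespace_labels.get("keep-on") == "true":
        return False  # explicit opt-out on the namespace
    return workload_labels.get("env") == "dev"

# Hypothetical usage: 16:00 UTC is 21:30 IST
evening = datetime(2024, 1, 1, 16, 0, tzinfo=timezone.utc)
decision = should_scale_to_zero(evening, {"env": "dev"}, {"env": "dev"})
```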

What we never touch

Customer-managed CRDs, operator-managed StatefulSets, and any workload labelled zopdev.io/protected=true. The platform refuses, by design. Section 6 of the CDCR whitepaper goes into the architectural rationale.
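A refusal guard of this kind might be sketched as follows. Only the `zopdev.io/protected=true` label comes from the text above. The heuristics for spotting a custom resource (a kind outside a fixed built-in set) and an operator-managed StatefulSet (one with owner references) are assumptions made for illustration, as are the field names.

```python
from typing import Optional

PROTECTED_LABEL = "zopdev.io/protected"
# Assumption: anything outside this set is treated as a custom resource
BUILT_IN_KINDS = {"Deployment", "StatefulSet", "DaemonSet", "ReplicaSet", "Job", "CronJob"}

def refusal_reason(workload) -> Optional[str]:
    """Return why a mutation must be refused, or None if it may proceed."""
    labels = workload.get("labels", {})
    if labels.get(PROTECTED_LABEL) == "true":
        return "labelled zopdev.io/protected=true"
    if workload["kind"] not in BUILT_IN_KINDS:
        return "customer-managed custom resource"
    # Assumption: an owned StatefulSet is being managed by an operator
    if workload["kind"] == "StatefulSet" and workload.get("owner_references"):
        return "operator-managed StatefulSet"
    return None

# Hypothetical usage
protected = {"kind": "Deployment", "labels": {"zopdev.io/protected": "true"}}
plain = {"kind": "Deployment", "labels": {"env": "dev"}}
```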

07 · integrations

Already in your stack.

| Source | What we read |
|---|---|
| Kubernetes API (any conformant cluster) | Workloads, pods, nodes, events, metrics-server, HPA/VPA state. |
| EKS / GKE / AKS control planes | Cluster version, addon state, node pool config, control-plane logs. |
| Karpenter / Cluster Autoscaler / KEDA | Provisioner CRDs, scaler events, decisions. |
| Prometheus / OpenTelemetry | Workload-level CPU/memory/network. Optional; falls back to metrics-server. |
| AWS CUR / GCP BigQuery billing / Azure Cost Mgmt | Hourly cost allocation joined to live pod state. |
| GitHub / GitLab | Last-deploy SHA, image source, blame link on drift findings. |
| Slack / PagerDuty | Drift alerts, budget burn-down, weekly digest. |

08 · faq

Common questions.

Do you require an agent in the cluster?

No. The default mode uses the Kubernetes API directly via an IAM role / workload identity. An optional in-cluster agent is available for sub-30s reconcile intervals and richer container metadata.

What’s the latency?

API-mode reconciles every 60 seconds by default (configurable to 30s). Agent-mode streams events in real time. The UI itself updates within ~500ms of a cluster event.
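Conceptually, a polling reconcile compares the snapshot from the previous interval with the current one. The sketch below shows that comparison on plain dictionaries. The snapshot shape and the `diff_snapshots` helper are assumptions and say nothing about how the product implements it.

```python
def diff_snapshots(previous, current):
    """Compare two {key: state} snapshots taken one reconcile interval apart."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        key for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    )
    return {"added": added, "removed": removed, "changed": changed}

# Hypothetical snapshots keyed by namespace/workload
previous = {
    "billing-prod/api": {"replicas": 3},
    "billing-prod/worker": {"replicas": 2},
    "search-dev/indexer": {"replicas": 1},
}
current = {
    "billing-prod/api": {"replicas": 5},
    "billing-prod/worker": {"replicas": 2},
    "search-dev/crawler": {"replicas": 1},
}
delta = diff_snapshots(previous, current)
```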

Multi-tenancy?

Yes. Org → team → namespace scoping is enforced server-side. A team that owns billing-prod sees only its own namespaces; the FinOps lead sees the whole estate.
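Server-side scoping of this sort amounts to filtering the namespace list by the caller's team membership before anything is returned. A minimal sketch, assuming hypothetical `role` and `teams` fields on the user and a `team` owner on each namespace:

```python
def visible_namespaces(user, namespaces):
    """Return the namespaces a user may see.

    Assumption: an org-wide role sees everything, everyone else sees only
    namespaces owned by one of their teams.
    """
    if user["role"] == "org-wide":
        return list(namespaces)
    return [ns for ns in namespaces if ns["team"] in user["teams"]]

# Hypothetical sample data
namespaces = [
    {"name": "billing-prod", "team": "billing"},
    {"name": "billing-dev", "team": "billing"},
    {"name": "search-dev", "team": "search"},
]
billing_engineer = {"role": "member", "teams": ["billing"]}
finops_lead = {"role": "org-wide", "teams": []}
```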

Self-managed clusters (kops, kubeadm, RKE2)?

Supported. Anything that exposes a conformant Kubernetes API works. The cloud-specific integrations (CUR, billing) are optional.

How does this relate to the rest of ZopDev?

Kubernetes View is the live operational surface inside ZopDay. The cost lens is shared with ZopNight. The topology graph is shared with ZopCloud. One inventory, three lenses.

Multi-cloud automation · Production-ready in 30 min · SOC 2 · ISO 27001 · zero-trust · 30% average cloud cost cut · 4 platforms · 1 console