Posts tagged sre.

ZopDev writing tagged sre. Engineering and FinOps notes, post-mortems, and benchmarks.

terraform

The Alert Fatigue Problem in Cloud Policy Management

Traditional cloud alerting creates more work than it prevents because engineers spend 60-90 minutes per day triaging notifications that describe problems without fixing them. The mechanism is…

Muskan Bandta May 19 · 17 min

cloudops

Kubernetes MTTR: From 43 Minutes to 9 With Structured Runbooks

The median Kubernetes incident takes 43 minutes to resolve. Eight minutes of that is the actual fix. The other 35 minutes is engineers…

Riya Mittal May 13 · 9 min

cloudops

Live Kubernetes Visibility: 21 Resource Pages and the Crashloop Overview

A 500-pod cluster has one pod that restarted three times in the last 10 minutes. The operator on call does not know which pod. returns 500 lines of and a handful of interleaved through them. Finding…

Muskan Bandta May 11 · 11 min

cloudops

Closed-Loop SRE for Kubernetes: Auto-Remediating Pod Crashloops Before the On-Call Pages

The 3am page is rarely about something that needs a human. The on-call gets paged at 03:14 because a pod has crashlooped four times in five minutes. They open Slack, look at the logs, see "OOMKilled"…

Muskan Bandta May 7 · 9 min

cloud-automation

Closed-Loop Cloud Remediation: How Autonomous Policies Replace On-Call Runbooks

The average remediation event takes 47 minutes in runbook-driven ops. The fix takes 4. Closed-loop remediation eliminates the overhead — here's the full technical architecture and how to start with your first policy.

Riya Mittal Apr 20 · 8 min

Systems That Scale Podcast: EP1 (The AI Shift in DevOps and SRE)

Real lessons from DevOps at scale. Episode 1 of Systems That Scale covers SRE breakdowns, operational complexity, and the rise of AI-driven reliability.

Talvinder Singh Dec 17 · 3 min

Get the weekly in your inbox.

One post a week. Sundays. No "10 ways to think about cloud" listicles, just the engineering and FinOps notes we'd want to read.

Founded 2024.

Stop watching the waste.
Start cutting it.