The Problem I Was Actually Solving

I want to start this series with something unrelated to Kubernetes, FastAPI, or Terraform. Before any of that, there was a question that kept bothering me, and SentinelAI was my answer to it.

Here's the question: why do most monitoring dashboards leave you stuck at exactly the moment you need help the most?

You've probably seen this screen a hundred times. CPU utilisation: 74%. Memory usage: 61%. Pods running: 42 out of 45. It looks like information. It feels like you're "monitoring" something. But if a pod is crash-looping right now and you're staring at this screen, none of those numbers tells you what's actually wrong or what you should do about it. You still have to open a terminal, run kubectl describe, scroll through logs, cross-reference events, and basically do detective work under pressure. The dashboard handed you data and walked away.

I came into DevOps by an unusual route. I spent close to two and a half years in BPO operations at Concentrix, processing insurance claims on a mainframe system called RUMBA. If you've never touched a mainframe, just know that the entire job was about precision under process: follow the steps, flag the exception, escalate correctly, don't break the chain. I left that world to teach myself cloud and DevOps, mostly at night, mostly alone, through the IIT Roorkee x Intellipaat program and later the TrainWithShubham 90 Days of DevOps challenge. I scored 119 out of 120 on the final quiz, which still makes me a little proud, not gonna lie.

But here's the thing nobody tells you when you're learning DevOps from tutorials: almost every beginner project teaches you to build the monitoring stack. Spin up Prometheus, install Grafana, hook up some dashboards, ship it. Nobody really stops to ask whether the dashboard you just built actually helps a human decide what to do at 3 AM when something's on fire. I built a few of those projects myself, and every single time, I noticed the same gap. The tooling was correct. The instrumentation was correct. But the moment something actually broke, the dashboard handed me a wall of numbers and said "Good luck."

That gap is the entire reason SentinelAI exists.

I didn't want to build another tool that tells you what's happening. I wanted something that tells you what broke, where it broke, how bad it is, and what to do about it in the next five minutes — because that's the actual question an operator asks during an incident. Nobody pages into a Slack channel at 2 AM asking "what's our CPU utilisation?" They ask, "What's on fire and what do I do?" That's a decision question, not a metrics question, and almost nothing in the beginner-to-intermediate DevOps tooling world is built around answering decision questions. Most things stop at the metric.

So I set myself a constraint while building this: every layer of SentinelAI has to move the operator one step closer to a decision, not just one step closer to a number. A Kubernetes cluster runs, things fail (pods crash, images fail to pull, containers restart endlessly), and instead of just showing you that this is happening, the system should be able to say — here's the namespace that's degraded, here's the pod causing it, here's the evidence from its logs and events, and here's a root cause with a recommended fix, generated automatically. No manual log grepping. No tribal knowledge required. That became the spine of the whole project: discover the failure, score the damage, investigate the evidence, and hand back an answer instead of a graph.
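
To make that spine a little more concrete, here is a minimal Python sketch of the discover, score, investigate, answer flow. It is not the SentinelAI code: every name in it (PodFailure, SEVERITY, triage, the keyword list, the scoring weights) is a hypothetical stand-in, the input is a plain list of dicts rather than a live cluster, and the root-cause and recommended-fix step is left out entirely. It only illustrates the idea of handing back a ranked answer with evidence attached instead of a raw number.

```python
from dataclasses import dataclass, field

# Hypothetical severity weights per failure reason (illustrative values only).
SEVERITY = {"CrashLoopBackOff": 3, "ImagePullBackOff": 2, "Error": 1}


@dataclass
class PodFailure:
    namespace: str
    pod: str
    reason: str
    restarts: int = 0
    evidence: list[str] = field(default_factory=list)  # raw log lines and events


def discover(pods: list[dict]) -> list[PodFailure]:
    """Stage 1: keep only pods whose reason is a known failure state."""
    return [
        PodFailure(p["namespace"], p["name"], p["reason"],
                   p.get("restarts", 0), p.get("evidence", []))
        for p in pods
        if p.get("reason") in SEVERITY
    ]


def score(failure: PodFailure) -> int:
    """Stage 2: base weight for the reason, plus 1 per 5 restarts (at most 5 extra)."""
    return SEVERITY[failure.reason] + min(failure.restarts // 5, 5)


def investigate(failure: PodFailure) -> list[str]:
    """Stage 3: keep only the evidence lines that look like a failure signal."""
    keywords = ("error", "failed", "killed")
    return [line for line in failure.evidence
            if any(k in line.lower() for k in keywords)]


def triage(pods: list[dict]) -> list[dict]:
    """Stage 4: return failures worst-first, each with its score and evidence."""
    failures = sorted(discover(pods), key=score, reverse=True)
    return [
        {"namespace": f.namespace, "pod": f.pod, "reason": f.reason,
         "score": score(f), "evidence": investigate(f)}
        for f in failures
    ]


# Made-up sample input; a real system would read this from the cluster.
sample_pods = [
    {"namespace": "web", "name": "img-2", "reason": "ImagePullBackOff",
     "evidence": ["Failed to pull image"]},
    {"namespace": "web", "name": "front-1", "reason": None},
    {"namespace": "payments", "name": "api-7f9c", "reason": "CrashLoopBackOff",
     "restarts": 12,
     "evidence": ["Starting server", "Error: connection refused"]},
]

for item in triage(sample_pods):
    print(item["namespace"], item["pod"], item["reason"], item["score"])
```

The point of the shape is that the healthy pod never shows up in the output, and the crash-looping one comes first with the one log line that matters, which is the difference between a metric and the start of a decision.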

I also made a personal bet here, which I'll talk about properly later in the series: that this answer-generation step could run on a local LLM instead of a cloud API. Partly because of cost, partly because of data privacy (you really don't want pod logs and cluster events leaving your network to a third-party API), and partly because I wanted to prove to myself that a 7-billion-parameter model running on Ollama could actually do useful, structured reasoning about real infrastructure failures. More on that in the AI RCA post.

What you'll see over the next seventeen posts is basically the full build, layer by layer, in the order I actually built it — the FastAPI backend, the Docker hardening, the Kubernetes environments, the CI and security pipelines, the move to AWS EKS, the observability stack, the anomaly detection math, the React dashboard, the health scoring engine, the AI RCA engine itself, and finally GitOps tying it all together with ArgoCD. I'm not going to dump code in these posts — there's a GitHub repo for that. What I want to walk through here is the reasoning behind every decision: why this tool and not that one, why this structure and not a simpler one, where I got it right, and where I'd genuinely do it differently if I started over today.

If there's one sentence I want you to take from Day 1, it's this: a monitoring tool that only tells you something is wrong has done half the job. The other half — telling you what's wrong and what to do about it — is the part almost nobody builds, and it's the part that actually matters when you're the one holding the pager.

Tomorrow, we start with the part everything else sits on top of: the FastAPI backbone, and why I designed the very first set of endpoints the way I did.