The All-Too-Familiar Pain of Being On-Call

It’s 3 AM. An alert jolts you awake. A critical service is down. Your mind races as you scramble to open your laptop. The next hour is a frantic scavenger hunt across a dozen browser tabs. You check ServiceNow for the incident ticket. Grafana provides cryptic logs. You look at GitHub to find the code that was last deployed. Every minute of this manual toil increases downtime and pressure.
This chaotic, high-stakes process is a reality for most Site Reliability (SRE) and DevOps teams. But what if we could build an AI-powered sidekick to do the heavy lifting? That was the core idea behind “Log Sherlock,” a project I built for an AIOps hackathon.
The Vision: One-Click Root Cause Analysis
The goal was simple yet ambitious: transform the incident response workflow from a manual investigation into a single automated action. The system is designed to autonomously:

Fetch Incident Details: Pulls context like the affected service and error messages from an incident management tool.
Gather Evidence: Queries Grafana for relevant production logs from the time of the incident.
Find the Culprit Code: Searches the correct GitHub repository for the source code related to the error message.
Perform AI Analysis: Feeds all this context—incident details, logs, and code—to a Large Language Model using a highly specialized prompt.
Deliver the Verdict: It presents a concise and actionable root cause analysis directly to the user. This is done in a clean web UI powered by Flask.
How It Works: The Special Sauce is in the Prompt
The project’s intelligence isn’t just about calling an LLM; it’s about turning a general-purpose model into a specialized SRE.
The final system prompt instructs the AI to act as an “expert software incident analyst and Site Reliability Engineer.” It provides strict guidelines for evidence-based reasoning. The output must be a structured markdown format. This ensures the analysis is not a vague summary but a technical, actionable report.
The Future: Towards Autonomous Agents
This hackathon project lays the groundwork for a more powerful AIOps platform. This suggests a future where autonomous agents could diagnose incidents. They could also draft the code for the fix, truly revolutionizing the on-call experience.
Intelligently combining existing tools with specialized LLM prompts allows us to build powerful AI sidekicks. These AI sidekicks fundamentally improve the way we build and maintain software.
Leave a comment