We Built an AI That Watches Our AI: The Ops Observer
✍️ This post was written collaboratively by Arun Batchu and Cascade, the AI pair programmer that built this system alongside him.
Shilpiworks runs seven specialized AI agents around the clock: Stoic philosophy quotes, women's wisdom, Indigenous proverbs, Marshall Goldsmith leadership quotes, Scientific Wonder, Minnesota Wildlife, and a general-purpose agent. Together they generate and publish stickers automatically, 24 hours a day.
The problem with autonomous systems is that they fail quietly. An API rate limit, a model timeout, a database connection blip — any of these can cause an agent to fail without anyone noticing. We'd check the `AgentRun` table the next morning and find a string of failures from 3 AM with no alert, no notification, nothing.
The solution was obvious in retrospect: build an eighth agent whose only job is to watch the other seven.
The Architecture
The Ops Observer is a Mastra workflow with three steps, running twice a day via Vercel cron:
1. Gather Data — Query the `AgentRun` table for all runs in the last 12 hours, count failures, and group identical error messages by frequency.
2. Diagnose — If 2 or more failures are detected, pass the error log to Gemini with a prompt asking it to act as a senior SRE. It returns a structured diagnosis: is this a systemic issue or random noise? What is likely breaking? What should the developer check first?
3. Alert — If the diagnosis flags a systemic anomaly, send an email via Resend with the error summary, the AI diagnosis, and suggested next steps.
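The grouping in the gather step is a simple frequency map. A minimal sketch (the `AgentRunRow` fields here are illustrative assumptions, not the actual schema):

```typescript
// Illustrative shape of a row from the AgentRun table.
interface AgentRunRow {
  agentType: string;
  status: 'success' | 'failure';
  error?: string;
}

// Group identical error messages and sort by frequency, most common first.
function groupErrors(runs: AgentRunRow[]): { message: string; count: number }[] {
  const counts = new Map<string, number>();
  for (const run of runs) {
    if (run.status === 'failure' && run.error) {
      counts.set(run.error, (counts.get(run.error) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .map(([message, count]) => ({ message, count }))
    .sort((a, b) => b.count - a.count);
}
```

Grouping before diagnosis matters: a model reasons better over "3x: rate limit exceeded" than over three near-identical stack traces.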
```typescript
// The workflow chain — simple and readable
const opsObserverWorkflow = createWorkflow({ id: 'ops-observer', ... })
  .then(gatherDataStep)
  .map(async ({ getStepResult }) => {
    const data = getStepResult('gather-data');
    return { failedRuns: data.failedRuns, errors: data.errors, ... };
  })
  .then(diagnoseStep)
  .map(async ({ getStepResult }) => {
    const diag = getStepResult('diagnose-issues');
    return { hasAnomaly: diag.hasAnomaly, diagnosis: diag.diagnosis, ... };
  })
  .then(alertStep)
  .commit();
```

The Threshold Design
One of the most important design decisions was the failure threshold. A single failure in 12 hours is almost certainly random — a transient API timeout, a momentary network blip. Alerting on every single failure would create noise and train us to ignore the alerts.
We set the threshold at 2 or more failures before triggering the diagnosis step. Below that, the observer logs the data and exits silently. At 2+, it escalates to Gemini for diagnosis. This keeps the signal-to-noise ratio high.
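The gate itself is one comparison; the sketch below shows the shape (the constant and function names are ours, not from the actual codebase):

```typescript
// Below this many failures in a 12-hour window, the observer stays silent.
const FAILURE_THRESHOLD = 2;

// Decide whether to escalate a window of runs to the diagnosis step.
function shouldDiagnose(failedRuns: number): boolean {
  return failedRuns >= FAILURE_THRESHOLD;
}
```

Keeping the threshold a named constant makes the trade-off explicit and easy to tune as the fleet grows.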
Design principle: An alert system that cries wolf trains humans to ignore it. The threshold is as important as the detection logic.
AI Diagnosing AI
The most interesting part of the system is the diagnosis step. We pass Gemini the raw error log — grouped by frequency, with agent type and timestamp context — and ask it to reason about what's happening. Is this a systemic issue (e.g., all agents failing with the same OpenAI error) or isolated noise (one agent failing with a unique error)?
The output is a structured JSON object with three fields: `hasAnomaly` (boolean), `diagnosis` (a plain-English explanation of what's breaking and why), and `potentialPatches` (1-2 specific suggested fixes, including file paths when obvious). This gets formatted into an HTML email and sent via Resend.
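That three-field shape can be pinned down as a TypeScript type with a small runtime guard. The field names match the post; the guard itself is an illustrative sketch, since model output should never be trusted to be well-formed:

```typescript
// The structured diagnosis object returned by the diagnose step.
interface Diagnosis {
  hasAnomaly: boolean;
  diagnosis: string;
  potentialPatches: string[]; // 1-2 specific suggested fixes
}

// Defensive parse of the model's JSON output; throws on malformed shapes.
function parseDiagnosis(raw: string): Diagnosis {
  const obj = JSON.parse(raw);
  if (
    typeof obj.hasAnomaly !== 'boolean' ||
    typeof obj.diagnosis !== 'string' ||
    !Array.isArray(obj.potentialPatches)
  ) {
    throw new Error('Malformed diagnosis payload');
  }
  return obj as Diagnosis;
}
```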
In practice, Gemini is surprisingly good at this. When we had a run of OpenAI rate limit errors, it correctly identified the pattern, noted that all failures were from the same error class, and suggested checking the API quota dashboard and adding exponential backoff. When we had a one-off database connection timeout, it correctly classified it as likely transient and recommended monitoring for recurrence before taking action.
What the Alert Email Looks Like
The email arrives with subject `🚨 [URGENT] Shilpiworks Ops Observer: Production Anomaly Detected` and contains:
- Stats: X failures out of Y total runs in the last 12 hours
- Top errors grouped by frequency (e.g., "3x: OpenAI rate limit exceeded")
- AI Diagnosis: a paragraph explaining the likely root cause
- Suggested Patches: specific code or config changes to investigate
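Assembling that body is plain string templating before the send call. A minimal sketch (the `AlertData` fields are illustrative; the real template is richer):

```typescript
// Inputs for the alert email body; shape is an assumption for illustration.
interface AlertData {
  failed: number;
  total: number;
  topErrors: { message: string; count: number }[];
  diagnosis: string;
  patches: string[];
}

// Build the HTML body handed to the email provider.
function buildAlertHtml(d: AlertData): string {
  const errors = d.topErrors
    .map((e) => `<li>${e.count}x: ${e.message}</li>`)
    .join('');
  const patches = d.patches.map((p) => `<li>${p}</li>`).join('');
  return [
    `<p><strong>${d.failed} failures out of ${d.total} runs in the last 12 hours.</strong></p>`,
    `<ul>${errors}</ul>`,
    `<p>${d.diagnosis}</p>`,
    `<ul>${patches}</ul>`,
  ].join('\n');
}
```

Keeping the formatter pure (no send side effects) makes the email easy to snapshot-test before wiring it to Resend.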
The Broader Lesson
As you add more autonomous agents to a system, observability becomes the hardest problem. Each agent is a black box that runs on a schedule, produces output, and fails silently when something goes wrong. The natural human response is to check the database manually — which doesn't scale.
The Ops Observer pattern — a lightweight monitoring agent that queries your run history, applies AI reasoning to detect anomalies, and routes alerts to a human — is a practical solution that scales with the fleet. As we add more agents, the observer automatically covers them too, since it queries the entire `AgentRun` table regardless of agent type.
The meta-lesson: Autonomous systems need autonomous monitors. If you're building a fleet of AI agents, build the observer before you need it — not after the first silent failure. See the stickers these agents produce at shilpiworks.com →