🚨 Production Fires & Crashes · 🤖 AI Agents & Automation · 🧠 AI Product Development

🏥 Health Tech & AI · 8 min read · Growth Stage · Published April 2026

How a Health-Tech Startup Unblocked a 60-Day AI Stall in 5 Days

Broken LangGraph orchestration, no eval pipeline, zero observability: diagnosed, rebuilt, and shipped to 500 real users in a single sprint.

5 Days to Fix

0% Error Rate

500 Users Shipped To

0 Incidents (30 days)

Executive Summary

The Situation at a Glance

The Problem

AI document pipeline stalled for 60 days
Agents looping with no recursion limit
Shared memory state corrupting outputs
No eval pipeline — every change was a gamble
Zero production observability

What We Built

Full orchestration rebuild in LangGraph
Typed tool results with defined failure modes
Memory isolation — per-session scoped context
150-input evaluation benchmark
Structured per-step logging and cost tracking

The Results

Deployed to 500 users in 5 days
0 production incidents in 30 days
Agent error rate: 45% → 0%
p95 pipeline latency under 200ms
Full cost and error visibility from day one

“The demo worked perfectly. Production didn't. By the time we understood why, two months had passed and we still weren't live.”
— Founder, HealthFlow (anonymised)

The Problem

Why the Demo Worked and Production Did Not

The HealthFlow team had built a 3-agent pipeline that extracted key clinical information from patient intake documents, summarised findings, and flagged anomalies for review. In a controlled environment, it performed well. Six failure modes prevented it from surviving contact with real users.

No Evaluation Pipeline

Every prompt change was shipped blindly. There was no benchmark, no test set, no way to know if a tweak improved or regressed performance.

Impact: 3 silent regressions shipped in 6 weeks — discovered only when users complained

Agent Loops

A misconfigured retry condition caused agents to loop indefinitely on ambiguous inputs. No recursion limit was set.

Impact: A single looping request consumed 100% of the hourly token budget in under 4 minutes
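
The guard that was missing can be sketched in a few lines. This is a minimal, framework-free illustration, not the engagement's code: `run_with_step_limit`, `StepLimitExceeded`, and the two step functions are hypothetical names, and the budget of 15 mirrors the limit the rebuild later enforced. LangGraph exposes the same idea through the `recursion_limit` key in the run config.

```python
class StepLimitExceeded(RuntimeError):
    """Raised when a run uses more steps than its budget allows."""


def run_with_step_limit(step_fn, state, max_steps=15):
    """Call step_fn until it reports completion or the step budget runs out.

    step_fn(state) must return a (new_state, done) tuple.
    """
    for _ in range(max_steps):
        state, done = step_fn(state)
        if done:
            return state
    # Fail loudly instead of looping until the token budget is gone.
    raise StepLimitExceeded(f"no result after {max_steps} steps")


def ambiguous_step(state):
    # Hypothetical agent step that never resolves: it always asks for a retry.
    return state + 1, False


def converging_step(state):
    # Hypothetical agent step that finishes on its third attempt.
    return state + 1, state + 1 >= 3
```

With a hard budget, an ambiguous input costs at most 15 steps and surfaces as an explicit error the caller can handle, rather than as a drained token budget.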

Memory Contamination

Agents shared a global memory store rather than per-session scoped contexts. One user's document context leaked into another's responses.

Impact: HIPAA compliance risk — patient document summaries visible across sessions
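
The difference between a global store and per-session scoping is easiest to see in code. The sketch below is an assumption-laden illustration: `SessionScopedMemory` is a hypothetical in-process class, whereas the real system would back this with its session store and access controls.

```python
from collections import defaultdict


class SessionScopedMemory:
    """Keeps each session's context in its own bucket, keyed by session_id."""

    def __init__(self):
        self._store = defaultdict(list)

    def append(self, session_id, item):
        self._store[session_id].append(item)

    def context(self, session_id):
        # Return a copy so callers cannot mutate the stored context.
        # Using .get avoids creating empty buckets for unknown sessions.
        return list(self._store.get(session_id, []))

    def clear(self, session_id):
        # Drop everything for a session, for example when it ends.
        self._store.pop(session_id, None)
```

Every read and write requires a `session_id`, so there is no code path that returns one user's documents while serving another.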

Missing Fallback Chain

The 3-step pipeline had no fallback at any stage. A timeout on step 2 broke step 3 with no graceful error state returned to the user.

Impact: 100% of requests during a 12-minute API degradation returned unhandled 500 errors
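
A fallback chain of the kind the rebuild introduced (primary model, then smaller model, then graceful error) might look like the following. It is a sketch under stated assumptions: `run_with_fallbacks`, `primary_model`, and `smaller_model` are hypothetical stand-ins, and no real model API is called.

```python
def run_with_fallbacks(handlers, payload):
    """Try each (name, handler) in order; return a graceful error if all fail."""
    errors = []
    for name, handler in handlers:
        try:
            return {"ok": True, "source": name, "result": handler(payload)}
        except Exception as exc:  # broad on purpose: any failure moves to the next tier
            errors.append(f"{name}: {exc}")
    # Every tier failed: return a structured error instead of raising a 500.
    return {"ok": False, "source": None, "result": None, "errors": errors}


def primary_model(text):
    # Hypothetical stand-in for the primary model call; simulates a timeout.
    raise TimeoutError("upstream timed out")


def smaller_model(text):
    # Hypothetical stand-in for the cheaper fallback model.
    return f"summary of {len(text)} characters"
```

The caller always receives a dictionary with an `ok` flag, so a degraded upstream API produces a handled error state rather than an unhandled exception.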

Zero Observability

No structured logging, no cost tracking per request, no alerting. Failures were discovered via Slack messages from users, not monitoring.

Impact: Average time-to-detect a production failure: 47 minutes
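
For comparison, the alert rule adopted in the rebuild (error rate above 2% over a 5-minute window) can be approximated with a small sliding window. This is an illustrative sketch only: `ErrorRateWindow` is a hypothetical class, it assumes timestamps arrive in non-decreasing order, and a production setup would evaluate the rule in the monitoring stack rather than in application code.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks the error rate over the most recent window_seconds of requests."""

    def __init__(self, window_seconds=300.0, threshold=0.02):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # (timestamp, is_error), oldest first

    def record(self, timestamp, is_error):
        self._events.append((timestamp, is_error))
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def error_rate(self):
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_alert(self):
        # Strictly greater than the threshold, matching "error rate >2%".
        return self.error_rate() > self.threshold
```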

Prompt Brittleness

Prompts were tuned on 12 internal test documents. Real users submitted PDFs with scanned text, tables, and non-English content — none of which the prompts handled.

Impact: Over 30% of real-user inputs produced hallucinated or empty summaries

The Compounding Effect

Each failure mode existed independently, but in production they interacted, making the system unreliable in ways that were hard to reproduce and impossible to debug without proper tooling. The team spent 60 days patching symptoms rather than fixing the structural cause, because without observability they had no way to find it.

The Solution

A Production-Grade Rebuild in 5 Days

We did not patch the existing system. A structural diagnosis showed that the original architecture had no path to production reliability — it needed to be rebuilt on correct foundations. The sprint was scoped to three deliverables: a production-safe orchestration layer, an evaluation pipeline, and full observability.

LangGraph Orchestration Rebuild

Complete rewrite of the 3-agent pipeline using LangGraph with production-safe patterns.

Typed ToolResult model with success/error/retryable fields
Recursion limit of 15 steps enforced at graph level
Per-session memory isolation via scoped state objects
Fallback chain: primary model → smaller model → graceful error
Structured retry-with-exponential-backoff on all tool calls

✓ Agent error rate: 45% → 0%

✓ Loop incidents: eliminated

✓ Memory contamination: resolved
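
As a rough illustration of how a typed tool result and a backoff retry fit together, consider the sketch below. The engagement used Pydantic for the typed model; a standard-library dataclass stands in here so the example runs without dependencies. `ToolResult`'s `value` field, `call_with_backoff`, and the default delays are assumptions, not the delivered implementation.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ToolResult:
    """Typed outcome of a tool call (dataclass stand-in for a Pydantic model)."""
    success: bool
    value: Any = None
    error: Optional[str] = None
    retryable: bool = False


def call_with_backoff(
    tool: Callable[[], ToolResult],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> ToolResult:
    """Retry while the tool reports a retryable failure, doubling the delay each time."""
    result = tool()
    for attempt in range(1, max_attempts):
        if result.success or not result.retryable:
            return result
        sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s with the defaults
        result = tool()
    return result
```

Because the tool states explicitly whether a failure is retryable, the retry layer never has to guess from exception text, and non-retryable errors return immediately instead of feeding a loop.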

Evaluation Pipeline

150-input benchmark built from real anonymised user documents — the first automated regression gate in the system.

150 curated real-user document inputs
Automated scoring on extraction accuracy and summary quality
CI integration — every prompt change runs the full benchmark
Regression threshold: any drop >2% fails the build
Weekly benchmark expansion as new edge cases are discovered

✓ Silent regressions: prevented

✓ 3 regressions caught in first 2 weeks

✓ Prompt iteration confidence: restored
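
The regression gate can be sketched as a comparison between a stored baseline and the candidate's benchmark scores. This sketch assumes the 2% threshold means an absolute drop of 2 percentage points on a 0 to 100 scale, which the source does not specify; `regression_gate` and the metric names are hypothetical.

```python
def regression_gate(baseline, candidate, max_drop=2.0):
    """Return the metrics whose score fell by more than max_drop percentage points.

    baseline and candidate map metric name -> score on a 0-100 scale.
    A metric missing from the candidate is scored as 0 and therefore fails.
    """
    failures = []
    for metric, base_score in baseline.items():
        drop = base_score - candidate.get(metric, 0.0)
        if drop > max_drop:
            failures.append(metric)
    return failures
```

In CI, the job would run the benchmark, call the gate, and exit non-zero when the returned list is not empty, which is what turns a silent regression into a failed build.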

Full Observability Stack

Structured logging and cost tracking on every agent step — the visibility layer that should have existed from the start.

Structured JSON logs per step: agent_id, run_id, tokens, latency, cost
Real-time cost tracking per request and per user
PagerDuty alert on error rate >2% over 5-minute window
Grafana dashboard: agent runs, failures, cost per feature
Time-to-detect target: under 2 minutes (from 47 minutes)

✓ Time-to-detect failures: 47 min → <2 min

✓ Full cost visibility from day 1

✓ p95 latency: 1,800ms → 195ms
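
A per-step log record with the fields listed above (agent_id, run_id, tokens, latency, cost) might be built and aggregated as follows. This is a sketch: the function names and the `usd_per_1k_tokens` price parameter are assumptions, and real pricing differs by model and by input versus output tokens.

```python
import json
from collections import defaultdict


def step_record(agent_id, run_id, tokens, latency_ms, usd_per_1k_tokens):
    """Build one log record for a single agent step."""
    return {
        "agent_id": agent_id,
        "run_id": run_id,
        "tokens": tokens,
        "latency_ms": latency_ms,
        # Cost is derived from the token count; the price is a caller-supplied assumption.
        "cost_usd": round(tokens / 1000 * usd_per_1k_tokens, 6),
    }


def to_log_line(record):
    """Serialise a record as one JSON object per line, with stable key order."""
    return json.dumps(record, sort_keys=True)


def cost_per_run(records):
    """Sum step costs by run_id to get the cost of each request."""
    totals = defaultdict(float)
    for record in records:
        totals[record["run_id"]] += record["cost_usd"]
    return dict(totals)
```

Keeping `run_id` on every line is the design choice that matters: it lets a dashboard group steps into requests, and requests into per-user or per-feature cost, without any extra instrumentation.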

Sprint Timeline

From first commit to 500 users live — 5 days.

Day 1

Triage & Reproduction

›Reproduced all 6 failure modes in isolation
›Classified root causes into infrastructure / logic / data buckets
›Defined blast radius — confirmed memory contamination scope
›Scoped sprint deliverables and timeline

Days 2–3

Rebuild & Eval Pipeline

›LangGraph graph rewrite with typed tool results
›Memory isolation architecture
›Retry / fallback chain implementation
›150-input benchmark construction and baseline scoring

Day 4

Observability & Testing

›Structured logging on all agent steps
›Cost tracking instrumentation
›Grafana dashboard and PagerDuty alerts
›Full regression suite passing at 100%

Day 5

Deploy & Monitor

›Staged rollout: 10% → 50% → 100% of users
›Real-time monitoring during rollout
›Zero incidents across full 500-user deployment
›Handover documentation and runbook delivered
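
One common way to implement a staged rollout like the 10% → 50% → 100% sequence above is deterministic bucketing on a hash of the user ID. The source does not say how the rollout was gated, so this is purely an assumed approach; `rollout_bucket` and `in_rollout` are hypothetical helpers.

```python
import hashlib


def rollout_bucket(user_id):
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_rollout(user_id, percent):
    """Return True if the user falls inside the current rollout percentage."""
    return rollout_bucket(user_id) < percent
```

Because the bucket depends only on the user ID, anyone included at 10% stays included at 50% and 100%, so no user flips between the old and new pipeline as the rollout widens.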

Results

What Changed in 30 Days

The first 30 days after deployment produced zero production incidents, zero agent loops, and zero user-reported failures — a clean break from the previous 60-day stall.

Reliability

Agent error rate: 45% → 0%
0 production incidents in 30 days
0 agent loops triggered
3 prompt regressions caught before shipping

Performance

p95 pipeline latency: 1,800ms → 195ms
p50 pipeline latency: 620ms → 88ms
First meaningful response: <1 second
Throughput: 10 concurrent requests supported

Operations

Time-to-detect failures: 47 min → <2 min
Full cost visibility per user and request
Benchmark coverage: 150 real-user inputs
On-call runbook: delivered with handover

Before — 60 Days of Stall

45% agent error rate, mostly silent
Agents looping until token budget exhausted
Patient document context leaking across sessions
47-minute average time-to-detect a failure
Every prompt change a manual gamble
No cost visibility — bill arriving as a surprise

After — 500 Users, 0 Incidents

0% agent error rate for 30 consecutive days
Recursion limit enforced: runaway loops capped at 15 steps
Per-session memory isolation — no cross-contamination
Under 2-minute time-to-detect via PagerDuty
Every prompt change benchmarked before shipping
Real-time cost dashboard per user and feature

Technology

Stack Used in This Engagement

Orchestration

›LangGraph (agent graph)
›Python 3.11
›Pydantic (typed tool results)
›Custom retry-with-backoff layer

AI & Models

›OpenAI GPT-4 (primary)
›GPT-4o-mini (fallback)
›Pinecone (vector store)
›text-embedding-3-small

Backend

›FastAPI
›PostgreSQL (session store)
›Redis (context cache)
›Celery (async tasks)

Infrastructure

›AWS ECS (Fargate)
›AWS RDS PostgreSQL
›CloudWatch Logs
›GitHub Actions CI/CD

Observability

›Grafana (dashboards)
›PagerDuty (alerting)
›Structured JSON logging
›Cost tracking per request

Evaluation

›Custom eval harness (Python)
›150-input benchmark corpus
›CI-gated regression threshold
›Weekly benchmark expansion

Key Takeaways

What This Case Study Teaches

Ship the eval pipeline before you ship the feature

The single biggest cost of the 60-day stall was not the broken code — it was the team's inability to know if their patches were working. An eval pipeline is not optional infrastructure. It is the tool that makes iteration safe.

Observability is not a post-launch concern

A 47-minute time-to-detect is a 47-minute user-impact window. The logging and alerting that most teams deprioritise as "nice to have" is what separates a recoverable incident from a customer churn event.

Proximate and structural causes are different problems

The agent loops were a proximate cause. The absence of recursion limits was a structural cause. Fixing one loop without adding the limit left the system open to the next one. Every hotfix needs to address both layers.

Memory isolation is a compliance issue, not just an architecture one

For health-tech, shared agent state is not merely an engineering problem — it is a HIPAA exposure. Treating memory isolation as a security requirement rather than a performance optimisation changes how urgently it gets built.

Demos and production test different things

A demo tests the happy path with known inputs. Production tests the full input distribution of real users. The gap between these two is where most AI systems fail. Bridging it requires deliberate investment in edge-case testing.

Fixed-scope specialist sprints outpace open-ended hiring

Hiring a senior AI engineer to solve this would have taken 3–6 months and cost 6–10× more than the sprint. For a specific, well-defined problem, specialist execution on a fixed scope is the faster and cheaper path.

Get the Same Result

Your AI Stall Has a Fixed-Price Fix

If your AI pilot has been stalled for more than 30 days, submit a problem description. We will review it and return a scoped recovery offer within 24 hours.

🔍

Diagnosis in 4 Hours

Triage call and root-cause classification within 4 hours of submission.

🔧

Fix in 24–48 Hours

Hotfix deployed with regression tests and observability included.

💰

Fixed Price, No Surprises

Scoped engagement with a single price agreed before we start.

No retainer. No open-ended engagement. One problem, one price, one timeline.