How a Health-Tech Startup Unblocked a 60-Day AI Stall in 5 Days
Broken LangGraph orchestration, no eval pipeline, zero observability — diagnosed and shipped to 500 real users in a single sprint.
The Situation at a Glance
The Problem
- AI document pipeline stalled for 60 days
- Agents looping with no recursion limit
- Shared memory state corrupting outputs
- No eval pipeline — every change was a gamble
- Zero production observability
What We Built
- Full orchestration rebuild in LangGraph
- Typed tool results with defined failure modes
- Memory isolation — per-session scoped context
- 150-input evaluation benchmark
- Structured per-step logging and cost tracking
The Results
- Deployed to 500 users in 5 days
- 0 production incidents in 30 days
- Agent error rate: 45% → 0%
- p95 pipeline latency under 200ms
- Full cost and error visibility from day one
“The demo worked perfectly. Production didn't. By the time we understood why, two months had passed and we still weren't live.”
— Founder, HealthFlow (anonymised)
Why the Demo Worked and Production Did Not
The HealthFlow team had built a 3-agent pipeline that extracted key clinical information from patient intake documents, summarised findings, and flagged anomalies for review. In a controlled environment, it performed well. Six failure modes prevented it from surviving contact with real users.
No Evaluation Pipeline
Every prompt change was shipped blindly. There was no benchmark, no test set, no way to know if a tweak improved or regressed performance.
Impact: 3 silent regressions shipped in 6 weeks — discovered only when users complained
Agent Loops
A misconfigured retry condition caused agents to loop indefinitely on ambiguous inputs. No recursion limit was set.
Impact: Single looping request consumed 100% of the hourly token budget in under 4 minutes
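The missing guard is small. The sketch below is illustrative only and is not the HealthFlow code: `RunState`, `run_with_step_limit`, and the two step functions are hypothetical names, and the limit of 15 mirrors the figure used in the rebuild. LangGraph offers the same kind of guard through the `recursion_limit` key of the run config.

```python
from dataclasses import dataclass
from typing import Callable


class StepLimitExceeded(RuntimeError):
    """Raised when a run uses up its step budget without finishing."""


@dataclass
class RunState:
    document: str
    steps: int = 0
    done: bool = False


def run_with_step_limit(
    state: RunState,
    step: Callable[[RunState], RunState],
    max_steps: int = 15,
) -> RunState:
    """Call `step` until the state is done, or fail once max_steps is reached."""
    while not state.done:
        if state.steps >= max_steps:
            raise StepLimitExceeded(f"stopped after {state.steps} steps")
        state = step(state)
        state.steps += 1
    return state


def ambiguous_input_step(state: RunState) -> RunState:
    # Hypothetical agent step that never resolves: it always asks for a retry.
    return state


def resolving_step(state: RunState) -> RunState:
    # Hypothetical agent step that finishes on its third call.
    state.done = state.steps >= 2
    return state
```

With a bound in place, an input that never resolves costs at most 15 steps and surfaces as an explicit error instead of draining the token budget.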
Memory Contamination
Agents shared a global memory store rather than per-session scoped contexts. One user's document context leaked into another's responses.
Impact: HIPAA compliance risk — patient document summaries visible across sessions
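The difference between the two designs fits in a few lines. This is a simplified in-process sketch under our own assumptions, with hypothetical class names; a production store would sit behind a database or cache and enforce access control as well.

```python
from collections import defaultdict


class GlobalMemory:
    """The failure mode: one list shared by every session."""

    def __init__(self) -> None:
        self._items: list[str] = []

    def add(self, session_id: str, text: str) -> None:
        self._items.append(text)  # session_id is ignored on write

    def context(self, session_id: str) -> list[str]:
        return list(self._items)  # and ignored on read


class SessionScopedMemory:
    """The fix: every read and write is keyed by session_id."""

    def __init__(self) -> None:
        self._items: dict[str, list[str]] = defaultdict(list)

    def add(self, session_id: str, text: str) -> None:
        self._items[session_id].append(text)

    def context(self, session_id: str) -> list[str]:
        # .get avoids creating an entry for sessions that never wrote anything
        return list(self._items.get(session_id, []))

    def end_session(self, session_id: str) -> None:
        # Drop the context as soon as the session is over
        self._items.pop(session_id, None)
```

In the global version, anything written by session A is returned to session B. In the scoped version, session B sees an empty context until it writes its own.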
Missing Fallback Chain
The 3-step pipeline had no fallback at any stage. A timeout on step 2 broke step 3 with no graceful error state returned to the user.
Impact: 100% of requests during a 12-minute API degradation returned unhandled 500 errors
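A fallback chain can be sketched as an ordered list of callables with a defined last resort. The names below (`PipelineResponse`, `run_with_fallback`, `primary`, `smaller`) are hypothetical, and the two model functions are stand-ins that simulate a timeout and a successful call rather than real API clients.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class PipelineResponse:
    ok: bool
    text: str
    served_by: str


def run_with_fallback(
    prompt: str,
    models: Sequence[tuple[str, Callable[[str], str]]],
) -> PipelineResponse:
    """Try each model in order; return a graceful error if all of them fail."""
    for name, call in models:
        try:
            return PipelineResponse(ok=True, text=call(prompt), served_by=name)
        except (TimeoutError, ConnectionError):
            continue  # degrade to the next model in the chain
    return PipelineResponse(
        ok=False,
        text="We could not process this document right now. Please try again shortly.",
        served_by="none",
    )


def primary(prompt: str) -> str:
    # Stand-in for the primary model during an upstream degradation
    raise TimeoutError("simulated upstream timeout")


def smaller(prompt: str) -> str:
    # Stand-in for the smaller fallback model
    return f"summary of: {prompt}"
```

Calling `run_with_fallback("intake form", [("primary", primary), ("smaller", smaller)])` returns a response served by the smaller model. If every model fails, the caller still receives a well-formed response it can show to the user instead of an unhandled 500.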
Zero Observability
No structured logging, no cost tracking per request, no alerting. Failures were discovered via Slack messages from users, not monitoring.
Impact: Average time-to-detect a production failure: 47 minutes
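Closing a detection gap like this usually starts with a rolling error-rate check. The sketch below assumes a 5-minute window and a 2% threshold, matching the alert rule listed later in this case study. `ErrorRateWindow` is a hypothetical name, and a real deployment would normally express the rule in the alerting tool rather than in application code.

```python
from collections import deque


class ErrorRateWindow:
    """Rolling error rate over the last `window_s` seconds."""

    def __init__(self, window_s: float = 300.0, threshold: float = 0.02) -> None:
        self.window_s = window_s
        self.threshold = threshold
        self._events: deque[tuple[float, bool]] = deque()

    def record(self, ts: float, is_error: bool) -> None:
        """Store one request outcome with its timestamp in seconds."""
        self._events.append((ts, is_error))
        self._evict(ts)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        return sum(err for _, err in self._events) / len(self._events)

    def should_alert(self, now: float) -> bool:
        # Alert only when the rate is strictly above the threshold
        return self.error_rate(now) > self.threshold
```

Each request calls `record` with its outcome, and a periodic job calls `should_alert` to decide whether to page.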
Prompt Brittleness
Prompts were tuned on 12 internal test documents. Real users submitted PDFs with scanned text, tables, and non-English content — none of which the prompts handled.
Impact: Over 30% of real-user inputs produced hallucinated or empty summaries
The Compounding Effect
Each failure mode existed independently, but in production they interacted, making the system unreliable in ways that were hard to reproduce and impossible to debug without proper tooling. With no observability to expose the structural causes, the team spent 60 days patching symptoms instead.
A Production-Grade Rebuild in 5 Days
We did not patch the existing system. A structural diagnosis showed that the original architecture had no path to production reliability — it needed to be rebuilt on correct foundations. The sprint was scoped to three deliverables: a production-safe orchestration layer, an evaluation pipeline, and full observability.
LangGraph Orchestration Rebuild
Complete rewrite of the 3-agent pipeline using LangGraph with production-safe patterns.
- Typed ToolResult model with success/error/retryable fields
- Recursion limit of 15 steps enforced at graph level
- Per-session memory isolation via scoped state objects
- Fallback chain: primary model → smaller model → graceful error
- Structured retry-with-exponential-backoff on all tool calls
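The typed result and the retry layer work together: a tool reports whether a failure is worth retrying, and the retry wrapper acts only on that flag. The sketch below uses a standard-library dataclass as a stand-in for the Pydantic model. The `value` field, the attempt count, the base delay, and the jitter are our assumptions for illustration.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToolResult:
    success: bool
    value: Optional[str] = None
    error: Optional[str] = None
    retryable: bool = False


def call_with_backoff(
    tool: Callable[[], ToolResult],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> ToolResult:
    """Retry only results marked retryable, doubling the delay each attempt."""
    result = tool()
    for attempt in range(1, max_attempts):
        if result.success or not result.retryable:
            return result
        delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
        sleep(delay + random.uniform(0, delay / 10))  # small jitter
        result = tool()
    # Either the last attempt succeeded or the attempts are exhausted
    return result
```

Because failures come back as values rather than exceptions, the graph can route on `success` and `retryable` explicitly. The `sleep` parameter is injectable so the wrapper can be tested without waiting.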
Evaluation Pipeline
150-input benchmark built from real anonymised user documents — the first automated regression gate in the system.
- 150 curated real-user document inputs
- Automated scoring on extraction accuracy and summary quality
- CI integration — every prompt change runs the full benchmark
- Regression threshold: any drop >2% fails the build
- Weekly benchmark expansion as new edge cases are discovered
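The regression gate reduces to one comparison. This sketch assumes per-input scores between 0 and 1 and reads the 2% threshold as an absolute drop in the mean score. The function names are hypothetical and the scoring itself is out of scope here.

```python
from statistics import mean


def passes_gate(
    baseline: list[float],
    candidate: list[float],
    max_drop: float = 0.02,
) -> bool:
    """Return True unless the mean score drops by more than max_drop."""
    if len(baseline) != len(candidate):
        raise ValueError("score lists must cover the same benchmark inputs")
    return mean(baseline) - mean(candidate) <= max_drop


def exit_code(baseline: list[float], candidate: list[float]) -> int:
    """Exit code for a CI step: 0 passes the build, 1 fails it."""
    return 0 if passes_gate(baseline, candidate) else 1
```

A CI step would score the candidate prompt on the full benchmark, then call `sys.exit(exit_code(baseline_scores, candidate_scores))` so that a regression fails the build.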
Full Observability Stack
Structured logging and cost tracking on every agent step — the visibility layer that should have existed from the start.
- Structured JSON logs per step: agent_id, run_id, tokens, latency, cost
- Real-time cost tracking per request and per user
- PagerDuty alert on error rate >2% over 5-minute window
- Grafana dashboard: agent runs, failures, cost per feature
- Time-to-detect target: under 2 minutes (from 47 minutes)
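A per-step log record with the fields listed above could look like the following sketch. The price constant is a made-up placeholder, since real prices depend on the model and provider, and `log_step` and `timed_step` are hypothetical helper names.

```python
import json
import time
from typing import Callable

# Hypothetical price per 1,000 tokens, used only to illustrate the cost field
PRICE_PER_1K_TOKENS_USD = 0.01


def log_step(agent_id: str, run_id: str, tokens: int, latency_ms: float) -> str:
    """Build one JSON log line for a single agent step."""
    record = {
        "ts": time.time(),
        "agent_id": agent_id,
        "run_id": run_id,
        "tokens": tokens,
        "latency_ms": round(latency_ms, 1),
        "cost_usd": round(tokens / 1000 * PRICE_PER_1K_TOKENS_USD, 6),
    }
    return json.dumps(record, sort_keys=True)


def timed_step(agent_id: str, run_id: str, step: Callable[[], int]) -> str:
    """Run `step` (which returns the tokens it used) and return its log line."""
    start = time.perf_counter()
    tokens = step()
    latency_ms = (time.perf_counter() - start) * 1000
    return log_step(agent_id, run_id, tokens, latency_ms)
```

Emitting one line per step with a shared `run_id` lets a dashboard sum cost per request and trace a failure back to the agent that caused it.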
Sprint Timeline
From first commit to 500 users live — 5 days.
Triage & Reproduction
- Reproduced all 6 failure modes in isolation
- Classified root causes into infrastructure / logic / data buckets
- Defined blast radius and confirmed memory contamination scope
- Scoped sprint deliverables and timeline
Rebuild & Eval Pipeline
- LangGraph graph rewrite with typed tool results
- Memory isolation architecture
- Retry / fallback chain implementation
- 150-input benchmark construction and baseline scoring
Observability & Testing
- Structured logging on all agent steps
- Cost tracking instrumentation
- Grafana dashboard and PagerDuty alerts
- Full regression suite passing at 100%
Deploy & Monitor
- Staged rollout: 10% → 50% → 100% of users
- Real-time monitoring during rollout
- Zero incidents across full 500-user deployment
- Handover documentation and runbook delivered
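One common way to implement a staged rollout is to hash each user ID into a stable bucket and compare it with the current percentage. This is a generic sketch with hypothetical function names, not a description of how this rollout was wired.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(user_id: str, percent: int) -> bool:
    """True if the user falls inside the current rollout percentage."""
    return rollout_bucket(user_id) < percent
```

Because the bucket depends only on the user ID, everyone included at 10% stays included at 50% and 100%, so no user flips back to the old pipeline between stages.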
What Changed in 30 Days
The first 30 days after deployment produced zero production incidents, zero agent loops, and zero user-reported failures — a clean break from the previous 60-day stall.
Reliability
- Agent error rate: 45% → 0%
- 0 production incidents in 30 days
- 0 agent loops triggered
- 3 prompt regressions caught before shipping
Performance
- p95 pipeline latency: 1,800ms → 195ms
- p50 pipeline latency: 620ms → 88ms
- First meaningful response: <1 second
- Throughput: 10 concurrent requests supported
Operations
- Time-to-detect failures: 47 min → <2 min
- Full cost visibility per user and request
- Benchmark coverage: 150 real-user inputs
- On-call runbook: delivered with handover
Before — 60 Days of Stall
- 45% agent error rate, mostly silent
- Agents looping until token budget exhausted
- Patient document context leaking across sessions
- 47-minute average time-to-detect a failure
- Every prompt change a manual gamble
- No cost visibility — bill arriving as a surprise
After — 500 Users, 0 Incidents
- 0% agent error rate for 30 consecutive days
- Recursion limit enforced: any loop is cut off at 15 steps
- Per-session memory isolation — no cross-contamination
- Under 2-minute time-to-detect via PagerDuty
- Every prompt change benchmarked before shipping
- Real-time cost dashboard per user and feature
Stack Used in This Engagement
Orchestration
- LangGraph (agent graph)
- Python 3.11
- Pydantic (typed tool results)
- Custom retry-with-backoff layer
AI & Models
- OpenAI GPT-4 (primary)
- GPT-4o-mini (fallback)
- Pinecone (vector store)
- text-embedding-3-small
Backend
- FastAPI
- PostgreSQL (session store)
- Redis (context cache)
- Celery (async tasks)
Infrastructure
- AWS ECS (Fargate)
- AWS RDS PostgreSQL
- CloudWatch Logs
- GitHub Actions CI/CD
Observability
- Grafana (dashboards)
- PagerDuty (alerting)
- Structured JSON logging
- Cost tracking per request
Evaluation
- Custom eval harness (Python)
- 150-input benchmark corpus
- CI-gated regression threshold
- Weekly benchmark expansion
What This Case Study Teaches
Ship the eval pipeline before you ship the feature
The single biggest cost of the 60-day stall was not the broken code — it was the team's inability to know if their patches were working. An eval pipeline is not optional infrastructure. It is the tool that makes iteration safe.
Observability is not a post-launch concern
A 47-minute time-to-detect is a 47-minute user-impact window. The logging and alerting that most teams deprioritise as "nice to have" are what separate a recoverable incident from a customer churn event.
Proximate and structural causes are different problems
The misconfigured retry condition was the proximate cause of the agent loops. The missing recursion limit was the structural cause. Fixing the retry condition without adding the limit would have left the next looping bug free to exhaust the token budget again. Every hotfix needs to address both layers.
Memory isolation is a compliance issue, not just an architecture one
For health-tech, shared agent state is not merely an engineering problem — it is a HIPAA exposure. Treating memory isolation as a security requirement rather than a performance optimisation changes how urgently it gets built.
Demos and production test different things
A demo tests the happy path with known inputs. Production tests the full input distribution of real users. The gap between these two is where most AI systems fail. Bridging it requires deliberate investment in edge-case testing.
Fixed-scope specialist sprints outpace open-ended hiring
Hiring a senior AI engineer to solve this would have taken 3–6 months and cost 6–10× more than the sprint. For a specific, well-defined problem, specialist execution on a fixed scope is the faster and cheaper path.
Your AI Stall Has a Fixed-Price Fix
If your AI pilot has been stalled for more than 30 days, submit a problem description. We will review it and return a scoped recovery offer within 24 hours.
Diagnosis in 4 Hours
Triage call and root-cause classification within 4 hours of submission.
Fix in 24–48 Hours
Hotfix deployed with regression tests and observability included.
Fixed Price, No Surprises
Scoped engagement with a single price agreed before we start.
No retainer. No open-ended engagement. One problem, one price, one timeline.