
Why AI Projects Fail After the Demo

What separates a working demo from a broken production system — rate limits, real data, missing guardrails, and the edge cases nobody tested.

February 10, 2026
5 min read
Tags: AI, RAG, LLM, Production Engineering, API, Rate Limits

The demo went well. It usually does.

The client's product catalog parsed cleanly. The AI pulled accurate stock levels, summarized recent orders, answered edge-case questions without flinching. Twenty minutes, no errors, convincing. Six weeks later in production, the same system was timing out on every third request, occasionally returning specs for products that didn't exist, and had already cost more in API calls in week two than the entire pilot had cost to build.

Nothing changed in the code. The system was doing exactly what it did in the demo.

That's the pattern.

Demos don't test what production breaks on

The first problem is structural. A demo is built around the happy path. You pick representative data, write prompts for the scenarios you've thought about, and run through those scenarios in front of someone who's already inclined to think this will work.

Production is everything else. Malformed inputs. Requests at 3am when the rate limiter is still sitting on the developer tier. A customer who uploads a PDF from 2009 with three-column OCR artifacts and asks the system to extract delivery dates from it. The LLM doesn't fail gracefully on those cases; it improvises.

When a system has no mechanism for saying "I don't know," it fills the gap with something plausible. That's not a model bug; it's a design gap in the system around it.
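One way to close that gap is to make refusal an explicit code path instead of hoping the model volunteers it. A minimal sketch, where `retrieve` and `call_model` are placeholder hooks for whatever search layer and LLM client you actually run, and the confidence threshold is an assumption to calibrate on your own data:

```python
from typing import List, Tuple

MIN_SCORE = 0.75  # assumed cutoff; calibrate against your own retrieval evals

def retrieve(question: str, top_k: int = 5) -> List[Tuple[str, float]]:
    """Placeholder for your real search layer (vector DB, BM25, ...)."""
    return []

def call_model(prompt: str) -> str:
    """Placeholder for your LLM client."""
    return "..."

def answer(question: str) -> str:
    passages = retrieve(question)
    grounded = [(text, score) for text, score in passages if score >= MIN_SCORE]
    if not grounded:
        # Refusal happens in code, before the model gets a chance to improvise.
        return "I don't have reliable data to answer that."
    context = "\n\n".join(text for text, _ in grounded)
    return call_model(
        "Answer only from the context below. If the context doesn't contain "
        f"the answer, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
```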

The cost nobody plans for

Demo environments run at demo scale. Ten queries in a test session don't stress an API. Ten thousand queries in a production week do.

There's a documented pattern here: an e-commerce workflow that processed 1,000 documents cleanly during a pilot — roughly $45 in API costs — hit $11,000 in week two when it ran against real production volume at 50,000 documents per day. Same code. The difference was production data's long tail of edge cases, and edge cases send token counts up.

This is why semantic caching and tiered model routing matter before the system hits production, not after the first bill arrives. Using a flagship model for a classification task that a smaller, cheaper model handles correctly at a fraction of the cost isn't a trade-off — it's just waste that accumulates invisibly until it doesn't.
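In code, both ideas are unglamorous. A sketch with placeholder model names and a deliberately naive cache; a real semantic cache matches on embedding similarity rather than exact strings:

```python
import hashlib

CHEAP_MODEL = "small-model"    # placeholder names; substitute your real tiers
FLAGSHIP_MODEL = "large-model"

_cache: dict[str, str] = {}

def call_model(model: str, prompt: str) -> str:
    """Placeholder for your LLM client."""
    return f"[{model}] ..."

def cached_call(model: str, prompt: str) -> str:
    # Naive exact-match cache keyed on the normalized prompt. A semantic
    # cache embeds the prompt and reuses answers above a similarity threshold.
    key = hashlib.sha256(f"{model}:{prompt.strip().lower()}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(model, prompt)
    return _cache[key]

def route(task: str, prompt: str) -> str:
    # Bounded tasks (classification, extraction) go to the cheap tier;
    # the flagship model is reserved for open-ended generation.
    model = CHEAP_MODEL if task in {"classify", "extract"} else FLAGSHIP_MODEL
    return cached_call(model, prompt)
```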

The data assumption

Demos run on data you prepared. Production runs on data accumulated over years by people who had other priorities than making it machine-readable.

I've connected AI systems to WooCommerce installations where half the product descriptions were empty, GA4 setups where conversion events were firing twice, Klaviyo lists with tens of thousands of addresses that hadn't engaged since 2021. The model sees all of it, treats it as valid input, and produces outputs weighted accordingly.

Without a pipeline that validates and cleans data before it reaches the model — or at least makes the model aware of confidence levels — the output is only as reliable as the worst record in the dataset. Demos rarely contain the worst record in the dataset.
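The pipeline doesn't need to be sophisticated to earn its place. A sketch of a validation pass, with illustrative field names rather than any platform's real schema, that flags records before they reach the index:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Record:
    description: str
    last_engaged: date | None

STALE_AFTER = date(2023, 1, 1)  # assumed cutoff; pick one that fits your data

def validate(record: Record) -> tuple[bool, list[str]]:
    problems = []
    if not record.description.strip():
        problems.append("empty description")
    if record.last_engaged is None or record.last_engaged < STALE_AFTER:
        problems.append("stale or missing engagement")
    # Either drop the record, or pass `problems` along as confidence
    # metadata so downstream consumers can see how shaky the input is.
    return (not problems, problems)

records = [
    Record("Blue widget, 40mm", date(2025, 6, 1)),
    Record("", None),  # the kind of row demos never contain
]
clean = [r for r in records if validate(r)[0]]
```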

The permissions problem

This one is less technical and more expensive when it goes wrong.

Earlier this year I wrote about a specific incident: an AI agent with full API access to a client's e-commerce stack executed a set of underspecified instructions correctly — meaning it did exactly what it was told — and in doing so deleted Klaviyo lists, modified product data, and broke the checkout flow simultaneously. The agent wasn't malfunctioning. It was following instructions that hadn't been written carefully enough, with access that was broader than the task required.

The failure mode isn't the AI doing something unexpected. It's the AI doing something precise, on bad inputs, with unconstrained write permissions.

Minimum-access API keys. Staging environments that mirror production data. At least one human review step before any write operation becomes permanent. None of this is complicated, and none of it shows up in demos because demos don't need it.
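The review step can be as blunt as a queue the agent cannot skip: it proposes writes, and nothing executes until a human flips the flag. A sketch, with hypothetical operation names:

```python
from dataclasses import dataclass

@dataclass
class WriteOp:
    action: str   # e.g. "delete_list", "update_product" (illustrative names)
    target: str
    approved: bool = False

pending: list[WriteOp] = []

def propose(action: str, target: str) -> WriteOp:
    # The agent can only append to the queue; this call never touches the API.
    op = WriteOp(action, target)
    pending.append(op)
    return op

def apply_write(op: WriteOp) -> None:
    """Placeholder for the real API call, made with a minimum-access key."""
    print(f"executing {op.action} on {op.target}")

def execute_approved() -> None:
    for op in pending:
        if op.approved:   # set by a human reviewer, never by the agent
            apply_write(op)
```

A staging mirror closes the loop on the same pattern: the agent proposes against staged data first, so a badly scoped instruction burns a copy, not the store.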

What RAG actually solves

RAG gets name-dropped in most AI pitches, usually as nothing more than an acronym. The reason it matters is concrete: LLMs without grounding in current, specific data produce answers that are plausible but not necessarily accurate. Connecting the model to a retrieval layer (a vector database populated with your actual documents, your actual policies, your actual product data) is what moves the output from "sounds right" to "is right."
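Stripped of vendor branding, retrieval is a small amount of machinery. A toy in-memory version, with a character-frequency vector standing in for a real embedding model:

```python
import math

def embed(text: str) -> list[float]:
    """Toy stand-in for a real embedding model: a character-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: str, docs: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# The retrieved passages become the grounding context for the prompt,
# which is what moves answers from plausible to verifiable.
docs = ["Returns accepted within 30 days.", "Shipping takes 3-5 business days."]
print(top_k("how long does shipping take?", docs, k=1))
```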

Beyond retrieval: evaluation. Production needs a way to catch hallucinations before a user does. Running a second pass that audits outputs against known ground truth before returning them isn't elegant, but it works. You can implement it as a dedicated evaluation agent, as a structured fact-check against a reference document, or as a human escalation trigger when confidence scores fall below a threshold.
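A sketch of the escalation variant, where `judge` is a placeholder for whichever second pass you run (a grader model, a structured fact-check against a reference document) and the threshold is an assumption to calibrate on labeled examples:

```python
ESCALATE_BELOW = 0.8  # assumed threshold; calibrate on labeled examples

def judge(answer: str, reference: str) -> float:
    """Placeholder for the second pass: returns a 0..1 score for how well
    `answer` is supported by `reference`."""
    return 0.0

def escalate_to_human(answer: str, score: float) -> str:
    # Route to a review queue instead of the user.
    return f"[held for human review, confidence {score:.2f}]"

def deliver(answer: str, reference: str) -> str:
    score = judge(answer, reference)
    if score < ESCALATE_BELOW:
        # A low-confidence answer never reaches the user unreviewed.
        return escalate_to_human(answer, score)
    return answer
```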

The demos that turn into production systems are the ones where someone asked, before writing the first prompt: what happens when this is wrong, and who catches it?

Why it keeps happening

The AI demo is optimized to produce a decision — proceed or don't. Once the decision is proceed, the actual work of building something that survives contact with real data and real users begins. It's a different kind of work than arranging API calls and writing prompts, and the timeline for it rarely appears in the slide that shows the demo.

Most of the projects I've seen stall after the pilot weren't bad ideas. They were good ideas that got treated as finished once they looked good in a room.

Before the demo ends, it's worth asking: who's responsible for what happens in week six?
