Your Team's Prompts Are Not the Problem

Why rewriting the prompt rarely fixes a stuck AI system, and the five things in the architecture that actually decide whether it works.

June 3, 2026
9 min read
Tags
ai-architecture-consulting, ai-consulting, production-ai, system-design, rag

Most companies that reach out to me about a stuck AI project start the conversation with the same problems:

  • They've tried a few different models.
  • They've rewritten the prompts a dozen times.
  • The system works sometimes, fails other times, and nobody on the team can tell you why.

So they want me to come in and somehow "improve the prompts." I can do that, but bad prompting is almost never the actual problem.

When an AI system works inconsistently in production, the prompt is the most visible thing. It's the part the team has direct control over. It's the part that changes when things go wrong. So that's where the attention goes. The team tweaks the prompt, the behavior shifts slightly, the issue moves somewhere else, and the cycle starts again. Six months later, they're on prompt version 47, and the system still doesn't work.

The prompt is rarely the problem. Most of the time, the problem lives in the architecture around the prompt. The prompt just happens to be where the symptom shows up, and it is the easiest and fastest thing to blame.

Finding and fixing that is what AI architecture consulting actually is.

What "architecture" means in an AI system

In a traditional web application, architecture is the relationship between your database, your application server, your front-end, your queues, your cache, and your monitoring. The decisions about what talks to what, how data flows, what fails when, and what you do about it.

An AI system has all of that plus a model in the middle that sometimes lies, sometimes refuses to answer, sometimes returns malformed output, sometimes costs ten times what you expected on a single request, and produces non-deterministic results across runs.

The architecture is the system you build around that model so that the unreliability stops mattering. Specifically:

What the model sees. What context, what data, what instructions, what retrieved knowledge. The model can only be as good as what you put in front of it. Most stuck AI projects fail here.

What you do with what comes back. Output validation, schema enforcement, retries on malformed responses, and fallbacks when the model is unsure. The model will produce wrong output sometimes. The architecture decides whether that wrong output ever reaches a user.
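As a rough illustration of that validation layer, here is a minimal Python sketch. It assumes the model is asked to return JSON with a `category` string and a `confidence` number. That schema, the `call_model` callable, and the thresholds are all hypothetical and not taken from any particular system.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: field name -> required Python type.
# Note: a bare integer such as 1 would be rejected by the float check.
REQUIRED_FIELDS = {"category": str, "confidence": float}


def parse_and_validate(raw: str) -> Optional[dict]:
    """Return the parsed payload if it matches the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data


def classify_with_guardrails(
    call_model: Callable[[str], str],
    text: str,
    max_attempts: int = 3,
    min_confidence: float = 0.7,
) -> dict:
    """Retry on malformed output; fall back when the model is unsure."""
    fallback = {"category": "needs_human_review", "confidence": 0.0}
    for _ in range(max_attempts):
        payload = parse_and_validate(call_model(text))
        if payload is None:
            continue  # malformed response: ask again
        if payload["confidence"] < min_confidence:
            return fallback  # well-formed but unsure: do not guess
        return payload
    return fallback  # every attempt was malformed
```

The point of the sketch is that the caller only ever sees a payload that passed the schema check or an explicit fallback, never raw model text.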

Where the model fits in the flow. Sync vs. async. User-facing vs. background. Real-time vs. cached. A model on the critical path of a checkout flow is a completely different system from the same model summarizing emails overnight, even if the prompt is identical.

How you know it's working. Logging, evaluation sets, regression tests, and alerts when the answer quality drops. Without these, you have no signal. Without a signal, every change is a guess.
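An eval set does not have to be elaborate to give you a signal. The sketch below is one possible shape, under stated assumptions: the two example cases, the substring-match scoring, and the `answer` callable are all hypothetical, and a real system would likely need a richer scoring method.

```python
from typing import Callable, List, Tuple

# Hypothetical eval set: (input, text the answer must contain).
EVAL_SET: List[Tuple[str, str]] = [
    ("What is your refund window?", "30 days"),
    ("Do you ship to Canada?", "yes"),
]


def pass_rate(
    answer: Callable[[str], str],
    eval_set: List[Tuple[str, str]] = EVAL_SET,
) -> float:
    """Fraction of eval cases whose answer contains the expected text."""
    passed = sum(
        1
        for question, expected in eval_set
        if expected.lower() in answer(question).lower()
    )
    return passed / len(eval_set)


def check_regression(
    answer: Callable[[str], str],
    baseline: float,
    tolerance: float = 0.05,
) -> bool:
    """True if the pass rate has not dropped below baseline - tolerance."""
    return pass_rate(answer) >= baseline - tolerance
```

Run something like `check_regression` before and after every prompt or retrieval change, and a change stops being a guess.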

What happens when it fails. Because it will. Rate limits, API outages, model deprecations, edge cases that hallucinate. The architecture is what decides whether a failure is invisible to the user or a Monday morning incident.
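One common way to keep a rate limit or short outage invisible to the user is to retry with exponential backoff and then fall back to something safe. The following is a sketch only: `TransientModelError`, `call_model`, and `fallback` are hypothetical names, and a real client library will raise its own exception types.

```python
import random
import time
from typing import Callable


class TransientModelError(Exception):
    """Hypothetical stand-in for a rate limit or temporary outage."""


def call_with_backoff(
    call_model: Callable[[], str],
    fallback: Callable[[], str],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry transient failures with backoff, then use the fallback."""
    for attempt in range(max_attempts):
        try:
            return call_model()
        except TransientModelError:
            if attempt == max_attempts - 1:
                break  # out of attempts: stop retrying
            # Exponential backoff plus jitter so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    return fallback()
```

The fallback might return a cached answer, a simpler rule-based result, or a polite "try again later" message. Which one is right depends on where the model sits in the flow.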

The prompt sits on top of all of that. A good prompt on top of a broken architecture is a slightly less broken system. A boring prompt on top of a thoughtful architecture is a system that runs in production and that nobody talks about, which is what good infrastructure is supposed to do.

A few patterns I see in stuck AI projects

"The model gives different answers to the same question." Sometimes this is the prompt. More often, it's that the retrieval layer is pulling different contexts each time, the conversation history is being truncated differently, or the temperature is set to something that introduces randomness for no good reason. The fix is upstream of the prompt.

"The model confidently makes things up." The prompt-engineering instinct is to add "do not hallucinate" or "only use information you are sure about" to the system prompt. This does not work. Hallucination happens when the model doesn't have enough context to answer correctly, so it fills the gap. The fix is to either give it the right context (retrieval) or to detect when it's outside its competence and refuse to answer (guardrails).
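A guardrail of this kind can be as simple as refusing before the model is ever called. The sketch below assumes a retrieval function that returns passages with a relevance score between 0 and 1. The `retrieve` and `generate` callables and the `0.75` threshold are hypothetical placeholders.

```python
from typing import Callable, List, Tuple

REFUSAL = "I don't have enough information to answer that."


def answer_with_guardrail(
    question: str,
    retrieve: Callable[[str], List[Tuple[str, float]]],
    generate: Callable[[str, List[str]], str],
    min_score: float = 0.75,
) -> str:
    """Only call the model when retrieval found relevant context."""
    relevant = [text for text, score in retrieve(question) if score >= min_score]
    if not relevant:
        # Nothing relevant was retrieved, so refuse instead of letting
        # the model fill the gap.
        return REFUSAL
    return generate(question, relevant)
```

The refusal is decided in code, from a measurable signal, rather than requested in the system prompt.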

"It works in testing and fails in production." Almost always a context problem. The test inputs were clean and representative. The production inputs include emoji, mixed languages, sarcasm, malformed JSON, and customer rants that span six messages. The prompt was tuned against the easy cases, but the architecture needs to handle the real ones.

"Costs are blowing up, and we don't know why." Usually one of three things: the conversation history is being sent in full every turn, the retrieval layer is returning too much irrelevant context, or someone added a "be thorough" instruction that doubled the average response length. None of these are prompt problems.
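For the first cause, sending the full history every turn, one possible fix is to cap the history at a token budget. This sketch keeps system messages and the most recent turns that fit. The message shape and the four-characters-per-token estimate are assumptions for illustration, and a real system should use the tokenizer that matches its model.

```python
from typing import Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def trim_history(messages: List[Message], budget: int) -> List[Message]:
    """Keep system messages plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    remaining = budget - sum(estimate_tokens(m["content"]) for m in system)

    kept: List[Message] = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > remaining:
            break  # older turns no longer fit
        kept.append(message)
        remaining -= cost
    return system + list(reversed(kept))  # restore chronological order
```

Logging the estimated token count per request alongside a function like this also tells you which request types are driving the bill.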

"The model is slow." The model is rarely slow. The system around it is making sequential calls when it could parallelize, waiting for full responses when it could stream, or running embeddings on every request when it could cache. The architecture is the bottleneck, not the inference.
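Two of those fixes are easy to sketch in Python: running independent calls concurrently and caching embeddings. Here `fake_model_call` is a hypothetical stand-in that only sleeps, and the in-memory dictionary cache is the simplest possible choice, not a recommendation for production.

```python
import asyncio
from typing import Callable, Dict, List


async def fake_model_call(prompt: str) -> str:
    """Hypothetical stand-in for a network-bound model call."""
    await asyncio.sleep(0.1)
    return f"answer to {prompt}"


async def run_sequential(prompts: List[str]) -> List[str]:
    """Each call waits for the previous one to finish."""
    return [await fake_model_call(p) for p in prompts]


async def run_parallel(prompts: List[str]) -> List[str]:
    """Independent calls run concurrently; results keep input order."""
    return list(await asyncio.gather(*(fake_model_call(p) for p in prompts)))


_embedding_cache: Dict[str, List[float]] = {}


def cached_embedding(text: str, embed: Callable[[str], List[float]]) -> List[float]:
    """Compute an embedding once per distinct text, then reuse it."""
    if text not in _embedding_cache:
        _embedding_cache[text] = embed(text)
    return _embedding_cache[text]
```

With several independent calls, the parallel version takes roughly as long as the slowest call, while the sequential version takes roughly the sum of all of them. This only applies when the calls do not depend on each other's output.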

In every one of these cases, the team's instinct is to rewrite the prompt. Sometimes the new prompt helps, but the relief is temporary, and the problem soon reappears.

What architecture consulting actually looks like

When I'm hired for this, the first week or two is usually not building anything. It's reading.

I read the existing system end-to-end. I look at what the model receives on a real request, not what the team thinks it receives. I look at how the output is parsed, validated, and used downstream. I look at where errors are caught and where they're swallowed. I look at the logs when they exist, and ask why not when they don't. I look at the cost breakdown by request type. I look for an eval set, and I almost always find that there isn't one.

This is the part that feels slow to clients who hired me to fix something. They wanted a code change. What they're getting is someone who knows their system better than they do, asking questions they didn't think to ask. A week in, I usually have a list of things that need to change, ordered by what will help most with the least disruption.

The recommendations are almost never about the prompt. They're about flow, validation, retrieval, evaluation, caching, and observability. Sometimes the prompt does need work, but it shows up at position four or five on the list, not at position one.

The second phase is implementing the changes. Sometimes I do this. Sometimes the client's team does it with me reviewing. Sometimes it's a mix. Whatever the split, it is essential to document who is doing what and how.

The third phase is the handoff: some combination of documentation, runbooks, the eval set, and a maintenance plan. Without this, the system will degrade as the team makes changes without understanding what they touched, and they'll be calling me back in six months for the same reasons they called the first time.

When you need AI architecture consulting and when you don't

You don't need it if your system works, your costs are predictable, your team can explain what's happening when something goes wrong, and you have a way to know if a change makes things better or worse. That's a healthy system. Keep doing what you're doing.

You do need it when the system almost works and nobody on the team can tell you why. When the prompt has been rewritten a dozen times. When the team is afraid to touch anything because they don't know what will break. When the demo was great and production is two months past the date you expected to ship.

The thing nobody tells you about AI architecture

Most of the work is not AI work.

It's the same work you'd do on any production app or website. Logging. Validation. Caching. Error handling. Retries with backoff. Schema enforcement. Monitoring. Observability. The discipline of building software that survives contact with users.

The AI part (the model, the prompts, the retrieval) is maybe 20% of the system by line count. The other 80% is the infrastructure that makes the 20% reliable. If your team is great at the 80% and learning the 20%, you probably don't need me. If your team has been treating the 20% as the whole system, that's what an AI architecture engagement actually fixes.

I have written about this from a different angle in the post about AI agents with too many permissions, and from yet another in the WordPress agencies post. The same pattern shows up in each: the team focused on the AI part and underbuilt the system around it. The fix is always architectural.

What you should do before hiring anyone

Before you bring in a consultant, write down what you know and what you don't.

  • What does the system actually do, end to end?
  • What inputs come in?
  • What does the model see on a typical request?
  • What comes out?
  • What do you do with the output?
  • Where do the failures happen?
  • What have you tried?
  • What didn't work and why?

Most teams cannot answer those questions in writing. The act of trying to answer them is sometimes enough to surface the problem without hiring anyone. If you write it all down and the gaps are obvious, fix the gaps yourself. If you write it all down and the system still doesn't make sense, that's when you call someone.

If your AI system is at the point where the prompt has been rewritten more times than anyone wants to admit and the team is starting to wonder if the whole thing was a mistake, that's the architecture work I do. Paid discovery, written scope, real handoff. The output is a system that stops being interesting, which is the goal.

What does your current architecture diagram leave out?
