Why prompt-based AI agents are not suitable for ecommerce brands
8 Oct 2025
In the world of AI agents, one debate keeps coming up: should we build them with workflows or with prompts?
At first glance, the choice can feel like a matter of style. Workflows are often seen as rigid and old-school, while prompt-based systems feel more dynamic and futuristic. Why map out every step an agent should take when you can simply tell it what to do in natural language?
But that perception is shifting fast. OpenAI’s release of AgentKit – and particularly its “Agent Builder,” described as a visual canvas for creating and versioning multi-agent workflows – shows that even the most prompt-centric LLM providers see the value of structured workflows. In short, the company most associated with prompt engineering is now building tools for workflow design.
At DigitalGenius, we believe the choice isn’t either-or – it’s both. Our platform uses a hybrid approach that combines workflows and prompts, enabling AI agents to handle complex ecommerce processes with accuracy and consistency. This means using specialist LLM agents with defined responsibilities and workflow logic to ensure consistent and correct solutions, with an Orchestration layer defining the next step.
This matters for any ecommerce brand that values customer experience, especially when dealing with refunds, order replacements, order cancellations or anything else where customers expect an action to be taken.
It applies even more in high-risk categories such as healthcare and food. Giving customers the wrong information about ingredients, or about whether products are suitable for a particular health condition, can be catastrophic and leaves brands open to liability.
Note: much of this applies beyond ecommerce, to any industry that involves regulation or compliance. But because we are ecommerce experts, we are going to talk about what we know best: resolving customer service queries for ecommerce brands.
What prompt-only approaches get wrong
Inability to handle complexity
Most ecommerce processes are complex: even something as basic as answering a “Where is my order?” (WISMO) query.
When a customer asks that, an AI agent has to make the right decision based on a number of different criteria, including:
When the order was placed
If and when it was dispatched from the warehouse
Which carrier is handling delivery
The origin and destination countries
The latest carrier status updates
The specific product and its expected lead times
Whether the customer paid for standard or expedited shipping
…and more
The agent’s answer should look completely different if the customer asks 5 minutes after checkout versus 5 months later, or if the order is on track versus overdue. Capturing every possible scenario like this inside a single prompt is practically impossible, as you’d end up writing a small novel. And even then, you’d be relying on the AI to follow every instruction flawlessly, every time.
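To see why, here is a minimal sketch of just a fraction of that decision logic expressed as a deterministic workflow. The field names, statuses and SLA thresholds below are illustrative assumptions, not an actual implementation:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Order:
    placed_on: date
    dispatched_on: Optional[date]   # None until the warehouse ships it
    carrier_status: str             # e.g. "in_transit", "delayed", "delivered"
    expedited: bool
    lead_time_days: int             # per-product handling estimate

def wismo_route(order: Order, today: date) -> str:
    """Pick a WISMO handling path; each branch maps to a different
    response template or follow-up action."""
    age = (today - order.placed_on).days
    if order.dispatched_on is None:
        # Not shipped yet: normal processing, or a warehouse delay?
        if age <= order.lead_time_days:
            return "reassure_still_processing"
        return "escalate_warehouse_delay"
    if order.carrier_status == "delivered":
        return "confirm_delivery"
    in_transit = (today - order.dispatched_on).days
    allowed = 2 if order.expedited else 7   # illustrative SLA thresholds
    if order.carrier_status == "delayed" or in_transit > allowed:
        return "apologise_and_investigate"
    return "share_latest_tracking_update"
```

Every one of those branches is a decision a prompt-only agent has to infer correctly, from natural-language instructions, on every single conversation.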
This is why most prompt-based approaches only go as far as sending out tracking links, because it’s a relatively simple process.
Inconsistent outcomes
The longer and more complex a prompt becomes, the more room there is for the LLM to “improvise.” That means two customers asking the same question could receive different answers — simply because the model interpreted the instructions slightly differently.
For ecommerce brands, inconsistency isn’t just inconvenient — it’s risky. It erodes customer trust when people receive conflicting information. And in regulated industries, it’s unacceptable: the information provided must be accurate, consistent, and compliant every single time.
As this comment on Reddit points out about prompt-based systems: “It will respond differently to the exact same prompt. So you need to decide internally if it responds how you want it to 80% of the time is that acceptable? Or does it need to be 100%.”
What would you think of a human agent who answered questions correctly only 80% of the time? Surely an agent who gives the wrong answer 1 time in every 5 would not be a good addition to your team.
Hallucinations and missed instructions
Anyone who’s spent time prompting an LLM knows the frustration: sometimes it simply ignores instructions, or worse, makes things up entirely. This isn’t just a theoretical risk. Recently, Deloitte had to repay funds to the Australian government after a report generated with an LLM was found to contain significant errors.
The reality is that the longer and more complex your prompt becomes, the greater the chance the AI will miss a critical instruction – even one as explicit as “Do not give medical advice” or “Do not recommend product X to pregnant people.”
This is what often leads teams into an endless cycle of prompt-tuning. You fix one mistake, another appears. You rewrite the prompt, and a new issue emerges. It becomes a game of whack-a-mole — one that gets harder and more unpredictable as complexity grows.
Being helpful, not truthful
Another quirk of LLM behavior is a strong tendency to be helpful even when that means being inaccurate.
We recently tested a customer service AI agent with a simple scenario: we submitted photos of a damaged product. The agent replied confidently, telling us it had reviewed the images and that a replacement item would be shipped.
The problem? There was no confirmation email. No shipping notification. And months later, no replacement ever arrived.
In this case, the agent understood what we wanted but either lacked the ability to take the required action or hallucinated that it had done so. The result was an answer that sounded helpful and reassuring, but wasn’t truthful. This ultimately creates more frustration for the customer and more work for the brand.
Missed edge cases
Assuming you can account for everything above and build suitable guardrails against hallucinations, inconsistencies and missed instructions, what about edge cases? These are the situations that come up only occasionally and that you are unlikely to have planned for.
Does your prompt-only agent pass the query to a human, or does it try to answer anyway? Given how LLMs strive to be helpful, it’s likely to attempt an answer even when it doesn’t know the right one. How certain are you that it’s going to give the right information? Once again, if you are leaving this to chance, you are taking a big risk with your customers.
Why workflows are important
Let’s be totally clear: LLM-based agents can be great at small, simple tasks. But on their own, they are just not suitable for complex processes. That’s why most prompt-only agents only send tracking links in response to WISMO queries – it’s simple, and it sort of answers the question. Actually resolving most WISMO situations requires workflows working alongside LLM-based agents.
As OpenAI puts it: “Until now, building agents meant juggling fragmented tools—complex orchestration with no versioning, custom connectors, manual eval pipelines, prompt tuning, and weeks of frontend work before launch.”
When something goes wrong with a prompt-based agent, it can be hard to see where the error occurred. With workflows, it is much easier to diagnose and fix.
Workflows also bring predictability. Fundamentally, an LLM is still a probabilistic tool – it can behave the same way 99 times out of 100, or 999,999 times out of 1,000,000, but there is always a tiny chance it behaves differently.
Workflows, on the other hand, are deterministic. They are simple yes/no and if/then functions, so you know what they’ll do 100% of the time.
Imagine your agent needs to know how many days have passed between the order being placed and today. You can ask an LLM, and it will probably give the right answer. But there are plenty of examples on the internet of LLMs getting basic calculations wrong, whereas a correctly-coded comparison of dates, as in the sketch below, isn’t going to be wrong.
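A minimal example (the order date here is made up for illustration):

```python
from datetime import date

# Deterministic: the same inputs give the same, correct answer every time.
order_placed = date(2025, 9, 12)          # illustrative order date
days_since_order = (date.today() - order_placed).days
print(f"{days_since_order} days since the order was placed")
```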
The best approach, therefore, is a hybrid one: balancing workflows and LLM agents to get the best of both worlds.
This is how our hybrid approach works.

The Orchestration agent detects the intent in the customer’s message, then assigns the right agent to the next task. That could be generating an LLM response, finding information, or any number of other tasks. The Orchestration agent itself follows a workflow that is pre-defined according to the brand’s business processes and policies.
This approach means that the interactions feel real, but more importantly, the right outcome – i.e. the outcome the brand has specified – is reached. Ultimately, it means higher accuracy and reliability, because the agent is not exercising its own judgement but following each step in the right order.
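Conceptually, the routing works something like the sketch below. The intent names and handlers are hypothetical, not our actual agent definitions; the point is that an LLM may perform the classification, while the routing itself stays deterministic:

```python
from typing import Callable

def classify_intent(message: str) -> str:
    """In production this step is an LLM call; stubbed here with a
    keyword check so the sketch runs on its own."""
    return "wismo" if "order" in message.lower() else "unknown"

# Hypothetical specialist handlers, each with one defined responsibility.
def track_order(message: str) -> str:
    return "run the WISMO workflow"

def hand_to_human(message: str) -> str:
    return "escalate to a human agent"

# The brand's processes encoded as an explicit routing table,
# not buried inside one giant prompt.
ROUTES: dict[str, Callable[[str], str]] = {"wismo": track_order}

def orchestrate(message: str) -> str:
    intent = classify_intent(message)
    # Unrecognised intents escalate deterministically instead of improvising.
    return ROUTES.get(intent, hand_to_human)(message)
```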
When the right response really matters – healthcare and regulated industries
We work with a number of brands that sell health-related products, from spectacles to supplements. It is essential that these brands give customers the right answer every time. Our approach means brands can be incredibly prescriptive in the responses their agent gives when they need to be.
This means that within a single agent response there can be both a generative message and a scripted one, ensuring the customer always receives the correct, approved communication.
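As a sketch of how that composition can work – the scripted wording and function name below are invented for illustration:

```python
# Compliance-approved wording, stored verbatim: the model never paraphrases it.
APPROVED_ALLERGEN_NOTICE = (
    "This product is made in a facility that also handles nuts. "
    "Please check the full ingredient list before purchasing."
)

def build_response(generative_greeting: str) -> str:
    # The LLM contributes tone and personalisation; the scripted block
    # guarantees the regulated content goes out word-for-word.
    return f"{generative_greeting}\n\n{APPROVED_ALLERGEN_NOTICE}"
```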
If you are selling these sorts of products, you need to think about training an AI agent the same way you would train a human one. You want the agent to be friendly, personable and helpful, but you need it to give exactly the right response, eliminating any chance of liability.
That’s why prompt-based agents need workflows to keep them in check. A hybrid approach means:
More resolved tickets
Higher accuracy and reliability
Transparent accountability
Measurable resolution rates
Greater scalability across brands and territories
And the same flexibility and tone you want from generative AI
To find out more about how DigitalGenius can help you, speak to our team here.