AI agents are good at understanding intent, navigating nuance, drafting empathetic responses, and resolving complex queries – all without human intervention. But there's a class of task that LLMs are quietly, persistently bad at, and using them for it anyway is costing you accuracy, speed, and customer trust. That task is math(s).
The Numbers Don't Add Up
In June 2025, developer Maxim Saplin published a benchmark study testing leading LLMs on basic arithmetic: addition, subtraction, multiplication, and division of random numbers. Nothing complicated, just elementary operations.
The best-performing model, OpenAI's o4-mini, only achieved 97.08% accuracy. The worst, a distilled variant of DeepSeek-R1, managed just 10.21%. Do your customers want the right answer 97% of the time? I’m guessing not.
But what's most interesting are the models in the middle – the ones powering most enterprise AI deployments right now:
- Claude Sonnet 4 (without extended thinking): 50.00% – effectively a coin flip
- GPT-4.1: 48.12%
- Gemini 2.5 Pro: 55.83%
These are frontier models, used in production by major businesses. And on basic arithmetic, they're wrong roughly half the time.
This Isn't a Theoretical Problem
To illustrate how this plays out in practice, Saplin asked Grok 3 to list countries by their GDP per capita in Japanese Yen. A reasonable real-world task. The model found the source data, identified an exchange rate, and built a conversion table. So far so good, but then it got virtually every calculation wrong.
If you’re in customer service, you might be thinking that you don’t need your agent to do calculations. But what if a customer asks whether they qualify for the return window? Your agent has to work out the gap between today’s date and the order date before it can decide.
We’ve seen examples of agents getting today’s date wrong – so even when the calculation itself is accurate, one of the inputs is not. Other date-dependent questions include: "how long ago did this order ship?", "has this SLA been breached?", or "how many days until this voucher expires?"
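Every one of those date questions reduces to a deterministic calculation. A minimal sketch – the 30-day window, the 5-day shipping SLA, and the specific dates are illustrative assumptions, not real business rules:

```python
from datetime import date

def days_since(event_date: date, today: date) -> int:
    """Exact day count between an event and today – no estimation involved."""
    return (today - event_date).days

# Illustrative values – real ones come from your order system and policies.
today = date(2025, 6, 15)
order_date = date(2025, 5, 20)
voucher_expiry = date(2025, 7, 1)

return_window_open = days_since(order_date, today) <= 30   # assumed 30-day policy
sla_breached = days_since(order_date, today) > 5           # assumed 5-day SLA
days_until_voucher_expires = (voucher_expiry - today).days
```

None of this needs a language model: subtracting two `date` objects gives an exact `timedelta`, and the comparison is always right.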
What if someone asks whether a sofa will fit through their door? You don’t want your AI confidently telling customers it’ll be fine when it hasn’t understood the spatial parameters.
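The deterministic version of that spatial check is equally simple. A toy sketch, assuming the sofa is a rigid box and ignoring tilting manoeuvres mid-doorway (a real delivery check would be more involved):

```python
def fits_through_door(sofa_w: float, sofa_h: float, sofa_d: float,
                      door_w: float, door_h: float) -> bool:
    """Rigid-box check: at least one face of the sofa must pass the opening.

    Simplified assumption: the sofa moves straight through without tilting.
    """
    # Each pair of sofa dimensions is a candidate cross-section at the doorway.
    faces = [(sofa_w, sofa_h), (sofa_w, sofa_d), (sofa_h, sofa_d)]
    for a, b in faces:
        if (a <= door_w and b <= door_h) or (b <= door_w and a <= door_h):
            return True
    return False
```

Three comparisons, zero ambiguity – and if the answer is "no", the agent can say so with certainty instead of guessing.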
Why Don't LLMs Just Use a Calculator?
Modern LLMs can use tools to handle calculations, such as calling a Python function rather than attempting to compute the answer themselves. In theory, this solves the problem. In practice, as Saplin notes, models "rarely resort to using a tool call... hence the numbers produced by LLM are not trustworthy."
This is a bit like having an employee who's bad at mental arithmetic but owns a calculator – and keeps forgetting to use it. You can add instructions telling them to always reach for the calculator, but you've now introduced a dependency on consistent prompt adherence. If you can't guarantee that the agent will always use the right tool, you can't fully trust the output.
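One way around prompt adherence is to take the decision away from the model entirely: register the calculation as a named tool and let the orchestration layer, not the LLM, decide when it runs. A vendor-agnostic sketch – the tool registry and routing here are hypothetical, not any specific provider's API:

```python
from datetime import date

# Hypothetical registry of deterministic functions the agent may invoke.
TOOLS = {
    "days_between": lambda start, end: (end - start).days,
}

def answer_return_window(order_date: date, today: date,
                         window_days: int = 30) -> bool:
    """Orchestrator routes this intent straight to deterministic code.

    The LLM never performs the arithmetic – it only sees the result.
    """
    gap = TOOLS["days_between"](order_date, today)
    return gap <= window_days
```

The difference matters: instead of hoping the model remembers its calculator, the system guarantees the calculator is used.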
And even when it does use a tool, you've added latency, complexity, and processing cost to something that could, in essence, be a single line of code:
```python
is_return_window_open = (today - order_date).days <= 30
```
That runs in milliseconds. It is always correct. It costs almost nothing.
It's the same reason an Excel or Google Sheets formula gives you the right answer almost instantly: the computation is deterministic.
It's Expensive to Be Probably Right
There's another dimension here beyond accuracy. LLMs are computationally expensive. Every token in, every token out, costs money and takes time. As a general rule, the more complex the task, the more tokens it consumes.
At scale, across thousands of daily interactions, those unnecessary compute cycles add up fast. You're paying a premium to get an answer that's probably right. A well-designed system would give you the right answer for a fraction of the cost.
The Right Tool for the Job
If you need to get a nail through a piece of wood, a screwdriver will technically work. But it's much easier to just use a hammer.
LLMs are extraordinarily powerful tools for tasks involving language, reasoning, empathy, and judgement. They're the wrong tool for arithmetic. Using them anyway, and adding prompt engineering to compensate, is the AI equivalent of driving a nail with a screwdriver.
The better approach is to not treat every task as if it requires a language model. A well-designed AI agent should be built with a range of tools: specialist LLM agents for the tasks LLMs excel at, and deterministic code for the tasks that demand certainty.
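That split can be expressed as a simple router. A toy sketch with hypothetical task names and handlers – a production orchestrator would be far more sophisticated:

```python
from typing import Callable

# Hypothetical registry: deterministic handlers for tasks that demand certainty.
DETERMINISTIC: dict[str, Callable[..., object]] = {
    "return_window": lambda days_since_order: days_since_order <= 30,
    "voucher_days_left": lambda days: max(days, 0),
}

def call_llm_agent(task: str, **kwargs) -> str:
    # Placeholder for a specialist LLM agent handling language tasks.
    return f"[LLM handles: {task}]"

def route(task: str, **kwargs):
    """Send numeric tasks to code; everything else goes to an LLM agent."""
    if task in DETERMINISTIC:
        return DETERMINISTIC[task](**kwargs)
    return call_llm_agent(task, **kwargs)
```

The design choice is the point: arithmetic never reaches a probabilistic model, while drafting an empathetic reply never wastes a deterministic code path.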
How DigitalGenius Approaches This
This is the principle behind how DigitalGenius is built. Rather than routing every task through a single generalist LLM, our platform uses an orchestration agent that selects the right tool for each job.
As a result, your agent works from the right data and makes the right decision for your customers – who get accurate answers every time, not 97% of the time.
DigitalGenius builds AI that works – combining large language models with deterministic tooling under an orchestration layer designed to get the right answer, not just a plausible one. Speak to us today.