Filip showed us, through the real story of one agent over a year and a half in production. How he mapped the client's know-how, how he tuned accuracy in document categorization, how the production rollout went, and how he systematically monitors and improves the agent today. Among the most interesting parts were definitely the complications that came up during the cycle, mainly around integrations and security.
Most AI agent stories end at the demo. Ours starts there.
Very few companies in Central Europe run an AI agent that has lived in production for 18 months. We do. This is the story of one of them: a document-classification agent at a Slovak utility distributor, built in 2024, still running today, now scaled across three divisions. The numbers are real, the failures are real, and the lesson is the one no demo will ever show you.
The number that matters most. On the test set, the agent caught 100% of critical documents. Two weeks into production, that dropped to 60%. Today it sits at 99%. The distance between those three numbers is the entire job.
The agent reads building permits and decisions from construction offices. Volume is 25,000 documents a year, roughly 100 a day. Each one has to be sorted.
The first split is the one that carries risk. A document is critical when the construction it describes threatens the client’s infrastructure. Critical documents are only 0.4% of the total. The other 99.6% are non-critical, and those still have to be routed into 23 categories, then either archived to SharePoint or forwarded to the right person in Outlook.
The hard part is that the deciding detail can sit anywhere. A document can be one page or 150 pages, and the line that makes it critical can be on page 149. The inputs are scanned, written by different offices in different styles, sometimes at low resolution. The decision process lives in the head of one expert who has never written it down.
That is why this is an AI problem and not a set of keyword rules. The variability is too large to capture in a dictionary. The final critical-document prompt ended up holding 12 distinct rules, not one exact-match term.
We split an agent’s life into eight phases:
The temptation is to treat phases 1 through 5 as the project and the rest as maintenance. The opposite is true. The agent that ships is a living system that earns most of its value after it goes live, not a finished product you hand over and walk away from.
Here is the trap. On a clean test set, the agent scored 92.5% accuracy. Report that to a board and most people nod. It is the wrong number to nod at.
Accuracy tells you how often you were right overall. With critical documents at 0.4% of volume, you can score high on accuracy while missing the few cases that carry all the risk. Two metrics matter more:
For this case, recall is the metric that carries the value. It is the same logic as a tumour-detection system reading MRI scans. You do not lose sleep over the healthy patients flagged for a second look. You lose sleep over the sick patient the system missed.
So we set the targets deliberately. 98% recall on critical documents, because the human did 96 to 97% and we wanted to beat that. 70% precision, accepting that we would flag some false positives rather than miss a real one. 90% accuracy on the 23 categories. Pushing recall and precision both higher is possible, but it gets expensive fast, and beyond a point tuning one starts to break the other.
This is what no demo shows you.
One discipline made the recovery possible: we quantified the failure types. When 400 documents came back wrong, we did not fix 400 problems one by one. We found the repeating patterns. One cluster of 100 failures traced back to a single large builder filing a new kind of project. Fix the pattern, fix a hundred cases.
For the first two weeks, the human and the agent both did the full job. Only the human’s output counted. The agent ran alongside, measured but not trusted yet.
That parallel run is a required part of shipping safely rather than a vote of no confidence in the technology. After three months we trusted the agent enough to switch to a 5% random spot-check, enough to catch recall slipping without re-reading everything.
And recall does slip. Over the following year, two drifts arrived, around June and around October. A new document type, a new rule entering the decision process. Each time we caught it through the spot-check and fed the correction back into the prompt. An agent in production is never done. It is monitored.
If you remember one operational lesson, make it this one: integrations almost always take longer than building the agent.
Connecting to the document system, to Outlook, to SharePoint, is 20% technical and 80% coordination. In a large organisation, approving a SharePoint connection can mean moving through three departments and waiting months. It is a managerial project, sometimes a political one.
This shapes how we sequence the build. We de-risk the model first. The risk lives in the AI, not the plumbing. There is an old line about building a monkey that sings on a pedestal: you can build the pedestal and feel half done, but teaching the monkey to sing is the hard part. So we pilot the prompts and models on a meaningful sample before we touch a single integration. If the accuracy is not reachable, we stop, and we have saved the client the cost of everything downstream.
The agent saves roughly 70% of an employee’s time on this work. That was enough to justify it at 3 full-time equivalents, which is why it now runs across three divisions of the mailroom.
Running cost is predictable. An A4 page holds around 400 words, roughly 600 tokens. At about €105 per million tokens, the monthly model cost is a straightforward calculation against document volume. Platform-based pricing runs higher, on the order of 15 to 20 cents per page, which is exactly the trade-off worth modelling before you commit.
When the math stops working, you decommission. We have already retired one agent because the business process it served shrank below the line where the agent paid for itself. Retiring it on schedule is the lifecycle doing its job, not a project that failed.
The model layer keeps moving. The 2024 build used GPT-4o, three prompts, Azure Document Intelligence for OCR, and an agentic platform underneath. Today a multimodal model reads the scans directly, and the separate OCR step drops out. When we migrate a client to a newer or cheaper model, the decision rests on one thing: evaluation. With 18 months of spot-checked examples behind us, we can measure whether the new model holds before we switch.
That is the thread through all eight phases. You cannot manage what you do not measure, and the number you measure has to be the one that carries the value.
Filip’s closing line fits the whole story: useful beats perfect, every single time. A solution that clears the acceptance criteria and runs is worth far more than a perfect one that never ships.
It runs through eight phases: discovery and triage, feasibility and risk, design and implementation, validation and testing, production rollout, monitoring and operations, drift and retraining, and finally decommissioning. The build phases get the attention, but most of the value is earned after launch, in monitoring, retraining against drift, and the feedback loops that hold accuracy steady.
When the critical class is rare, for example 0.4% of all documents, a model can post high overall accuracy while still missing the few cases that carry the risk. Recall measures how many of the genuinely critical documents you caught. A missed critical document is the costly error, so recall is the metric to optimise, even at the expense of some precision.
For document-processing agents, typically three to six months. In this case about 3,500 documents flowed through the agent over three months of tuning before it stabilised. The timeline depends on document volume, the number of categories, and how often the underlying process changes.
This agent saves roughly 70% of an employee’s time on the task. That made it worth building at three full-time equivalents, and it now runs across three divisions. Running cost is predictable from token volume, which makes the cost-versus-value case straightforward to model before committing.
Yes. The agent runs in parallel with a human at first, with only the human’s output trusted. After it proves out, a 5% random spot-check stays in place to catch drift, such as new document types or new rules entering the decision process.