AI agent lifecycle after 1.5 years in production

Filip showed us, through the real story of one agent over a year and a half in production. How he mapped the client's know-how, how he tuned accuracy in document categorization, how the production rollout went, and how he systematically monitors and improves the agent today. Among the most interesting parts were definitely the complications that came up during the cycle, mainly around integrations and security.

AI agent lifecycle after 1.5 years in production

Most AI agent stories end at the demo. Ours starts there.

Very few companies in Central Europe run an AI agent that has lived in production for 18 months. We do. This is the story of one of them: a document-classification agent at a Slovak utility distributor, built in 2024, still running today, now scaled across three divisions. The numbers are real, the failures are real, and the lesson is the one no demo will ever show you.

The number that matters most. On the test set, the agent caught 100% of critical documents. Two weeks into production, that dropped to 60%. Today it sits at 99%. The distance between those three numbers is the entire job.

The case: 25,000 documents a year, 0.4% of them critical

The agent reads building permits and decisions from construction offices. Volume is 25,000 documents a year, roughly 100 a day. Each one has to be sorted.

The first split is the one that carries risk. A document is critical when the construction it describes threatens the client’s infrastructure. Critical documents are only 0.4% of the total. The other 99.6% are non-critical, and those still have to be routed into 23 categories, then either archived to SharePoint or forwarded to the right person in Outlook.

The hard part is that the deciding detail can sit anywhere. A document can be one page or 150 pages, and the line that makes it critical can be on page 149. The inputs are scanned, written by different offices in different styles, sometimes at low resolution. The decision process lives in the head of one expert who has never written it down.

That is why this is an AI problem and not a set of keyword rules. The variability is too large to capture in a dictionary. The final critical-document prompt ended up holding 12 distinct rules, not one exact-match term.

The lifecycle has eight phases, and most of the work is after launch

We split an agent’s life into eight phases:

  1. Discovery and triage. Understand what the client actually needs, and whether it is worth doing.
  2. Feasibility and risks. Can it be built, and what breaks if it is.
  3. Design and implementation. The build.
  4. Validation and testing. Measure whether what we built is good enough. Phases 3 and 4 loop.
  5. Production rollout.
  6. Monitoring and operations.
  7. Drift and retraining. The agent does not adapt to new document types on its own. We make sure it does.
  8. Decommissioning.

The temptation is to treat phases 1 through 5 as the project and the rest as maintenance. The opposite is true. The agent that ships is a living system that earns most of its value after it goes live, not a finished product you hand over and walk away from.

Why accuracy is the wrong number

Here is the trap. On a clean test set, the agent scored 92.5% accuracy. Report that to a board and most people nod. It is the wrong number to nod at.

Accuracy tells you how often you were right overall. With critical documents at 0.4% of volume, you can score high on accuracy while missing the few cases that carry all the risk. Two metrics matter more:

  • Precision asks: of the documents flagged critical, how many really were? A false positive here means extra work. Someone checks a document that did not need checking.
  • Recall asks: of the documents that really were critical, how many did we catch? A false negative here is the dangerous one. A missed critical document means a real risk to infrastructure goes unseen.

For this case, recall is the metric that carries the value. It is the same logic as a tumour-detection system reading MRI scans. You do not lose sleep over the healthy patients flagged for a second look. You lose sleep over the sick patient the system missed.

So we set the targets deliberately. 98% recall on critical documents, because the human did 96 to 97% and we wanted to beat that. 70% precision, accepting that we would flag some false positives rather than miss a real one. 90% accuracy on the 23 categories. Pushing recall and precision both higher is possible, but it gets expensive fast, and beyond a point tuning one starts to break the other.

100% to 60% to 99%: the journey of the numbers

This is what no demo shows you.

  • Test set (150 to 200 documents): 100% recall, 80% precision, 96% accuracy. A small sample. We knew something would shift in the real world.
  • First two weeks in production: recall fell to 60%. Unacceptable. Below the human, and below the threshold the client could trust.
  • Three months of tuning: about 3,500 documents flowed through the agent. Every two weeks the client sent back the ground truth, we analysed the failures, found the patterns, and fed them into the prompt. Recall climbed 60, then 81, then back near 100.
  • Today, 18 months in: 99% recall on critical documents, 75% precision97% accuracy across the 23 categories. Now built on thousands of examples, not 200.

One discipline made the recovery possible: we quantified the failure types. When 400 documents came back wrong, we did not fix 400 problems one by one. We found the repeating patterns. One cluster of 100 failures traced back to a single large builder filing a new kind of project. Fix the pattern, fix a hundred cases.

Running human and agent in parallel is how you ship safely

For the first two weeks, the human and the agent both did the full job. Only the human’s output counted. The agent ran alongside, measured but not trusted yet.

That parallel run is a required part of shipping safely rather than a vote of no confidence in the technology. After three months we trusted the agent enough to switch to a 5% random spot-check, enough to catch recall slipping without re-reading everything.

And recall does slip. Over the following year, two drifts arrived, around June and around October. A new document type, a new rule entering the decision process. Each time we caught it through the spot-check and fed the correction back into the prompt. An agent in production is never done. It is monitored.

The AI was the easy part

If you remember one operational lesson, make it this one: integrations almost always take longer than building the agent.

Connecting to the document system, to Outlook, to SharePoint, is 20% technical and 80% coordination. In a large organisation, approving a SharePoint connection can mean moving through three departments and waiting months. It is a managerial project, sometimes a political one.

This shapes how we sequence the build. We de-risk the model first. The risk lives in the AI, not the plumbing. There is an old line about building a monkey that sings on a pedestal: you can build the pedestal and feel half done, but teaching the monkey to sing is the hard part. So we pilot the prompts and models on a meaningful sample before we touch a single integration. If the accuracy is not reachable, we stop, and we have saved the client the cost of everything downstream.

The math that makes it worth it

The agent saves roughly 70% of an employee’s time on this work. That was enough to justify it at 3 full-time equivalents, which is why it now runs across three divisions of the mailroom.

Running cost is predictable. An A4 page holds around 400 words, roughly 600 tokens. At about €105 per million tokens, the monthly model cost is a straightforward calculation against document volume. Platform-based pricing runs higher, on the order of 15 to 20 cents per page, which is exactly the trade-off worth modelling before you commit.

When the math stops working, you decommission. We have already retired one agent because the business process it served shrank below the line where the agent paid for itself. Retiring it on schedule is the lifecycle doing its job, not a project that failed.

Useful beats perfect

The model layer keeps moving. The 2024 build used GPT-4o, three prompts, Azure Document Intelligence for OCR, and an agentic platform underneath. Today a multimodal model reads the scans directly, and the separate OCR step drops out. When we migrate a client to a newer or cheaper model, the decision rests on one thing: evaluation. With 18 months of spot-checked examples behind us, we can measure whether the new model holds before we switch.

That is the thread through all eight phases. You cannot manage what you do not measure, and the number you measure has to be the one that carries the value.

Filip’s closing line fits the whole story: useful beats perfect, every single time. A solution that clears the acceptance criteria and runs is worth far more than a perfect one that never ships.

Key takeaways

  • An AI agent that ships is a living system that earns most of its value in the run phase, not a finished product you hand over at the build.
  • Accuracy can hide the failures that matter. For rare-but-critical classification, recall is the metric that carries the value.
  • Expect production to be worse than your test set. This agent went 100% to 60% to 99% over 18 months. The recovery came from disciplined feedback loops, not a better first prompt.
  • Running human and agent in parallel, then spot-checking 5%, is how you ship safely and catch drift.
  • Integrations are 80% coordination, not code. De-risk the model first.
  • ROI here is 70% of an employee’s time saved, justified at 3 FTE, now scaled to three divisions.

FAQ

What is the lifecycle of an AI agent in production?

It runs through eight phases: discovery and triage, feasibility and risk, design and implementation, validation and testing, production rollout, monitoring and operations, drift and retraining, and finally decommissioning. The build phases get the attention, but most of the value is earned after launch, in monitoring, retraining against drift, and the feedback loops that hold accuracy steady.

Why is recall more important than accuracy for document classification?

When the critical class is rare, for example 0.4% of all documents, a model can post high overall accuracy while still missing the few cases that carry the risk. Recall measures how many of the genuinely critical documents you caught. A missed critical document is the costly error, so recall is the metric to optimise, even at the expense of some precision.

How long does it take to stabilise an AI agent in production?

For document-processing agents, typically three to six months. In this case about 3,500 documents flowed through the agent over three months of tuning before it stabilised. The timeline depends on document volume, the number of categories, and how often the underlying process changes.

What is the ROI of an AI document-processing agent?

This agent saves roughly 70% of an employee’s time on the task. That made it worth building at three full-time equivalents, and it now runs across three divisions. Running cost is predictable from token volume, which makes the cost-versus-value case straightforward to model before committing.

Do you keep humans in the loop after the agent is live?

Yes. The agent runs in parallel with a human at first, with only the human’s output trusted. After it proves out, a 5% random spot-check stays in place to catch drift, such as new document types or new rules entering the decision process.

Related Webinars