What 34 Implementations Taught Us About Delivering AI into Production
By Filip Staňo · March 2026 · Based on Ableneo External Tech Workshop vol. 45
Most companies are not failing because they lack access to AI. They are failing because they cannot get AI to work inside real systems, with real data, under real constraints.
This was the premise of our 45th External Tech Workshop, where we opened the books on 34 AI projects delivered over the course of 2025. Not a product demo. Not a strategy deck. A structured walkthrough of what we built, where it broke, and what we would do differently.
The session covered four project deep dives, a statistical overview of our delivery portfolio, and a candid list of ten lessons distilled from a year of production-grade AI engineering, with live Q&A from 60 attendees in the room and an online audience joining remotely.
Before we dive into individual cases, the numbers tell their own story. Across 2025, our AI engineering team delivered 34 solutions for 26 clients in 10 industry segments.
The largest concentration was in financial services. Banking accounted for 7 projects, insurance for 6, and telecom for 5. An additional 3 projects fell into broader financial services, and 3 into energy. Manufacturing contributed 3 more. The remaining projects spanned several other verticals.
Half of all implementations landed in regulated industries where governance, data sensitivity, and compliance are not optional considerations. They are constraints that shape every architectural decision from day one.
We categorized the solutions into three groups. 22 of the 34 projects were automation agents or workflow-based systems. These are processes that run in the background: an invoice arrives, data is extracted, validated against a system of record, and routed further. No human interface needed.
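The invoice flow described above can be sketched as a small pipeline. This is an illustrative stand-in, not our production code: the vendor registry, field names, and routing targets are all hypothetical.

```python
# Hypothetical background automation flow: an invoice arrives, fields are
# extracted, validated against a system of record, and routed onward.
from dataclasses import dataclass

@dataclass
class Invoice:
    vendor_id: str
    amount: float
    currency: str

# Stand-in for the system of record the invoice is validated against.
KNOWN_VENDORS = {"V-001": "Acme s.r.o."}

def validate(inv: Invoice) -> list[str]:
    """Return a list of validation errors; an empty list means it passes."""
    errors = []
    if inv.vendor_id not in KNOWN_VENDORS:
        errors.append("unknown vendor")
    if inv.amount <= 0:
        errors.append("non-positive amount")
    return errors

def route(inv: Invoice) -> str:
    """Valid invoices continue automatically; failures go to human review."""
    return "erp-posting" if not validate(inv) else "manual-review"
```

The point of the sketch is the shape, not the rules: everything runs in the background, and a human only sees the cases that fail validation.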
8 projects were interactive agents, chatbots with knowledge bases, analytical assistants, and conversational interfaces built on RAG architectures. 3 were complex AI applications combining automation with user-facing interfaces, document audit trails, and human-in-the-loop verification. 1 was a dedicated on-premises LLM deployment.
13 solutions ran on SaaS platforms such as FlowHunt or UiPath, where the processing is partially low-code with some programmatic customization. 11 ran on Azure, mostly written in Python, using Azure OpenAI and Azure Document Intelligence as core components. 6 ran on AWS. 4 involved locally deployed models on client-owned hardware.
32 of the 34 projects used large language models. Only 2 involved custom-trained models. The business case for LLM-based solutions was simply stronger in 2025. They had faster ramp-up times, were easier to deliver, and started generating value much sooner than bespoke model training.
We observed no meaningful performance difference between OpenAI, Anthropic, and Google models for the extraction and classification tasks we handled. The deciding factor in choosing a provider was usually the client’s existing cloud infrastructure, not model performance.
One of our telecom clients sat on terabytes of recorded customer calls. The data existed, but it had never been systematically analyzed. The question was straightforward: what are customers actually calling about?
The system needed to transcribe recordings, extract structured data from the transcripts, and expose that data through a conversational interface. A business user should be able to ask a question like “What problems do corporate clients in Bratislava most frequently report?” and receive a precise, data-backed answer.
Behind this sits a translation problem. Natural language questions must be converted into correct database queries. That conversion is considerably harder than it sounds, especially in Slovak, where semantic mapping between conversational phrasing and structured data columns is imprecise.
We delivered a systematic extraction pipeline that pulls structured attributes from call transcripts and stores them in a queryable format. On top of that, we built a chat application that translates user questions into database queries and returns answers.
The system supports persona-based access. A product manager, for example, sees a pre-tuned set of query types optimized for their role. Expanding the system to handle a wider range of question types is an ongoing effort.
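One way to make the natural-language-to-SQL translation tractable is to gate each persona behind a fixed set of query templates, so the model only has to select a template and fill its parameters rather than write arbitrary SQL. The schema, persona names, and template below are illustrative, not the client's actual data model.

```python
# Hypothetical persona-gated query templates. The LLM's job shrinks from
# "write SQL" to "pick a template and supply parameters", which is far
# easier to validate, especially for Slovak-language questions.
PERSONA_TEMPLATES = {
    "product_manager": {
        "top_issues_by_segment": (
            "SELECT issue_category, COUNT(*) AS n FROM call_facts "
            "WHERE customer_segment = %(segment)s AND city = %(city)s "
            "GROUP BY issue_category ORDER BY n DESC LIMIT %(limit)s"
        ),
    },
}

def build_query(persona: str, template_name: str, params: dict):
    """Return (sql, params) if the persona is allowed this query type."""
    templates = PERSONA_TEMPLATES.get(persona, {})
    if template_name not in templates:
        raise PermissionError(f"{persona} may not run {template_name}")
    return templates[template_name], params
```

A question like "What problems do corporate clients in Bratislava most frequently report?" then maps to `build_query("product_manager", "top_issues_by_segment", {...})` with parameterized values, keeping the database safe from model-generated SQL.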
The biggest gap was the absence of a ground truth dataset. Without a validated baseline of correct answers, it was extremely difficult to evaluate whether the system was performing well. We would insist on establishing an evaluation dataset from the start if we were to do this again.
This pattern is not limited to call analytics. The same technology applies to emails, contracts, internal messages, and any unstructured text archive that was never designed for analysis. Large language models made it economically viable to extract value from data sources that were previously too expensive to process.
In the insurance sector, we built a document management application for risk reports. These are dense documents, often 120 pages, describing all potential risks for insuring a property or industrial facility.
The client needed 40 specific paragraphs extracted from each report. Each paragraph had to follow a defined structure, and every claim in the output had to be traceable to a specific section, or even a specific image, in the source document.
The documents varied in structure. Different risk engineering firms across different countries produced reports in different formats. There was no reliable template to parse against.
The application lets a user upload a document and receive a structured summary. Each extracted paragraph links back to the exact source location in the original file. If the information came from an image, a visual indicator points the user to the relevant page.
This traceability was essential. The underwriter who uses the tool remains responsible for the accuracy of the output. The application does not replace their judgment. It accelerates their review by approximately 90 percent. When a user can verify any claim in seconds by clicking through to the source, trust in the system becomes practical, not theoretical.
For documents exceeding the LLM context window, we process them incrementally. We read the document in segments, progressively filling a matrix of required data points. When conflicting information appears across segments, a resolution algorithm uses surrounding context to select the correct value.
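The segment-wise merge can be sketched as follows. The field names are invented for illustration, and the conflict resolution here is a simple majority vote, a stand-in for the context-based resolution algorithm described above.

```python
# Minimal sketch of incremental long-document extraction: read the document
# in segments, progressively fill a matrix of required fields, and resolve
# conflicts. Majority voting stands in for context-based resolution.
from collections import defaultdict

REQUIRED_FIELDS = ["site_address", "sprinkler_coverage", "max_loss_estimate"]

def merge_segments(segment_results: list[dict]) -> dict:
    """segment_results: one dict of extracted fields per document segment."""
    votes = defaultdict(lambda: defaultdict(int))
    for seg in segment_results:
        for field, value in seg.items():
            if value is not None:
                votes[field][value] += 1
    merged = {}
    for field in REQUIRED_FIELDS:
        if votes[field]:
            # Keep the value supported by the most segments.
            merged[field] = max(votes[field], key=votes[field].get)
        else:
            merged[field] = None  # field still unfilled after all segments
    return merged
```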
Our longest prompt for a single extraction task reached 1,400 lines. Even for “simple” data extraction, the complexity of real-world documents demands detailed, specific instructions.
For this project, we used document intelligence services rather than multimodal LLMs for image-based data. The reason was traceability. Document intelligence tools return precise location metadata, telling us exactly which bounding box on which page produced a given piece of information. That metadata powers the source referencing feature. For pure data extraction without location tracking, multimodal LLMs are increasingly viable and sometimes simpler.
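The location metadata can be carried alongside each extracted value in a small record like the one below. The field names are illustrative; the actual structure returned by a document-intelligence service is richer, but the idea is the same: every value keeps a pointer back to its page and bounding box.

```python
# Hypothetical traceability record: each extracted value is bundled with the
# page and bounding box it came from, which powers the click-through source
# links and the "this came from an image" indicator in the UI.
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceRef:
    page: int
    bbox: tuple   # (x0, y0, x1, y1) in page coordinates
    from_image: bool

def attach_source(value: str, ref: SourceRef) -> dict:
    """Bundle an extracted value with the location it was read from."""
    return {
        "value": value,
        "source": f"page {ref.page}",
        "bbox": ref.bbox,
        "flag_image": ref.from_image,  # triggers the visual indicator
    }
```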
For a telecom client, we built a RAG chatbot that needed to stay current in near real-time. The knowledge base monitored a SharePoint repository, indexing new documents as they appeared and removing outdated ones at regular intervals.
People tend to assume that building the initial knowledge base is the hard part. In practice, maintenance is harder. Documents about service outages, for example, are relevant for a limited time. After an issue is resolved, the information must be removed to prevent the chatbot from returning stale answers.
The system watches the document source, detects additions and deletions, and updates its vector database accordingly. Currently, a human operator still manages the lifecycle of time-sensitive content, flagging documents for removal when they are no longer relevant.
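The core of the sync loop is a diff between what the document source currently contains and what the vector index believes it contains. This sketch assumes both sides can be listed as `doc_id -> last-modified` maps; the SharePoint client and vector store calls are deliberately left out.

```python
# Sketch of knowledge-base synchronization: compare the live document
# listing against the index state, then plan additions and removals.
def plan_sync(source_docs: dict, indexed_docs: dict):
    """Both arguments map doc_id -> last-modified timestamp.

    Returns (to_add, to_remove): documents that are new or updated in the
    source, and documents that were deleted from the source and must be
    purged from the vector database to avoid stale answers.
    """
    to_add = [d for d, ts in source_docs.items()
              if d not in indexed_docs or indexed_docs[d] < ts]
    to_remove = [d for d in indexed_docs if d not in source_docs]
    return to_add, to_remove
```

Deletions are the part people forget: an outage document removed from SharePoint must also disappear from the index, or the chatbot keeps citing it.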
This project reinforced a pattern we saw across many implementations. The quality of the AI output is directly tied to the quality of the underlying data. When the knowledge base is clean and current, the chatbot performs well. When it is stale or inconsistent, performance degrades regardless of how good the model is.
One client, a partially government-affiliated Austrian organization, required full data sovereignty. No cloud. No external API calls. Everything on-premises.
We set up DeepSeek and Whisper models on two NVIDIA H100 GPUs with 94 GB of VRAM each. The primary use case was transcribing audio recordings from meetings in German and enabling conversational search over the transcripts.
On a single GPU, the setup comfortably supported around 30 concurrent users with a context window of approximately 10,000 tokens each. With two GPUs, we scaled to 100 concurrent users, but the gain was in throughput, not latency. The response speed per user stayed roughly the same.
The key insight was that additional GPUs improve parallelism, not individual request speed. Planning hardware purchases requires knowing three things: the workload (document sizes), the concurrency target (simultaneous users), and the acceptable response time.
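The sizing arithmetic is simple once per-GPU capacity is known, but that capacity must come from a load test, not a datasheet: in our case one GPU served about 30 users while two served about 100, so batching efficiency changed with scale. The helper below treats measured capacity as an input for that reason.

```python
import math

def gpus_needed(target_users: int, measured_users_per_gpu: int) -> int:
    """Ceiling of concurrency target over measured per-GPU capacity.

    measured_users_per_gpu should come from a load test at your real
    context-window size, since batching efficiency is not linear in
    GPU count. Extra GPUs buy throughput, not per-request latency.
    """
    return math.ceil(target_users / measured_users_per_gpu)
```

For example, at a measured 50 users per GPU in the two-GPU configuration, a 100-user target needs 2 GPUs; the same target at 30 users per GPU would need 4.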
The hardware purchase is only the beginning. Ongoing costs include infrastructure maintenance, model lifecycle management (new models emerge every few months), energy consumption, and dedicated staff to operate the system. Hidden costs include vendor lock-in to specific GPU generations, delayed adoption of newer models, and the internal capacity required to keep the system running.
On-premises deployment makes sense when regulatory requirements demand it, when data sensitivity genuinely justifies it, and when request volumes are high enough to offset the capital expenditure. Otherwise, cloud infrastructure is the faster and more flexible option.
Start with a clearly defined business problem. Not with a search for where to apply AI. Many organizations hand teams a tool and ask them to find a use for it. That approach rarely produces meaningful results. The successful projects in our portfolio all started from a specific, measurable pain point.
Whether we used GPT-4, Claude, or Gemini made far less difference than the quality and representativeness of the data. The data must reflect real production conditions. Synthetic test sets are useful for iteration, but they must correlate with actual performance. We measured this correlation explicitly, and it held.
Several projects launched on samples of just 50 to 100 documents. That is not enough to cover all edge cases. These solutions require ongoing monitoring, regular evaluation cycles, and a mechanism to feed new examples back into the system. Build for iteration from the start.
Local deployment is appealing in theory. In practice, the costs extend far beyond hardware. Factor in maintenance staff, model refresh cycles, energy, and the opportunity cost of slower innovation. Calculate carefully before buying GPUs.
The projects with the best trajectories all established evaluation frameworks at the outset. A clear metric, a test dataset, an agreed definition of what “correct” looks like. For chatbot solutions, we built a dedicated evaluation service that generates synthetic question-answer pairs, tracks retrieval quality, and aggregates errors by frequency rather than recency. We use LLM-as-judge approaches for comparing generated outputs to expected answers, replacing older metrics like ROUGE and BLEU, which proved less reliable.
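The "frequency rather than recency" aggregation is worth making concrete. In this sketch, the judge step is already done (in production an LLM-as-judge labels each failure); what remains is the ranking that decides what to fix first. The category names are illustrative.

```python
# Sketch of error aggregation for an evaluation service: rather than
# chasing whatever failed most recently, count failures per category
# and fix the most frequent ones first.
from collections import Counter

def rank_failures(judged_results: list[dict]):
    """judged_results: dicts like {'ok': bool, 'category': str | None},
    one per evaluated question-answer pair, as labeled by the judge."""
    failures = Counter(r["category"] for r in judged_results if not r["ok"])
    return failures.most_common()  # [(category, count), ...] descending
```

Ranking this way keeps the team working on the systematic problem (say, retrieval misses) instead of the loudest anecdote.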
Teams frequently demand 100 percent accuracy from AI before they will adopt it. Multiple studies show that human error rates in manual data entry run between 3 and 5 percent. A system performing at 97 percent accuracy is already at human level. Set realistic benchmarks and build verification mechanisms rather than waiting for perfection.
Quantifying the actual business impact of an AI solution requires a robust methodology. In some cases, such as call center deflection, the measurement is straightforward. In others, the benefit is diffuse and difficult to isolate. Plan for measurement from the beginning. If you cannot measure it, you cannot defend it.
This was consistently our biggest bottleneck. We delivered working solutions with high accuracy that then waited months for integration into corporate systems. Access gateways, security reviews, API approvals. The technical development is often the easier half of the project.
The most mature clients settle governance questions before development begins. Data classification, risk categorization, EU AI Act compliance assessment. When security is treated as an afterthought, it becomes the project’s primary blocker. When addressed from day one, it becomes an enabler.
Across 34 projects, we adapted 34 times. No two clients had the same data formats, infrastructure, regulatory requirements, or organizational maturity. There is no universal playbook. Strategy and approach must be tailored to each specific context. The variability is the work.
The majority of our 2025 portfolio consisted of automation agents that process documents, invoices, and contracts in the background, plus interactive chatbot and RAG systems for knowledge retrieval. We also build complex AI applications with human-in-the-loop verification and deploy models on local infrastructure when required.
We work with OpenAI, Anthropic, and Google models, selected based on the client’s existing cloud infrastructure rather than model benchmarks. For the tasks we handle, primarily extraction, classification, and summarization, performance differences between providers are negligible.
We use a dedicated evaluation service that generates synthetic test data, measures retrieval quality in RAG systems, and uses LLM-as-judge for comparing outputs. We track error frequency and categorize failures by type (retrieval failure vs. generation failure) to prioritize improvements systematically.
On-prem is justified when regulatory requirements demand data sovereignty, when data sensitivity genuinely warrants it, and when request volumes are high enough to offset the capital and operational costs. For most use cases, cloud-based deployment offers faster innovation and lower total cost of ownership.
Yes. Our consulting team works with organizations to identify which processes are suitable for AI, map out an adoption roadmap, and assess organizational readiness. For one of the largest Slovak enterprises, we conducted a full AI transformation audit across the organization.
This article is based on Ableneo External Tech Workshop vol. 45, held in March 2026. The session was led by Filip Staňo with contributions from Andrej and the Ableneo AI engineering team. 34 projects, 26 clients, 10 industries, one year of production delivery.