Essay #13 of 14 · Filed under AI & automation

The Future of AI Engineering: From Demo to System

The first AI engineering wave was about models. The second wave is about systems — the orchestration, evaluation, and human boundaries around models. Here is what working AI engineering actually looks like in 2026, and where it is heading.

Published: 15 January 2024
Updated: 16 May 2026
Read time: 9 min
Words: 1,752
Tags: ai · engineering · future


The first wave of AI engineering was about the models. Whose model is best? What benchmark? What inference cost?

That conversation is mostly over. The flagship models from OpenAI, Anthropic, Google, and Meta are now close enough in capability that the choice rarely decides a project. The differences between them matter for specific use cases. For most engineering work, you pick one and move on.

The second wave — the one we are in now — is about everything around the model. Orchestration. Evaluation. Tool sets. Boundary design. Cost management. Failure handling. This is where the actual engineering work has migrated, and where most of the value gets created or destroyed.

What changed when "the model" stopped being the bottleneck

For roughly three years (2022–2025), every AI engineering conversation centred on which model to use. The right answer changed every six months. Teams burned engineering quarters re-architecting around the latest model. The industry behaved as if model quality was the only variable that mattered.

That phase ended roughly in early 2026. The models converged into a usable plateau, not because progress stopped, but because further gains brought diminishing returns for most production work. A 2026 senior engineer reading a 2024 architecture deck recognizes the same shapes underneath the model names. The model is interchangeable now. The system around it is not.

What this means in practice:

The valuable engineering work moved up the stack. Prompt design, tool orchestration, output evaluation, fallback paths, cost routing. Building a system that uses Claude (or GPT-5, or Gemini) responsibly is harder than choosing which of those models to use.
"AI engineer" is the new "full-stack developer." A loose title that covers a wide range of skills, most of which have nothing to do with training models. In 2026 an AI engineer is usually doing system design with LLMs as one component among many.
The bottleneck shifted from capability to evaluation. Knowing whether your AI system is getting better or worse over time is harder than building the system in the first place.

Building an AI system in 2026 is mostly not an AI problem. It is a distributed-systems problem with an inscrutable component in the middle.

[Figure: The 2026 stack, drawn as three horizontal layers labeled PRODUCT, PLATFORM, and INFRA. The platform layer, where AI systems live, is where the engineering judgment matters most.]

[Figure: What changed. A five-year curve of model capability flattened in 2026. The marginal model is now a commodity. The engineering above it is not.]

The four engineering disciplines that actually matter now

If I were hiring an AI engineer in 2026 — and I have been, for client projects — these are the four things I screen for. The model knowledge is assumed. The interesting capabilities are below.

1 · Tool set design

Most production AI work in 2026 is not "ask the model a question and use the answer." It is "give the model a set of tools and let it orchestrate." The engineering work is designing the tool set.

A good tool set has:

Five to fifteen functions, not fifty. More tools means slower decisions and more model confusion.
Clear input contracts. Every function has a typed signature the model can reliably produce.
Explicit failure modes. What happens when a tool call fails? When it returns ambiguous results? When the model loops on the same tool five times?
Logged decision traces. Every tool call is recorded with its rationale. You will need this when something looks off in production.

The teams I have worked with that ship reliable AI systems treat tool set design as a first-class engineering discipline. The teams that fail treat it as configuration.
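
The four properties above can be sketched in a few dozen lines. This is a minimal illustration, not a framework: the names Tool, ToolSet, and lookup_order are hypothetical, the input contract is a plain mapping of parameter names to Python types, and the loop guard simply refuses the fifth consecutive identical call.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    params: dict[str, type]      # input contract: parameter name -> expected type
    run: Callable[..., Any]


@dataclass
class ToolSet:
    tools: dict[str, Tool] = field(default_factory=dict)
    trace: list[dict[str, Any]] = field(default_factory=list)
    max_repeats: int = 5         # assumption: refuse the 5th identical call in a row

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, args: dict[str, Any], rationale: str) -> dict[str, Any]:
        # Every call is recorded with its rationale before anything runs.
        entry: dict[str, Any] = {"tool": name, "args": args, "rationale": rationale}
        self.trace.append(entry)
        entry["result"] = self._execute(name, args)
        return entry["result"]

    def _execute(self, name: str, args: dict[str, Any]) -> dict[str, Any]:
        tool = self.tools.get(name)
        if tool is None:
            return {"ok": False, "error": f"unknown tool: {name}"}
        # Contract check: exact parameter names, then types.
        if set(args) != set(tool.params):
            return {"ok": False, "error": f"expected parameters {sorted(tool.params)}"}
        for key, expected in tool.params.items():
            if not isinstance(args[key], expected):
                return {"ok": False, "error": f"{key} must be {expected.__name__}"}
        # Loop guard: count the trailing run of identical calls (this one included).
        repeats = 0
        for past in reversed(self.trace):
            if past["tool"] != name or past["args"] != args:
                break
            repeats += 1
        if repeats >= self.max_repeats:
            return {"ok": False, "error": "loop detected"}
        # Explicit failure mode: tool errors become data, not crashes.
        try:
            return {"ok": True, "value": tool.run(**args)}
        except Exception as exc:
            return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


tools = ToolSet()
tools.register(Tool("lookup_order", {"order_id": str}, lambda order_id: {"status": "shipped"}))
result = tools.call("lookup_order", {"order_id": "A1"}, "user asked for order status")
```

Returning failures as structured results rather than raising lets the model see the error and choose a different step, while the trace keeps the full decision history for later debugging.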

2 · Evaluation at scale

This is the unsolved problem of 2026 AI engineering. Building a system that produces useful output is achievable. Knowing whether the system is getting better or worse over time is much harder.

The standard approach today is some combination of:

Golden dataset evaluation. A curated set of inputs with known-good outputs. Run the system across it on every change. Track pass rate over time.
LLM-as-judge. Use a (usually different, usually stronger) model to evaluate outputs. Reasonable for many tasks, expensive, has its own biases.
Human review sampling. Random 1–5% of production outputs reviewed by humans. Slow, expensive, but uniquely valuable for catching what automated evaluation misses.
Implicit signal. Did the user accept the output? Did they edit it? Did they downvote it? Cheap but lagging and noisy.

No team I have seen does all four well. Most do one or two and pretend that is enough. The teams that ship AI features and then quietly turn them off in month nine are almost always teams that under-invested in evaluation.
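
As a rough sketch of the first and third approaches, the snippet below computes a golden-dataset pass rate, flags a regression against a stored baseline, and draws a random sample for human review. All names are hypothetical, and the exact-match comparison is an assumption that only suits tasks with a single correct answer, such as classification.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    input: str
    expected: str


def pass_rate(system: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of golden cases where the system output matches the expected output."""
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = sum(
        1 for case in cases
        if system(case.input).strip().lower() == case.expected.strip().lower()
    )
    return passed / len(cases)


def is_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when the pass rate dropped by more than the tolerance (an assumed 2 points)."""
    return baseline - current > tolerance


def sample_for_review(outputs: list[str], rate: float, seed: int) -> list[str]:
    """Pick roughly `rate` of production outputs for human review, reproducibly."""
    rng = random.Random(seed)
    return [output for output in outputs if rng.random() < rate]


def toy_classifier(text: str) -> str:
    # Stand-in for the real AI system under test.
    return "refund" if "money back" in text else "other"


golden = [
    GoldenCase("I want my money back", "refund"),
    GoldenCase("Where is my parcel?", "other"),
    GoldenCase("Please return my payment", "refund"),
    GoldenCase("Can I get my money back?", "refund"),
]
rate = pass_rate(toy_classifier, golden)
```

Running pass_rate on every change and storing the result gives the trend line; is_regression turns that trend into a check that can fail a build.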

3 · Boundary engineering

A non-deterministic system inside a deterministic application is a boundary problem. The model can produce anything. Your application can only accept certain things. Designing the boundary between "model output" and "stuff your application acts on" is where most AI engineering failure happens.

Boundary techniques that work in 2026:

Structured output enforcement. Use the model's native JSON schema mode (Anthropic, OpenAI, Google all support it now). Validate the schema. Reject anything that does not match and retry with feedback.
Confidence-driven action vs. escalation. Below a threshold, the system escalates to a human. Above it, the system acts and logs. The threshold is tuned per task based on real-world cost-of-error.
Idempotency on side effects. Every action the system takes can be safely re-tried. If the model produces two API calls when it should have produced one, the second one is a no-op. This is normal distributed-systems engineering applied to AI output.
Audit trails on everything. Every decision the system makes is logged with timestamp, input, model identity, prompt version, output, and downstream action. You will need this in week three when something looks off.
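
The four techniques compose into a single gate between model output and application action. The sketch below is one possible shape under stated assumptions: the schema (action, target_id, confidence) is hypothetical, validation is hand-written rather than delegated to a provider's schema mode, the model is any callable that returns text, and the audit entries omit the timestamp, model identity, and prompt version that a production log would carry.

```python
import hashlib
import json
from typing import Any, Callable, Optional


def validate(raw: str) -> tuple[Optional[dict[str, Any]], str]:
    """Return (parsed, "") when raw matches the assumed schema, else (None, reason)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    for key in ("action", "target_id"):
        if not isinstance(data.get(key), str):
            return None, f"field {key} must be a string"
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None, "field confidence must be a number"
    if not 0.0 <= confidence <= 1.0:
        return None, "field confidence must be between 0 and 1"
    return data, ""


def decide(model: Callable[[str], str], prompt: str, threshold: float,
           executed: set[str], audit: list[dict[str, Any]], max_attempts: int = 3) -> str:
    """Return "acted", "noop", or "escalated" and append every step to the audit log."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        data, error = validate(model(prompt + feedback))
        if data is None:
            # Reject and retry, telling the model what was wrong.
            audit.append({"attempt": attempt, "outcome": "rejected", "error": error})
            feedback = f"\nPrevious output was rejected: {error}. Return valid JSON only."
            continue
        if data["confidence"] < threshold:
            outcome = "escalated"            # below threshold: hand to a human
        else:
            # Idempotency key: the same action on the same target runs at most once.
            key = hashlib.sha256(f"{data['action']}:{data['target_id']}".encode()).hexdigest()
            outcome = "noop" if key in executed else "acted"
            executed.add(key)
        audit.append({"attempt": attempt, "outcome": outcome, "output": data})
        return outcome
    audit.append({"attempt": max_attempts, "outcome": "escalated", "error": "retries exhausted"})
    return "escalated"


replies = iter(["not json", '{"action": "refund", "target_id": "o-1", "confidence": 0.95}'])
executed_keys: set[str] = set()
audit_log: list[dict[str, Any]] = []
first_outcome = decide(lambda _prompt: next(replies), "classify ticket", 0.8, executed_keys, audit_log)
```

In this run the first reply is rejected, the retry validates and clears the threshold, and the action is recorded. A second identical decision would return "noop" because its idempotency key is already in executed_keys.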

4 · Cost engineering

AI infrastructure costs scale non-linearly. A small change in prompt size or model choice can move your bill 5×. Most teams discover this in production after the fact.

The discipline that prevents this:

Cost-per-request budgeting. Each AI-using feature has an explicit budget (€0.001 per support classification, €0.02 per content generation). The budget gets enforced in code.
Routing. Different models for different tasks. Cheap models for high-volume routine work, expensive models for high-stakes occasional work. (I wrote about the routing strategy I use in detail.)
Caching. Many AI calls are repeated within minutes or hours. A semantic cache (similar requests share answers) cuts costs dramatically.
Batch processing. Where latency does not matter, batch requests. APIs charge significantly less for batch operations.

Cost engineering in AI looks more like cloud cost optimisation than ML research. It is a discipline most teams discover too late.
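
To make the budgeting and routing ideas concrete, here is a small sketch that estimates the cost of a request from token counts, picks the most preferred model that fits a per-feature budget, and caches repeated prompts. The model names and per-token prices are invented for illustration, and the cache matches normalized text exactly, which is a much simpler stand-in for the semantic cache described above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelPrice:
    name: str
    eur_per_1k_input: float      # hypothetical price, not a real provider's rate
    eur_per_1k_output: float


class BudgetExceeded(Exception):
    pass


def estimate_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000 * price.eur_per_1k_input
            + output_tokens / 1000 * price.eur_per_1k_output)


def pick_model(candidates: list[ModelPrice], input_tokens: int,
               output_tokens: int, budget_eur: float) -> ModelPrice:
    """Return the first candidate, in order of preference, that fits the budget."""
    for price in candidates:
        if estimate_cost(price, input_tokens, output_tokens) <= budget_eur:
            return price
    raise BudgetExceeded(f"no model fits a budget of {budget_eur} EUR")


class ExactCache:
    """Exact-match cache on normalized prompts (not a semantic cache)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get_or_call(self, prompt: str, call: Callable[[str], str]) -> str:
        key = " ".join(prompt.lower().split())
        if key not in self._store:
            self._store[key] = call(prompt)
        return self._store[key]


STRONG = ModelPrice("strong-model", 0.005, 0.015)
CHEAP = ModelPrice("cheap-model", 0.0005, 0.0015)
PREFERENCE = [STRONG, CHEAP]     # try the strong model first, fall back on cost

# EUR 0.001 per support classification, EUR 0.02 per content generation.
classification_model = pick_model(PREFERENCE, 400, 100, budget_eur=0.001)
generation_model = pick_model(PREFERENCE, 1000, 800, budget_eur=0.02)
```

With these made-up prices the classification request costs about ten times more on the strong model than on the cheap one, so the budget forces it onto the cheap model, while the generation request still fits the strong one.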

[Figure: The four disciplines render the AI engineer's day as an engineering schematic: tool set, evaluation, boundary, cost, all with explicit tolerances.]

Where the field is actually heading

Three trends I am confident about, with timeframes attached.

Multi-model orchestration becomes the default (next 12 months)

In 2026, "use one model for everything" is already obsolete. The future is routers that pick the right model per task based on cost, latency, capability, and reliability, and that swap models cleanly when one degrades or is repriced. The router pattern I described in my AI coding tools post becomes infrastructure.
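
A router of this kind can be quite small. The sketch below is an assumption-laden illustration, not the pattern from the post mentioned above: Route and Router are hypothetical names, model calls are plain callables, cost is reduced to a rank, and a route is marked unhealthy after a single failure, where a real system would use timeouts, retries, and recovery.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    name: str
    call: Callable[[str], str]
    cost_rank: int                   # lower means cheaper
    capable_of: frozenset[str]       # task labels this model is trusted with
    healthy: bool = True


class Router:
    def __init__(self, routes: list[Route]) -> None:
        # Cheapest capable route is tried first.
        self.routes = sorted(routes, key=lambda route: route.cost_rank)

    def complete(self, task: str, prompt: str) -> tuple[str, str]:
        """Return (route name, output) from the cheapest healthy route that can do the task."""
        errors: list[str] = []
        for route in self.routes:
            if task not in route.capable_of or not route.healthy:
                continue
            try:
                return route.name, route.call(prompt)
            except Exception as exc:
                route.healthy = False        # simplistic: one failure takes it out
                errors.append(f"{route.name}: {exc}")
        raise RuntimeError(f"no route available for task {task!r}: {errors}")


def failing_model(prompt: str) -> str:
    raise TimeoutError("provider timed out")


router = Router([
    Route("strong", lambda p: f"strong:{p}", cost_rank=2,
          capable_of=frozenset({"classify", "reason"})),
    Route("cheap", failing_model, cost_rank=1, capable_of=frozenset({"classify"})),
])
used, output = router.complete("classify", "ticket 42")
```

Here the cheap route fails, is marked unhealthy, and the request falls through to the strong route. Because callers only see complete, swapping or repricing a model is a change to the route table rather than to application code.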

Evaluation tooling consolidates (next 24 months)

The evaluation-at-scale problem is unsolved enough that a new category of tooling is emerging. Expect 2–3 winners by 2028, similar to how observability consolidated around Datadog/Honeycomb/Grafana. If you are starting a project in 2026, build evaluation as a separate layer from the AI itself so you can swap the tooling later.
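
One way to keep evaluation separate from the AI itself is to put it behind a narrow interface. The following sketch uses a Python Protocol; the Evaluator interface and both implementations are hypothetical, and a vendor tool would simply be another class with the same score method.

```python
from typing import Callable, Protocol


class Evaluator(Protocol):
    def score(self, prompt: str, output: str) -> float:
        """Return a quality score between 0.0 and 1.0."""
        ...


class ExactMatchEvaluator:
    """Scores 1.0 when the output equals the known-good answer for the prompt."""

    def __init__(self, expected: dict[str, str]) -> None:
        self.expected = expected

    def score(self, prompt: str, output: str) -> float:
        return 1.0 if self.expected.get(prompt) == output else 0.0


class NonEmptyEvaluator:
    """A deliberately weak check: any non-blank output passes."""

    def score(self, prompt: str, output: str) -> float:
        return 1.0 if output.strip() else 0.0


def mean_score(system: Callable[[str], str], evaluator: Evaluator, prompts: list[str]) -> float:
    """Run the system over the prompts and average the evaluator's scores."""
    if not prompts:
        raise ValueError("no prompts to evaluate")
    return sum(evaluator.score(p, system(p)) for p in prompts) / len(prompts)


def system_under_test(prompt: str) -> str:
    # Stand-in for the real AI system.
    return prompt.upper()


prompts = ["a", "b"]
strict = mean_score(system_under_test, ExactMatchEvaluator({"a": "A", "b": "x"}), prompts)
loose = mean_score(system_under_test, NonEmptyEvaluator(), prompts)
```

The system under test never imports an evaluator, so replacing the evaluation tooling later only touches the code that calls mean_score.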

On-device AI becomes meaningful for specific products (next 36 months)

Cloud AI has won the general-purpose race. But on-device AI is genuinely better for: voice keyboards, real-time camera processing, offline modes, regulated data, latency-critical interactions. The hardware (Apple Silicon, modern Android chipsets) is finally good enough. The architectural skill is knowing which product needs the cloud and which needs the device.

If you are thinking through where AI fits in your specific product, that is a discovery sprint conversation.

What the AI engineer role looks like in 2028

This section is speculative; treat it with appropriate skepticism. The trajectory I see based on 18 months of client work:

Less model-tuning, more system design. The job becomes 70% distributed systems and product engineering, 30% AI-specific knowledge.
Evaluation engineer emerges as a distinct role. Like SRE specialized out of software engineering, evaluation engineering specializes out of AI engineering.
The "AI generalist" role splits. Right now anyone with six months of LLM experience can call themselves an AI engineer. By 2028 the role bifurcates into platform engineers (building the AI infrastructure) and product engineers (integrating it into products).
Boring infrastructure wins. The teams that ship reliable AI systems in 2028 are the ones who treated AI as a distributed-systems problem in 2026, not the ones who built the cleverest prompts.

Takeaways — what to learn this quarter

Stop chasing the latest model. Pick a current flagship, build a routing layer, move on. The competitive advantage is no longer in model selection.
Invest in evaluation early. A simple golden-dataset evaluator that runs on every change is more valuable than a sophisticated prompt. Build it in week one.
Treat tool set design as engineering. Schema-first, contract-tested, with explicit failure modes. Not "configuration."
Boundary your output. Structured output mode. Schema validation. Confidence-based escalation. Audit logs. These four together prevent 80% of AI-system production failures.
Budget for cost engineering. Add per-feature cost ceilings, semantic caching, and routing as core infrastructure. Discover AI bills early, before they discover you.
Read distributed systems papers, not AI papers. The next two years of AI engineering progress will come from applying boring distributed-systems engineering to AI output, not from inventing new model architectures.

The future of AI engineering looks more like backend engineering than like research. The teams that internalize that early will be the ones shipping working systems in 2028. The teams still chasing the latest model will be on their third migration by then.

Frequently asked

01 What is the difference between an AI engineer and an ML engineer?

ML engineers build and train models. AI engineers build systems that use models. In 2026 the ML role has largely consolidated into a small number of model labs (OpenAI, Anthropic, Google, Meta). The AI engineer role has exploded — it is now most of the practical work in the field.

02 Will AI replace software engineers?

It will replace the parts of software engineering that involve translating clear requirements into boilerplate code. It will not replace the parts that involve deciding what to build, what to ship, what to cut, and how to handle the messy boundary between systems. The engineering work moves up the abstraction stack; it does not disappear.

03 What skills should I learn in 2026 to be relevant in AI engineering?

Three things: how to write good prompts and design tool sets for agents (the new core skill), how to evaluate non-deterministic outputs at scale (the new bottleneck), and the boring fundamentals — databases, distributed systems, observability — that determine whether your AI system actually works in production.

04 Is on-device AI the future, or is everything moving to the cloud?

Both, for different problems. On-device wins where latency, privacy, or offline operation matter (mobile keyboards, voice assistants, real-time camera processing). Cloud wins where model capability matters more than constraints (code generation, complex reasoning, multi-step agents). The architectural skill is knowing which boundary applies to your product.

05 What is the biggest unsolved problem in AI engineering right now?

Evaluation at scale. We can build agents that ship work in production. We cannot yet measure their quality cheaply, repeatably, and in a way that catches regressions. Every team running AI in production today is paying this tax. The first company to ship a great evaluation framework will quietly become indispensable.

Written by Norbert Kovalčín
Independent architect · Europe · CET
I help companies own their stack instead of renting it. One client at a time.
