Essay #09 of 14 · Filed under AI & automation

AI Automation: The Practical Guide

Every week someone shows me a 20-minute video of an AI agent autonomously building a SaaS. The video always stops before anyone looks at the code. Here is what actually ships — three production patterns, real code, real cost numbers, no demos.

Published: 02 March 2026
Updated: 16 May 2026
Read time: 9 min
Words: 1,322
Tags: ai · automation · mcp


Every quarter, somebody shows me a video of an AI agent autonomously building an entire SaaS product in twenty minutes. Usually the video stops right before the part where anyone looks at the code.

This is not an essay against AI automation. I build production AI systems for clients and run three of my own: Conseto uses Claude for weekly briefs; my MCP servers (lead-scout, trend-intel, social-posts) ship daily against real workloads; and Kovrin is an open-source agent safety framework I wrote specifically because too much of this space is held together by optimism.

This is an essay about what actually ships. Three patterns I use repeatedly in client work. Code samples. Cost numbers. No demos.

What separates a demo from a system

The demos in 2026 are genuinely impressive. The reality is humbler. A working AI automation system has three properties that demos almost never demonstrate:

Structured boundaries. The model's output is validated against a schema before anything else acts on it.
Idempotency on side effects. The pipeline can crash mid-execution and re-run without producing duplicate outcomes.
Human checkpoints. Anything high-stakes routes to a review queue before execution.

These three properties are what turn "the model said something" into "the system did something correctly." Skip any of them and you ship a pipeline that produces silent bad outputs at 3am on a Sunday. Which is when you find out, because that is when an angry customer email arrives.
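
To make the second property concrete, here is a minimal sketch of an idempotent side effect. The `IdempotentRunner` class and the key format are hypothetical, and the in-memory `Map` stands in for a database table with a unique constraint on the key; a real pipeline would persist it.

```typescript
// Minimal sketch of idempotent side effects. The Map is a stand-in for a
// database table with a unique constraint on the idempotency key.
class IdempotentRunner {
  private completed = new Map<string, unknown>();

  // Runs `effect` at most once per key; repeat calls return the stored result.
  run<T>(key: string, effect: () => T): T {
    if (this.completed.has(key)) {
      return this.completed.get(key) as T;
    }
    const result = effect();
    this.completed.set(key, result);
    return result;
  }
}

// Usage: the key is derived from stable inputs (a document id and a step name).
const runner = new IdempotentRunner();
const sent: string[] = [];
const sendNotice = (): number => {
  sent.push("notice for doc-42");
  return sent.length;
};

runner.run("doc-42:notify", sendNotice);
runner.run("doc-42:notify", sendNotice); // simulated retry after a crash
```

After both calls, `sent` holds a single entry: the retry found the key already recorded and skipped the effect.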

The patterns that ship to production look boring next to the demos. Boring is the point.

Figure: PROCESS. Three schematic boxes labeled INTAKE, REVIEW, and OUTPUT, with REVIEW circled. Caption: The pattern under every shipping AI system: intake, validated review, output. The discipline is what is rendered as tolerance markings.

Pattern 1 — Audit and summarize pipelines

The problem. A client generates or receives lots of documents and somebody has to read them. Legal contracts, compliance reports, customer support tickets, research papers, meeting transcripts. The volume is high enough that nobody reads them carefully; the stakes are high enough that missing something is expensive.

The shape of the solution.

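import Anthropic from "@anthropic-ai/sdk";

// Assumed to be defined elsewhere in the project: db and the audits table (Drizzle,
// with a unique index on documentId), AuditSchema (Zod), AUDIT_SYSTEM_PROMPT, notifyReviewer.
const claude = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function auditDocument(doc: { id: string; text: string }) {
  const result = await claude.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 2048,
    system: AUDIT_SYSTEM_PROMPT,
    messages: [
      {
        role: "user",
        content: `Audit this document. Output JSON matching the schema. Document:\n\n${doc.text}`,
      },
    ],
  });

  // The first content block is not guaranteed to be text, so narrow before reading it.
  const block = result.content[0];
  if (!block || block.type !== "text") {
    throw new Error(`No text block in model response for document ${doc.id}`);
  }
  // JSON.parse and AuditSchema.parse both throw on malformed output: a hard error, not silent drift.
  const parsed = AuditSchema.parse(JSON.parse(block.text));

  const fields = {
    summary: parsed.summary,
    concerns: parsed.concerns,
    risk: parsed.risk,
    confidence: parsed.confidence,
  };

  // Upsert keyed on documentId so a re-run after a crash overwrites the row instead of duplicating it.
  await db
    .insert(audits)
    .values({ documentId: doc.id, ...fields })
    .onConflictDoUpdate({ target: audits.documentId, set: fields });

  // Compare risk by equality. String ordering would misfire here: "LOW" >= "HIGH" is true.
  // Assumes HIGH is the top risk level in AuditSchema.
  if (parsed.risk === "HIGH" || parsed.confidence < 0.7) {
    await notifyReviewer(doc.id, parsed);
  }
}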

Three things to notice.

1 · Structured output via Zod schema. Not "Claude returns English, you parse it with regex." Zod validates, so a drift from schema is a hard error you can handle.

2 · Automatic escalation on low confidence. The model self-reports confidence (prompted for it), and anything below 0.7 routes to a human reviewer. Not everything has to be 100% model-decided.

3 · Idempotent DB write. The write has to be keyed on the document ID (an upsert, not a blind insert), so that if the pipeline crashes mid-document you can re-run safely. Essential for production, because retries happen.
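
The escalation rule in point 2 can be sketched as a small pure function. The risk scale below is an assumption (the essay's `AuditSchema` is not shown), and `needsReview` is a hypothetical name; only the 0.7 confidence floor comes from the text. Ranking levels by position avoids comparing risk labels as strings, which orders them alphabetically rather than by severity.

```typescript
// Hypothetical risk scale; the actual levels in AuditSchema are not shown in the essay.
const RISK_LEVELS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"] as const;
type Risk = (typeof RISK_LEVELS)[number];

const CONFIDENCE_FLOOR = 0.7; // threshold taken from the essay

// Position in RISK_LEVELS is the severity rank: LOW = 0, CRITICAL = 3.
function riskRank(risk: Risk): number {
  return RISK_LEVELS.indexOf(risk);
}

// True when the result should go to a human: risk at or above the threshold,
// or self-reported confidence below the floor.
function needsReview(risk: Risk, confidence: number, threshold: Risk = "HIGH"): boolean {
  return riskRank(risk) >= riskRank(threshold) || confidence < CONFIDENCE_FLOOR;
}
```

With this shape, adding a level means editing one array, and the threshold can differ per client without touching the comparison logic.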

Cost reality. For Claude Sonnet 4.5, a 2,000-token input + 2,048-token output is roughly €0.012 per document. Ten thousand documents = €120. At those numbers, caching becomes critical. Anthropic's prompt caching cuts repeat-system-prompt costs by ~90%, which matters because the system prompt is usually the biggest token chunk.
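
The arithmetic behind a per-document figure can be sketched as follows. The prices are assumptions (USD 3 per million input tokens, USD 15 per million output tokens, cache reads billed at 10% of the input price), so check the current pricing page before relying on them. The function and type names are hypothetical, and the sketch ignores the surcharge on cache writes.

```typescript
// Assumed list prices in USD per million tokens; verify against current pricing.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
  cacheReadMultiplier: number; // cached input is billed at this fraction of the input price
}

const ASSUMED_SONNET_PRICING: Pricing = {
  inputPerMTok: 3,
  outputPerMTok: 15,
  cacheReadMultiplier: 0.1,
};

interface CallUsage {
  freshInputTokens: number;  // input tokens billed at full price
  cachedInputTokens: number; // input tokens read from the prompt cache
  outputTokens: number;      // tokens actually generated, not the max_tokens cap
}

function estimateCallCostUsd(u: CallUsage, p: Pricing = ASSUMED_SONNET_PRICING): number {
  const fresh = (u.freshInputTokens * p.inputPerMTok) / 1e6;
  const cached = (u.cachedInputTokens * p.inputPerMTok * p.cacheReadMultiplier) / 1e6;
  const output = (u.outputTokens * p.outputPerMTok) / 1e6;
  return fresh + cached + output;
}

// 2,000 input tokens with a short 400-token answer, no caching.
const uncached = estimateCallCostUsd({ freshInputTokens: 2000, cachedInputTokens: 0, outputTokens: 400 });
// Same call with a 1,500-token system prompt served from cache.
const mostlyCached = estimateCallCostUsd({ freshInputTokens: 500, cachedInputTokens: 1500, outputTokens: 400 });
// Same input, but the model uses the full 2,048-token output budget.
const fullOutput = estimateCallCostUsd({ freshInputTokens: 2000, cachedInputTokens: 0, outputTokens: 2048 });
```

Under these assumed prices, output tokens dominate the bill. A per-document figure near 0.012 corresponds to answers of a few hundred tokens, while a response that fills the whole 2,048-token budget costs roughly three times as much. Caching lowers only the input side, so it pays off most when the system prompt is large relative to the answer.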

Figure: PIPE. ETL pipeline diagram with INGEST, TRANSFORM, and EMIT inputs feeding one pipe with an OUT exit. Caption: The audit pattern at a higher altitude: documents enter, the model classifies, structured records exit. The pipe is what makes it idempotent.

Pattern 2 — Structured data extraction

The problem. A client has messy inputs — PDF invoices, freeform emails, scraped web pages, handwritten forms scanned as images — and needs them as clean typed records in Postgres.

The shape.

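import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

// Assumed to be defined elsewhere in the project: renderPdfToImage (returns the page as a
// base64-encoded PNG string) and extractJsonFromClaudeResponse (pulls the JSON out of the reply).
const claude = new Anthropic();

const InvoiceSchema = z.object({
  vendor: z.string(),
  invoice_number: z.string(),
  date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/), // ISO date, YYYY-MM-DD
  subtotal: z.number(),
  tax: z.number(),
  total: z.number(),
  line_items: z.array(z.object({
    description: z.string(),
    quantity: z.number(),
    unit_price: z.number(),
  })),
});

async function extractInvoice(pdfBuffer: Buffer) {
  const imageData = await renderPdfToImage(pdfBuffer);

  const result = await claude.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 4096,
    messages: [{
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: imageData } },
        { type: "text", text: `Extract invoice data as JSON matching this schema: ${JSON.stringify(zodToJsonSchema(InvoiceSchema))}` },
      ],
    }],
  });

  const json = extractJsonFromClaudeResponse(result);
  // Throws a ZodError if any field is missing or has the wrong type; the caller decides
  // whether to retry or route to human review.
  return InvoiceSchema.parse(json);
}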

Two things worth noting.

1 · Multimodal input — PDF becomes image becomes structured JSON. This replaces an entire OCR + regex + manual-review pipeline that used to take three engineers a quarter to build.

2 · Schema validation at the boundary. If Claude returns "€1,234.56" instead of the expected number 1234.56, Zod throws. You handle it explicitly — either retry with a more targeted prompt, or route to human review. What you do not do is silently accept.
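
The "retry, then route to review" policy from point 2 can be sketched as a pure decision function. Everything here is illustrative: `validateTotal` is a hand-rolled stand-in for `InvoiceSchema.safeParse` that checks a single field, and `decideNextStep` and its attempt limit are hypothetical names and values.

```typescript
type Validation =
  | { status: "valid"; value: { total: number } }
  | { status: "invalid"; errors: string[] };

type NextStep =
  | { kind: "accept"; value: { total: number } }
  | { kind: "retry"; hint: string }
  | { kind: "review"; errors: string[] };

// Stand-in for schema validation: accepts only a finite numeric `total`.
function validateTotal(raw: unknown): Validation {
  if (typeof raw !== "object" || raw === null) {
    return { status: "invalid", errors: ["response is not an object"] };
  }
  const total = (raw as Record<string, unknown>).total;
  if (typeof total !== "number" || !Number.isFinite(total)) {
    return { status: "invalid", errors: [`total must be a number, got ${JSON.stringify(total)}`] };
  }
  return { status: "valid", value: { total } };
}

// Decides what to do with one model response. `attempt` is 1-based.
function decideNextStep(raw: unknown, attempt: number, maxAttempts: number = 2): NextStep {
  const result = validateTotal(raw);
  if (result.status === "valid") {
    return { kind: "accept", value: result.value };
  }
  if (attempt < maxAttempts) {
    // Feed the validation error back so the retry prompt is targeted, not identical.
    return { kind: "retry", hint: `Previous output was invalid: ${result.errors.join("; ")}. Return plain JSON numbers.` };
  }
  return { kind: "review", errors: result.errors };
}
```

A response with `total: "€1,234.56"` yields a retry on the first attempt and a review item on the second. It is never accepted, which is the property that matters.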

When this pattern breaks. Really bad inputs. Scanned handwritten forms from the 1970s, multilingual forms where the model mistranslates currency, duplicate-headed invoices where half the fields repeat. For those, the pattern is not "better prompt engineering" — it is "human-in-the-loop queue with confidence routing." Structured extraction gets you 85% coverage reliably. The 15% still needs a human, and you build the UI for that human before you deploy the pipeline.

Pattern 3 — Workflow choreography

The problem. Multiple steps need to happen in sequence, some need model judgment, some need external API calls, some need human checkpoints. This is where people reach for "autonomous agents" and usually regret it.

The shape — what actually ships.

import { eq } from "drizzle-orm";

// Assumed to be defined elsewhere in the project: claude, db, the leads and reviewQueue tables,
// enrichLead, extractJson, QualificationSchema, notifyReviewer, buildDraftPrompt, sendEmail.

async function processLead(leadId: string) {
  // Drizzle's select returns an array of rows, so take the first one.
  const [lead] = await db.select().from(leads).where(eq(leads.id, leadId));
  if (!lead) throw new Error(`Lead ${leadId} not found`);

  // Step 1: Enrich via external APIs (deterministic, no model)
  const enriched = await enrichLead(lead);
  await db.update(leads).set({ enrichment: enriched }).where(eq(leads.id, leadId));

  // Step 2: Model judgment. Is this lead qualified?
  const qualification = await claude.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: QUALIFICATION_SYSTEM_PROMPT,
    messages: [{ role: "user", content: JSON.stringify(enriched) }],
  });
  const parsed = QualificationSchema.parse(extractJson(qualification));

  if (!parsed.qualified) {
    // Scope the update to this lead; without the where clause it would touch every row.
    await db
      .update(leads)
      .set({ status: "rejected", reason: parsed.reason })
      .where(eq(leads.id, leadId));
    return;
  }

  // Step 3: Human checkpoint. Queue for review before outbound.
  await db.insert(reviewQueue).values({
    leadId,
    suggestedTemplate: parsed.suggestedTemplate,
    confidence: parsed.confidence,
  });

  await notifyReviewer({ leadId, context: parsed });
  // Execution stops here. Human reviews, then triggers step 4 manually.
}

async function sendApprovedOutbound(leadId: string, overrides: OutboundOverrides) {
  // Step 4: runs only after human approval in the review UI
  const [lead] = await db.select().from(leads).where(eq(leads.id, leadId));
  if (!lead) throw new Error(`Lead ${leadId} not found`);
  // Re-run guard: if an earlier attempt already recorded the send, do nothing.
  if (lead.status === "contacted") return;

  const draft = await claude.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 2048,
    messages: [{ role: "user", content: buildDraftPrompt(lead, overrides) }],
  });
  const block = draft.content[0];
  if (!block || block.type !== "text") throw new Error(`No text draft for lead ${leadId}`);

  await sendEmail({ to: lead.email, body: block.text });
  await db
    .update(leads)
    .set({ status: "contacted", contactedAt: new Date() })
    .where(eq(leads.id, leadId));
}

What is different from the "autonomous agent" narrative:

1 · Explicit step boundaries. Each step is its own function, idempotent, logged. No infinite retry loops, no runaway context accumulation.

2 · Deterministic work stays deterministic. External API calls (enrichment) do not go through a model. Database updates do not go through a model. The model is used only for judgment calls.

3 · Human checkpoint is a hard gate. Step 4 does not auto-trigger. The execution literally stops until a human opens the review UI and approves. The model is suggesting, not acting.

4 · State lives in Postgres. Not in an agent's working memory. Agents crash. Postgres does not. Every step commits state.
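
One way to make the hard gate in point 3 enforceable rather than conventional is an explicit table of allowed status transitions. This is a sketch under assumptions: only "rejected" and "contacted" appear in the code above; the other status names and the helper names are hypothetical.

```typescript
// Hypothetical status names, except "rejected" and "contacted", which appear in the essay's code.
type LeadStatus = "new" | "enriched" | "pending_review" | "approved" | "rejected" | "contacted";

// Each status lists the statuses it may move to. "contacted" is reachable only from "approved".
const ALLOWED_TRANSITIONS: Record<LeadStatus, LeadStatus[]> = {
  new: ["enriched"],
  enriched: ["pending_review", "rejected"],
  pending_review: ["approved", "rejected"],
  approved: ["contacted"],
  rejected: [],
  contacted: [],
};

function canTransition(from: LeadStatus, to: LeadStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new status, or throws if the move skips a step such as human approval.
function transition(from: LeadStatus, to: LeadStatus): LeadStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal lead transition: ${from} -> ${to}`);
  }
  return to;
}
```

In Postgres the same rule can be expressed by making the update conditional on the current status, so an outbound step that runs before approval changes zero rows instead of sending anything.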

If you are scoping an AI automation project and want help mapping it onto this kind of structure before you start building, that is exactly what a discovery sprint produces.

Figure: CONSTRAIN. A central square with four inward arrows labeled COST, TIME, SCOPE, and TEAM. Caption: The four boundaries that make autonomous-agent demos shippable. Without all four, the system either loops or hallucinates.

The pattern that is not here

Fully-autonomous agents that "figure it out" across multi-hour tasks.

I have built them as prototypes. I have had them ship into demos. I have never shipped one into production for a client where real money or real people were on the line. The failure modes — loops, context pollution, silent bad outputs — are too frequent at current model capabilities, and the monitoring story is too immature.

If you want to ship AI into production, the honest pattern is: narrow, structured, human-checkpointed, idempotent. The demos are fine. The shipping systems look boring.

Cost control

Three practices that save real money:

1 · Prompt cache aggressively. Anthropic's cache cuts repeat-context input costs by roughly 90%. System prompts, schema definitions, long examples: all of them cache. Measure hit rate in production.

2 · Fall back to smaller models where appropriate. Claude Haiku, at a fraction of Sonnet's per-token price (how large a fraction depends on the Haiku version, so check current pricing), is fine for simple classification. Reserve Sonnet (or Opus) for actual judgment.

3 · Hard spend caps. In Anthropic's console you can set spend limits. Set them. The one-in-a-thousand runaway loop is the €12,000-bill email from your ops team.
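
Two of these practices can be measured or enforced in application code as well. The sketch below is illustrative: the `Usage` field names follow the usage object returned by Anthropic's Messages API (simplified to plain numbers), while `cacheHitRate` and `SpendGuard` are hypothetical helpers. The in-process guard is a second line of defense, not a replacement for provider-side limits.

```typescript
// Field names follow the Messages API usage object; the API may report nulls,
// which this sketch assumes have already been normalized to 0.
interface Usage {
  input_tokens: number;
  cache_read_input_tokens: number;
  cache_creation_input_tokens: number;
}

// Share of all input tokens that were served from the prompt cache.
function cacheHitRate(usages: Usage[]): number {
  let read = 0;
  let total = 0;
  for (const u of usages) {
    read += u.cache_read_input_tokens;
    total += u.input_tokens + u.cache_read_input_tokens + u.cache_creation_input_tokens;
  }
  return total === 0 ? 0 : read / total;
}

// Tracks estimated spend and refuses requests that would exceed the cap.
class SpendGuard {
  private spent = 0;
  private readonly cap: number;

  constructor(cap: number) {
    this.cap = cap;
  }

  // Call before each model request with that request's estimated cost.
  reserve(estimatedCost: number): void {
    if (this.spent + estimatedCost > this.cap) {
      throw new Error(`Spend cap of ${this.cap} would be exceeded`);
    }
    this.spent += estimatedCost;
  }

  get total(): number {
    return this.spent;
  }
}
```

A runaway loop then fails on the first call past the cap instead of running until someone notices. The guard resets on process restart, so a production version would keep the running total in the database.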

Takeaways — your AI automation playbook for 2026

Pick narrow problems. Audit, extract, choreograph — not "build me an autonomous marketing agent."
Validate at the boundary. Zod schema on every model output. Hard error, not silent drift.
Keep humans in the loop. Anything user-facing, high-stakes, or expensive to reverse goes through a review queue.
State in Postgres. Agent memory is not a database.
Cache, fall back, cap. Money does not take care of itself.
Build the human UI first. The 15% that needs a reviewer is the part most teams forget to design. Do it before the pipeline ships.
Instrument every step. Trace logs, confidence scores, escalation rates. You will need them in month two.

None of this is as fun as the Twitter demos. All of it ships.

Frequently asked

01 · What is the most reliable AI automation pattern in 2026?

Structured extraction with Zod schema validation at the boundary. The model returns JSON, your code validates it, and anything that fails schema parsing routes to a human reviewer. This pattern ships at ~85% automation coverage reliably across document processing, lead qualification, and ticket triage.

02 · Are fully autonomous AI agents production-ready?

Not yet for high-stakes work. The failure modes — context loops, silent bad outputs, monitoring gaps — are too frequent at current model capabilities. Narrow, structured, human-checkpointed pipelines are the honest production pattern in 2026. Demos look impressive. Shipped systems look boring.

03 · How much does Claude or GPT API actually cost in production?

Claude Sonnet 4.5 lands around €0.012 per document for a 2,000-input + 2,048-output token call. Ten thousand documents per month is ~€120. With prompt caching enabled (~90% cost reduction on repeat context) and proper model routing, real costs are often 5–10× lower than naive estimates.

04 · What is MCP and why does it matter?

MCP (Model Context Protocol) is the emerging standard for letting AI models call external tools — your database, your APIs, your custom logic — through a structured interface. It is what turns a chatbot into an actual automation system. I run three MCP servers in production: lead-scout, trend-intel, social-posts.

05 · What happens when the model returns garbage?

Three layers: schema validation at the boundary (Zod throws on invalid output), confidence threshold escalation (anything below 0.7 routes to human review), and idempotent state writes (a crashed run can re-execute safely). Skip any of the three and you ship a system that produces silent bad outputs in production.

Written by Norbert Kovalčín · Independent architect · Europe · CET
I help companies own their stack instead of renting it. One client at a time.


