AI Automation: The Practical Guide

Every quarter, somebody shows me a video of an AI agent autonomously building an entire SaaS product in twenty minutes. Usually the video stops right before the part where anyone looks at the code.
This is not an essay against AI automation. I build production AI systems for clients and run three of my own: Conseto uses Claude for weekly briefs; my MCP servers (lead-scout, trend-intel, social-posts) ship daily against real workloads; and Kovrin is an open-source agent safety framework I wrote specifically because too much of this space is held together by optimism.
This is an essay about what actually ships. Three patterns I use repeatedly in client work. Code samples. Cost numbers. No demos.
What separates a demo from a system
The demos in 2026 are genuinely impressive. The reality is humbler. A working AI automation system has three properties that demos almost never demonstrate:
- Structured boundaries. The model's output is validated against a schema before anything else acts on it.
- Idempotency on side effects. The pipeline can crash mid-execution and re-run without producing duplicate outcomes.
- Human checkpoints. Anything high-stakes routes to a review queue before execution.
These three properties are what turn "the model said something" into "the system did something correctly." Skip any of them and you ship a pipeline that produces silent bad outputs at 3am on a Sunday. Which is when you find out, because that is when an angry customer email arrives.
The patterns that ship to production look boring next to the demos. Boring is the point.
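
To make the second property concrete, here is a minimal sketch of idempotency on a side effect. It assumes an in-memory Map standing in for a database table with a unique key; `SideEffectLog`, `runOnce`, and the key format are hypothetical names for illustration, not part of any library.

```typescript
// Hypothetical in-memory stand-in for a table with a unique key column.
class SideEffectLog<T> {
  private readonly done = new Map<string, T>();

  // Runs `effect` at most once per key; a retry returns the stored result.
  // If `effect` throws, nothing is stored, so a later retry runs it again.
  runOnce(key: string, effect: () => T): T {
    if (this.done.has(key)) {
      return this.done.get(key) as T;
    }
    const result = effect();
    this.done.set(key, result);
    return result;
  }
}

let auditWrites = 0;
const log = new SideEffectLog<string>();
const writeAudit = (): string => {
  auditWrites += 1;
  return `audit #${auditWrites}`;
};

// Simulate a crash followed by a re-run of the same document.
const first = log.runOnce("audit:doc-42", writeAudit);
const retry = log.runOnce("audit:doc-42", writeAudit);
console.log(first, retry, auditWrites); // audit #1 audit #1 1
```

In a real pipeline the Map would be a unique constraint or an idempotency-key column, so the guarantee survives a process restart.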

Pattern 1 — Audit and summarize pipelines
The problem. A client generates or receives lots of documents and somebody has to read them. Legal contracts, compliance reports, customer support tickets, research papers, meeting transcripts. The volume is high enough that nobody reads them carefully; the stakes are high enough that missing something is expensive.
The shape of the solution.
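    import Anthropic from "@anthropic-ai/sdk";

    // Reads ANTHROPIC_API_KEY from the environment. `db`, `audits`, AuditSchema,
    // AUDIT_SYSTEM_PROMPT and notifyReviewer are assumed to be defined elsewhere.
    const claude = new Anthropic();

    async function auditDocument(doc: { id: string; text: string }) {
      const result = await claude.messages.create({
        model: "claude-sonnet-4-5",
        max_tokens: 2048,
        system: AUDIT_SYSTEM_PROMPT,
        messages: [
          {
            role: "user",
            content: `Audit this document. Output JSON matching the schema. Document:\n\n${doc.text}`,
          },
        ],
      });

      // Content blocks are a union type, so confirm the first one is text.
      const block = result.content[0];
      if (block?.type !== "text") throw new Error("Expected a text block");

      // JSON.parse and AuditSchema.parse both throw on malformed output.
      const parsed = AuditSchema.parse(JSON.parse(block.text));

      // Idempotent write: re-running the same document is a no-op.
      // Assumes a unique constraint on audits.documentId.
      await db
        .insert(audits)
        .values({
          documentId: doc.id,
          summary: parsed.summary,
          concerns: parsed.concerns,
          risk: parsed.risk,
          confidence: parsed.confidence,
        })
        .onConflictDoNothing({ target: audits.documentId });

      // Compare the enum value directly; `>=` on strings is alphabetical.
      if (parsed.risk === "HIGH" || parsed.confidence < 0.7) {
        await notifyReviewer(doc.id, parsed);
      }
    }

Three things to notice.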
1 · Structured output via Zod schema. Not "Claude returns English, you parse it with regex." Zod validates, so a drift from schema is a hard error you can handle.
2 · Automatic escalation on low confidence. The prompt asks the model to report its own confidence, and anything below 0.7 routes to a human reviewer. Not everything has to be 100% model-decided.
3 · Idempotent DB write. With a unique constraint on the document ID and an insert that skips conflicts, re-running after a mid-document crash does not create duplicate rows. Essential for production, because retries happen.
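
One way to express the escalation rule from point 2 is a small pure function. This is a sketch under assumptions: the risk levels below are hypothetical (the actual enum values are not listed here), and the 0.7 threshold is the one used above. Ranking by array position keeps the comparison ordinal, since JavaScript compares strings alphabetically and would treat "LOW" as greater than "HIGH".

```typescript
// Hypothetical risk levels, ordered from least to most severe.
const RISK_LEVELS = ["LOW", "MEDIUM", "HIGH"] as const;
type Risk = (typeof RISK_LEVELS)[number];

interface AuditResult {
  risk: Risk;
  confidence: number; // self-reported by the model, 0..1
}

// Rank by position in RISK_LEVELS so comparisons are ordinal, not alphabetical.
function riskRank(risk: Risk): number {
  return RISK_LEVELS.indexOf(risk);
}

// Escalate when risk is at or above `minRisk`, or confidence is below the threshold.
function shouldEscalate(
  audit: AuditResult,
  minRisk: Risk = "HIGH",
  minConfidence = 0.7,
): boolean {
  return riskRank(audit.risk) >= riskRank(minRisk) || audit.confidence < minConfidence;
}

const lowRiskConfident = shouldEscalate({ risk: "LOW", confidence: 0.95 }); // false
const highRisk = shouldEscalate({ risk: "HIGH", confidence: 0.95 }); // true
const lowRiskUnsure = shouldEscalate({ risk: "LOW", confidence: 0.5 }); // true
```

Keeping the rule in one function also makes the thresholds easy to log and tune once escalation rates are visible.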
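Cost reality. For Claude Sonnet 4.5 at list prices, a 2,000-token input plus a few hundred tokens of JSON output comes to roughly €0.012 per document; max_tokens: 2048 is a ceiling, and a reply that actually fills it costs closer to €0.03. Ten thousand documents at the typical size is about €120. At that volume, caching starts to matter. Anthropic's prompt caching cuts the cost of a repeated system prompt by ~90%, which matters because the system prompt is usually the biggest chunk of input tokens.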
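
The arithmetic can be sketched as below. Everything numeric here is an assumption to be checked against current pricing: list prices of $3 and $15 per million input and output tokens, cached input billed at about a tenth of the input price, a 1,500/500 token split between system prompt and document, and a 400-token JSON reply. The sketch works in US dollars and ignores the small premium charged when a cache entry is first written.

```typescript
// Assumed list prices in USD per million tokens; verify before relying on them.
const PRICE = {
  inputPerMTok: 3.0,
  outputPerMTok: 15.0,
  cacheReadMultiplier: 0.1, // cached input billed at ~10% of the input price
};

interface Usage {
  freshInputTokens: number; // input tokens billed at full price
  cachedInputTokens: number; // input tokens served from the prompt cache
  outputTokens: number;
}

function costUsd(u: Usage): number {
  const fresh = (u.freshInputTokens * PRICE.inputPerMTok) / 1e6;
  const cached =
    (u.cachedInputTokens * PRICE.inputPerMTok * PRICE.cacheReadMultiplier) / 1e6;
  const output = (u.outputTokens * PRICE.outputPerMTok) / 1e6;
  return fresh + cached + output;
}

// Hypothetical split: 1,500-token system prompt, 500-token document, 400-token reply.
const uncached = costUsd({ freshInputTokens: 2000, cachedInputTokens: 0, outputTokens: 400 });
const cached = costUsd({ freshInputTokens: 500, cachedInputTokens: 1500, outputTokens: 400 });

// Scale to a 10,000-document batch.
const per10k = { uncached: uncached * 10_000, cached: cached * 10_000 };
console.log(per10k);
```

Under these assumptions caching trims the input side sharply, but output tokens are billed at full price either way, so a compact response schema matters as much as a cached system prompt.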

Pattern 2 — Structured data extraction
The problem. A client has messy inputs — PDF invoices, freeform emails, scraped web pages, handwritten forms scanned as images — and needs them as clean typed records in Postgres.
The shape.
    import { z } from "zod";
    import { zodToJsonSchema } from "zod-to-json-schema";

    const InvoiceSchema = z.object({
      vendor: z.string(),
      invoice_number: z.string(),
      date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/), // ISO date, YYYY-MM-DD
      subtotal: z.number(),
      tax: z.number(),
      total: z.number(),
      line_items: z.array(
        z.object({
          description: z.string(),
          quantity: z.number(),
          unit_price: z.number(),
        }),
      ),
    });

    async function extractInvoice(pdfBuffer: Buffer) {
      // renderPdfToImage is assumed to return a base64-encoded PNG.
      const imageData = await renderPdfToImage(pdfBuffer);
      const result = await claude.messages.create({
        model: "claude-sonnet-4-5",
        max_tokens: 4096,
        messages: [
          {
            role: "user",
            content: [
              { type: "image", source: { type: "base64", media_type: "image/png", data: imageData } },
              {
                type: "text",
                text: `Extract invoice data as JSON matching this schema: ${JSON.stringify(zodToJsonSchema(InvoiceSchema))}`,
              },
            ],
          },
        ],
      });
      // Pull the JSON payload out of the reply, then validate it.
      // InvoiceSchema.parse throws on any mismatch.
      const json = extractJsonFromClaudeResponse(result);
      return InvoiceSchema.parse(json);
    }

Two things worth noting.
1 · Multimodal input — PDF becomes image becomes structured JSON. This replaces an entire OCR + regex + manual-review pipeline that used to take three engineers a quarter to build.
2 · Schema validation at the boundary. If Claude returns "€1,234.56" instead of the expected number 1234.56, Zod throws. You handle it explicitly — either retry with a more targeted prompt, or route to human review. What you do not do is silently accept.
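
The extraction helper called in the code above is not shown. A minimal, dependency-free sketch of what such a helper might do with the reply text follows; `extractJsonFromText` is a hypothetical name, and the approach (slice from the first `{` to the last `}`) is a simplification that will fail if the surrounding prose itself contains braces.

```typescript
// Hypothetical helper: pull a JSON object out of a model's text reply.
// Handles bare JSON, JSON inside a Markdown code fence, and JSON surrounded
// by prose. Throws if no parseable object is found.
function extractJsonFromText(text: string): unknown {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in model output");
  }
  // JSON.parse throws a SyntaxError on malformed JSON; let it propagate.
  return JSON.parse(text.slice(start, end + 1));
}

const reply =
  'Here is the data:\n{"vendor": "Acme", "total": 1234.56}\nLet me know if you need more.';

// The return type is `unknown` on purpose: the value has to pass schema
// validation before the rest of the pipeline may touch it.
const data = extractJsonFromText(reply);
console.log(data);
```

Returning `unknown` rather than a typed object is the design choice that matters: the compiler then forces every caller through the schema check.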
When this pattern breaks. Really bad inputs. Scanned handwritten forms from the 1970s, multilingual forms where the model mistranslates currency, duplicate-headed invoices where half the fields repeat. For those, the pattern is not "better prompt engineering" — it is "human-in-the-loop queue with confidence routing." Structured extraction gets you 85% coverage reliably. The 15% still needs a human, and you build the UI for that human before you deploy the pipeline.
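
The routing decision described above can be sketched as a pure function. The names, the retry budget of two attempts, and the 0.7 threshold are assumptions for illustration; the invoice schema shown earlier carries no confidence field, so a real pipeline would have to prompt for one or derive it some other way.

```typescript
type Route = "accept" | "retry" | "human_review";

interface ExtractionAttempt {
  schemaValid: boolean; // did the output pass schema validation?
  confidence: number; // hypothetical self-reported score, 0..1
  attempt: number; // 1 for the first try
}

const MAX_ATTEMPTS = 2; // assumed retry budget
const MIN_CONFIDENCE = 0.7; // assumed threshold, mirroring Pattern 1

function routeExtraction(a: ExtractionAttempt): Route {
  if (!a.schemaValid) {
    // Invalid output: retry with a more targeted prompt until the budget runs out.
    return a.attempt < MAX_ATTEMPTS ? "retry" : "human_review";
  }
  // Valid but uncertain output still goes to a person.
  return a.confidence >= MIN_CONFIDENCE ? "accept" : "human_review";
}

const clean = routeExtraction({ schemaValid: true, confidence: 0.93, attempt: 1 }); // "accept"
const firstFailure = routeExtraction({ schemaValid: false, confidence: 0, attempt: 1 }); // "retry"
const secondFailure = routeExtraction({ schemaValid: false, confidence: 0, attempt: 2 }); // "human_review"
const unsure = routeExtraction({ schemaValid: true, confidence: 0.4, attempt: 1 }); // "human_review"
```

Every branch ends in an explicit route, so there is no path where a bad extraction is silently accepted.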
Pattern 3 — Workflow choreography
The problem. Multiple steps need to happen in sequence: some need model judgment, some need external API calls, some need human checkpoints. This is where people reach for "autonomous agents" and usually regret it.
The shape — what actually ships.
    import { eq } from "drizzle-orm";

    async function processLead(leadId: string) {
      // Drizzle selects return an array, so take the first row.
      const [lead] = await db.select().from(leads).where(eq(leads.id, leadId));
      if (!lead) throw new Error(`Lead ${leadId} not found`);

      // Step 1: Enrich via external APIs (deterministic, no model)
      const enriched = await enrichLead(lead);
      await db.update(leads).set({ enrichment: enriched }).where(eq(leads.id, leadId));

      // Step 2: Model judgment. Is this lead qualified?
      const qualification = await claude.messages.create({
        model: "claude-sonnet-4-5",
        max_tokens: 1024,
        system: QUALIFICATION_SYSTEM_PROMPT,
        messages: [{ role: "user", content: JSON.stringify(enriched) }],
      });
      const parsed = QualificationSchema.parse(extractJson(qualification));
      if (!parsed.qualified) {
        // Scope the update to this lead; without `where` it would hit every row.
        await db
          .update(leads)
          .set({ status: "rejected", reason: parsed.reason })
          .where(eq(leads.id, leadId));
        return;
      }

      // Step 3: Human checkpoint. Queue for review before outbound.
      await db.insert(reviewQueue).values({
        leadId,
        suggestedTemplate: parsed.suggestedTemplate,
        confidence: parsed.confidence,
      });
      await notifyReviewer({ leadId, context: parsed });
      // Execution stops here. Human reviews, then triggers step 4 manually.
    }

    async function sendApprovedOutbound(leadId: string, overrides: OutboundOverrides) {
      // Step 4: runs only after human approval in the review UI
      const [lead] = await db.select().from(leads).where(eq(leads.id, leadId));
      if (!lead) throw new Error(`Lead ${leadId} not found`);
      // Idempotency guard: a retry must not email the same lead twice.
      if (lead.status === "contacted") return;

      const draft = await claude.messages.create({
        model: "claude-sonnet-4-5",
        max_tokens: 2048,
        messages: [{ role: "user", content: buildDraftPrompt(lead, overrides) }],
      });
      const block = draft.content[0];
      if (block?.type !== "text") throw new Error("Expected a text block");

      await sendEmail({ to: lead.email, body: block.text });
      await db
        .update(leads)
        .set({ status: "contacted", contactedAt: new Date() })
        .where(eq(leads.id, leadId));
    }

What is different from the "autonomous agent" narrative:
1 · Explicit step boundaries. Each step is its own function, idempotent, logged. No infinite retry loops, no runaway context accumulation.
2 · Deterministic work stays deterministic. External API calls (enrichment) do not go through a model. Database updates do not go through a model. The model is used only for judgment calls.
3 · Human checkpoint is a hard gate. Step 4 does not auto-trigger. The execution literally stops until a human opens the review UI and approves. The model is suggesting, not acting.
4 · State lives in Postgres. Not in an agent's working memory. Agents crash. Postgres does not. Every step commits state.
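
The hard gate in point 3 can be made explicit as a transition table over the lead's stored status. This is a sketch with hypothetical status and event names (only "rejected" and "contacted" appear in the code above); the idea is that the only path to "contacted" runs through a human approval, and anything else throws.

```typescript
// Hypothetical statuses and events for a lead moving through the workflow.
type LeadStatus = "new" | "pending_review" | "approved" | "rejected" | "contacted";
type LeadEvent =
  | "qualified"
  | "disqualified"
  | "human_approved"
  | "human_rejected"
  | "email_sent";

// Allowed transitions. Anything not listed here is an error.
const TRANSITIONS: Record<LeadStatus, Partial<Record<LeadEvent, LeadStatus>>> = {
  new: { qualified: "pending_review", disqualified: "rejected" },
  pending_review: { human_approved: "approved", human_rejected: "rejected" },
  approved: { email_sent: "contacted" },
  rejected: {},
  contacted: {},
};

function transition(current: LeadStatus, event: LeadEvent): LeadStatus {
  const next = TRANSITIONS[current][event];
  if (next === undefined) {
    throw new Error(`Illegal transition: ${event} from ${current}`);
  }
  return next;
}

// The model can move a lead into the review queue, but no further.
const afterModel = transition("new", "qualified"); // "pending_review"
const afterHuman = transition(afterModel, "human_approved"); // "approved"
const afterSend = transition(afterHuman, "email_sent"); // "contacted"
```

Checking the transition before each database update means a retried or out-of-order step fails loudly instead of sending a second email.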
If you are scoping an AI automation project and want help mapping it onto this kind of structure before you start building, that is exactly what a discovery sprint produces.

The pattern that is not here
Fully-autonomous agents that "figure it out" across multi-hour tasks.
I have built them as prototypes. I have had them ship into demos. I have never shipped one into production for a client where real money or real people were on the line. The failure modes — loops, context pollution, silent bad outputs — are too frequent at current model capabilities, and the monitoring story is too immature.
If you want to ship AI into production, the honest pattern is: narrow, structured, human-checkpointed, idempotent. The demos are fine. The shipping systems look boring.
Cost control
Three practices that save real money:
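1 · Prompt cache aggressively. Anthropic's prompt cache bills cached input tokens at roughly a tenth of the normal input price, so repeat-context costs drop by about 90% (cache writes carry a small premium). System prompts, schema definitions, long examples: all of them cache. Measure hit rate in production.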
2 · Fall back to smaller models where appropriate. Claude Haiku, at a fraction of Sonnet's price (roughly a third to a tenth, depending on the Haiku generation), is fine for simple classification. Reserve Sonnet (or Opus) for actual judgment.
3 · Hard spend caps. Anthropic's console lets you set spend limits. Set them. The one-in-a-thousand runaway loop is the €12,000-bill email from your ops team.
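
Practices 2 and 3 can also be enforced in application code. The sketch below is an assumption-laden illustration: the Haiku model ID is a guess at the naming scheme, the task categories are hypothetical, and `SpendGuard` is an in-process complement to provider-side limits, not a replacement for them.

```typescript
// The Sonnet ID appears earlier in this article; the Haiku ID is an assumption.
type TaskKind = "classification" | "judgment";

function pickModel(kind: TaskKind): string {
  return kind === "classification" ? "claude-haiku-4-5" : "claude-sonnet-4-5";
}

// Hypothetical in-process budget guard.
class SpendGuard {
  private spent = 0;
  private readonly capEur: number;

  constructor(capEur: number) {
    this.capEur = capEur;
  }

  // Call before each request with a cost estimate.
  // Throws instead of letting the running total exceed the cap.
  charge(estimatedEur: number): void {
    if (this.spent + estimatedEur > this.capEur) {
      throw new Error(`Spend cap of ${this.capEur} EUR reached`);
    }
    this.spent += estimatedEur;
  }

  get total(): number {
    return this.spent;
  }
}

const guard = new SpendGuard(1.0);
guard.charge(0.4);
guard.charge(0.4);

// A third charge would push the total past the cap, so it is refused.
let blocked = false;
try {
  guard.charge(0.4);
} catch {
  blocked = true;
}
```

A guard like this only protects a single process; the running total would need to live in shared storage to cover several workers.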
Takeaways — your AI automation playbook for 2026
- Pick narrow problems. Audit, extract, choreograph — not "build me an autonomous marketing agent."
- Validate at the boundary. Zod schema on every model output. Hard error, not silent drift.
- Keep humans in the loop. Anything user-facing, high-stakes, or expensive to reverse goes through a review queue.
- State in Postgres. Agent memory is not a database.
- Cache, fall back, cap. Money does not take care of itself.
- Build the human UI first. The 15% that needs a reviewer is the part most teams forget to design. Do it before the pipeline ships.
- Instrument every step. Trace logs, confidence scores, escalation rates. You will need them in month two.
None of this is as fun as the Twitter demos. All of it ships.