How to build AI agents that ship real work (not slop)

We spent 250 hours building a multi-agent OpenClaw workflow engine. Here’s everything we wish we’d known when we started.

Lance Jones

AI Engineer, Toyo

Gavin Belson

Lance’s AI agent & orchestrator

If you’ve tried to make an open agent framework like OpenClaw do real work, you already know: the install is the easy part. Customizing, maintaining, and scaling a system that works reliably every day quickly becomes a full-time job.

We pushed OpenClaw to the limit and built a multi-agent workflow engine that now helps our team do real work: competitive intelligence, SEO research, content, and growth experiments. We use it every day.

Want to build a system like this? Read this guide (or share it with your agent). Our implementation is OpenClaw-specific, but the decisions below apply to any agent framework: Hermes, IronClaw, Claude Code, a custom LangChain build, or whatever ships next.

What this guide covers

  1. One Prompt vs. an Orchestrated System
  2. The blueprint to build your own workflow engine
  3. The five architecture decisions your system needs to do real work:
    1. How do you break complex work into agent-sized steps?
    2. What should the LLM decide vs. what should be scripted?
    3. Should your agents be persistent or disposable?
    4. How do you give agents enough context without drowning them?
    5. How do you validate output when the worker is an LLM?
  4. A deep dive into our competitive intelligence pipeline: live API data, browser automation, structured output, and QA loops
  5. Tips on dos (and don’ts) when getting started with OpenClaw
Proof of Work

Check out what a workflow engine can do

See an example competitive intelligence report synthesized from real data on three of the most popular project management tools: Basecamp, Asana and Trello:

  • Accurate pricing & product information scraped from live websites
  • Real SEO data via Ahrefs
  • Real community sentiment from Reddit and X

A comparable analysis from a strategy consultancy could cost a business $15,000–50,000 and take 4–6 weeks. This one took four hours and cost under $20.

Sample Report

Industry Intelligence: Project Management Incumbents

Toyo + OpenClaw

See the full report

You can’t get this from one prompt

Two approaches to the same task: using an LLM directly (ChatGPT, Claude) vs. a pre-defined, multi-agent workflow with tool calls.

LLM only

Using ChatGPT or Claude

  • Data: Traffic estimates off by 8–13× (guessed Asana at 500K, actual 3.9M)
  • Pricing: Hallucinated numbers (Basecamp listed at $99/mo, actual $299)
  • Sentiment: No community signal — missed the Trello redesign backlash entirely
  • Consistency: Inconsistent format every run. Can’t compare across competitors
  • Quality: No validation. No rubric. No way to know what’s wrong
Orchestrated system

Using an Orchestrated System

  • Data: Live Ahrefs API calls — verified traffic numbers (261K, 3.9M, 2.6M)
  • Pricing: Browser scraping fetches live pricing ($299/mo, $24.99/user/mo)
  • Sentiment: Reddit + X APIs surface live sentiment and breaking news
  • Consistency: Deterministic tool calls: same 5 Ahrefs endpoints, rigid output schema
  • Quality: Rubric-scored QA loop with a feedback-driven self-improvement pass
How to get started

Setting up your agent framework

We only recommend attempting this if you’re a developer, or have technical skills and a lot of free time on your hands. We built ours on OpenClaw, so we wrote up a dedicated setup guide covering everything we wished we’d known when we started:

  • Installing OpenClaw (npm, Homebrew, or source build)
  • Choosing the right models and managing API costs
  • Connecting MCP servers (Ahrefs, Playwright, Brave Search, and more)
  • Hardware and deployment (what we run in production)
  • The mistakes that cost us real money
Read our OpenClaw setup guide →

Want the easy version?

If you don’t have the time or expertise to build a system like this yourself, we’re building Toyo to handle all of this for you.

The Blueprint

How to build a multi-agent workflow engine

Once you’re past setup — framework installed, real tool calls working, a model strategy that doesn’t drain your wallet — the instinct is to jump straight into the shiny parts: multi-agent orchestration, self-improving pipelines, the full stack. We’ve watched teams do this and ship systems that demo brilliantly and collapse under real workloads. Start smaller than you think.

Build it one layer at a time, in the order below. Skip a layer and you’ll ship something that demos well and fails in production. Don’t add the next layer until the current one works.

Start with a Playbook

LLMs are non-deterministic: hand the same model a fuzzy prompt twice and you’ll get two different answers. Each agent in our system gets a playbook: a markdown file with task instructions, output schema, and examples. It’s the contract between the system and the agent, and it’s the reason the next run behaves like the last one.

When QA fails, you don’t debug the LLM, you debug the playbook. No playbook, no repeatability.


  1. Pick one workflow

    Not "marketing automation" but one specific, repeatable task: "generate a competitor profile," "draft a weekly sales update," "produce a keyword research brief." The narrower the task, the faster the playbook loop.

  2. One agent, one tool, one output

    Can you get consistent structured output? Run it 3 times on 3 inputs. If the structure differs, your playbook isn't specific enough.

  3. Two agents, one handoff

    Research Agent → research.md → Synthesis Agent → profile.md. Can agent B reliably consume agent A's output? Output schemas are the API contract.

  4. Add QA

    A QA agent scores against a rubric. On fail, re-run with failure feedback. Every QA failure is a playbook bug report.

  5. Add a human gate

    Kill bad analysis at minute 2 instead of minute 40. A human glancing at a scope document is the cheapest QA you have.

  6. Add a second data source

    One agent with two tools or two agents with one tool each? Almost always two agents. This is where ephemeral architecture proves its worth.
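The "run it 3 times on 3 inputs" check from step 2 can be automated. The sketch below is one possible way to do it, assuming outputs are markdown with headings and two-column tables like the profile template shown later in this guide; `structure_of` and `same_structure` are hypothetical helper names, not part of OpenClaw.

```python
import re


def structure_of(markdown: str) -> list[str]:
    """Return the ordered headings and table field names of a markdown output."""
    shape = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("#"):
            shape.append(line)
        elif line.startswith("|") and not re.fullmatch(r"\|[\s\-:|]+", line):
            # The first cell of a table row is the field name; values are ignored.
            first_cell = line.strip("|").split("|")[0].strip()
            shape.append("field:" + first_cell)
    return shape


def same_structure(outputs: list[str]) -> bool:
    """True when every output has the same headings and fields in the same order."""
    shapes = [structure_of(text) for text in outputs]
    return all(shape == shapes[0] for shape in shapes)
```

If `same_structure` returns False for runs of the same playbook, that points at the playbook rather than the model: the output template probably leaves too much open.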

Common mistakes to avoid

Running persistent agents from day one

Start ephemeral, add persistence later if you actually need it.

Letting agents choose their own tools

Specify exact tool names in your playbook.

Skipping output schemas

Without a rigid template, every run produces a different format.

No QA step

Add one the first time you catch a bad output manually.

Five decisions that make or break your system

The rest of this guide is built around five decisions. Get these right, and your system will be able to deliver real work. Get them wrong and you’ll burn tokens on slop.

These decisions apply whether you’re orchestrating agents directly or through a coordination layer like Paperclip.

Using our competitive intelligence workflow as a concrete example, let’s dive in.

Decision #1

How do you break complex work into agent-sized steps?

The difference between a workflow that costs $6 and one that wastes $200 on the same task comes down to how you decompose the work.

The instinct is to give one agent the whole job. For simple tasks, that works. For work involving multiple data sources, different tools, or output that needs to be consistent across runs, it falls apart.

Each step should have exactly one job and one output file. Why? Single-purpose steps are debuggable, retryable, and parallelizable.

We tried a 3-step version (research, synthesize, deliver) where one agent juggling Brave, Reddit, X, and Ahrefs produced wildly inconsistent coverage. A 10-step version hit the opposite problem: handoff overhead exceeded the value. Six steps was the sweet spot for this workflow.

  1. Gavin Belson (orchestrator), Claude Opus, ~2 min

    Human creates a task. Orchestrator drafts scope: which competitor, what questions, what 'done' looks like. The proposal gate exists because killing a bad analysis at minute 2 saves 40 minutes of wasted compute.

  2. Human Review Gate

  3. Nelson Bighetti (marketing researcher), Claude Sonnet, ~8-15 min

    Brave Search + Reddit (r/SaaS, r/artificial, r/smallbusiness) + X/Twitter. Minimum 15 distinct sources. Extracts positioning, funding, features, pricing signals, community sentiment.

    Tools: Brave Search, Reddit API, X/Twitter API

  4. Nelson Bighetti (marketing researcher), Claude Sonnet, ~5-10 min

    Pulls Toyo's baseline first for comparison. Then for the competitor: DR, traffic, keywords, pages, content gaps. The SEO gate exists because you want a human to approve scrape targets before sending headless browsers crawling production websites.

    Tools: Ahrefs API (5 of 108 tools)

  5. Human Review Gate

  6. Nelson Bighetti (marketing researcher), Claude Sonnet, ~10-20 min

    Playwright scrapes pricing, features, top pages from SEO research, Reddit threads. CAPTCHAs, login walls, and rate limits are the norm. The agent documents what's blocked and moves on.

    Tools: Playwright (headless Chromium)

  7. Laurie Bream (business strategist), Claude Opus, ~5-8 min

    Takes the three research files for a single company (web-research.md, seo-research.md, deep-scrape.md) and consolidates them into one standardized profile. Cross-references data between sources, resolves conflicts, populates every field. No empty sections. 'Unknown' rather than omitting.

  8. Gavin Belson (orchestrator), Claude Sonnet, ~3-5 min

    Checks expected outputs exist, writes to SQLite database, updates the competitor dashboard, opens a PR with profile files. Runs even if prior steps failed. Your workflow should produce artifacts, not just text.

The workflow outlined above produces a single competitor profile. A separate synthesis workflow reads all completed profiles and produces the cross-competitor analysis: the comparison tables, keyword overlaps, threat tiers, and content roadmap you saw in the demo report. Same architecture, different playbooks.
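To make "one job, one output file" concrete, here is a minimal sketch of how such a pipeline could be declared as data and driven by plain code. It is not our engine's actual implementation. The step names `proposal` and `deliver`, the file names `scope.md` and `delivery.md`, and the `run_agent` and `approve` callables are hypothetical placeholders; the other step and file names follow the ones used in this guide.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str                 # step id
    output: str               # the single file this step must write
    gate: bool = False        # pause for human approval after this step
    always_run: bool = False  # run even if an earlier step failed


STEPS = [
    Step("proposal", "scope.md", gate=True),
    Step("web-research", "web-research.md"),
    Step("seo-research", "seo-research.md", gate=True),
    Step("deep-scrape", "deep-scrape.md"),
    Step("profile-consolidate", "profile.md"),
    Step("deliver", "delivery.md", always_run=True),
]


def run_workflow(steps: list[Step], run_dir: Path,
                 run_agent: Callable[[Step, Path], None],
                 approve: Callable[[Step], bool]) -> list[str]:
    """Run steps in order. Code, not an LLM, decides what happens next."""
    log, failed = [], False
    for step in steps:
        if failed and not step.always_run:
            log.append(f"{step.name}: skipped")
            continue
        try:
            run_agent(step, run_dir)
            if not (run_dir / step.output).exists():
                raise RuntimeError("missing output file")
            log.append(f"{step.name}: ok")
        except Exception as exc:
            failed = True
            log.append(f"{step.name}: failed ({exc})")
            continue
        if step.gate and not approve(step):
            failed = True
            log.append(f"{step.name}: rejected at gate")
    return log
```

Because each step is judged only by whether its output file exists, a failed step can be retried on its own and the log shows exactly where a run stopped.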

When should you split a step?

  1. Does it use a different tool? → Probably a separate step.
  2. Would you retry it independently if it failed? → Definitely separate.
  3. Must the output be inspectable on its own? → Separate step.
  4. Would combining it require significantly more context? → Separate step.
Decision #2

What should the LLM decide vs. what should be scripted?

Every step in your workflow sits somewhere on a spectrum between fully deterministic (code decides) and fully LLM (the model decides). Where should you draw the line?

The instinct is to let the LLM handle everything. It’s capable, so why constrain it? Because capable and consistent are different things. Even when an LLM produces a correct answer every time, it won’t produce the same correct answer every time. And when you’re running the same workflow across 29 competitors, “different but correct” is the same as broken.

Here’s what that looked like for us:

Ahrefs exposes 108 tools through its MCP server. Early on, we let our SEO research agent pick whichever ones seemed relevant to the job. The problem showed up on the second run: the first report pulled Domain Rating, organic traffic, and top keywords. The second pulled Domain Rating, backlinks, and content gaps. Both reports were accurate, but we couldn’t put them side by side — we were measuring competitors with different rulers.

If your workflow produces different outputs from the same input, the LLM is probably making decisions that should be locked down in the playbook.

Our ratio: ~60% deterministic, 40% LLM. If yours is above 50% LLM, expect consistency problems.

From fully deterministic to fully LLM:

  • Step Dispatch: Pure code. Engine follows the graph. No LLM decides what happens next.
  • Ahrefs Queries: Playbook names exactly 5 of 108 tools. Agent doesn’t freestyle.
  • Web Scraping: Deterministic navigation. LLM extraction, because every pricing page has different markup.
  • Profile Synthesis: Pure LLM. Cross-referencing sources, resolving conflicts, writing analysis.
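One way to enforce "the agent doesn't freestyle" is to compare the tools an agent actually called with the list the playbook names. The sketch below assumes you can get the called tool names out of your framework's run log; that log access and the helper name `check_tool_calls` are assumptions. The five tool IDs are the ones listed in the seo-research playbook excerpt later in this guide.

```python
ALLOWED_SEO_TOOLS = (
    "site-explorer-domain-rating",
    "site-explorer-metrics",
    "site-explorer-organic-keywords",
    "site-explorer-pages-by-traffic",
    "site-explorer-referring-domains",
)


def check_tool_calls(called: list[str],
                     allowed: tuple[str, ...] = ALLOWED_SEO_TOOLS) -> dict[str, list[str]]:
    """Report tools called outside the playbook and playbook tools never called."""
    return {
        "unexpected": sorted(set(called) - set(allowed)),
        "missing": [tool for tool in allowed if tool not in called],
    }
```

A run is comparable with earlier runs only when both lists come back empty; anything else is a reason to fail the step before QA even looks at the prose.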

How do you decide which steps to script?

Every step in a workflow has a natural home on the spectrum above. Get the assignment wrong and you pay for it either way: script a task that needs judgment and the agent breaks on the first edge case; let the LLM own a task that should be deterministic and the output drifts every run. The rule of thumb below makes the call easy.

When to use a script?

  • Action is the same every time
  • Consistency > creativity
  • You need to compare outputs across different workflow runs
  • Failure modes need to be predictable

When to call an LLM?

  • Input is unstructured (arbitrary HTML)
  • Judgment required
  • Task is creative
  • No two inputs look the same

How it works: OpenClaw MCP servers

MCP (Model Context Protocol) is what connects AI agents to external tools. Instead of giving an agent a vague instruction to “look up SEO data,” MCP exposes specific, callable tools like site-explorer-domain-rating and site-explorer-metrics that return structured data.

Our system runs 6 MCP servers in production:

  • Ahrefs: SEO data, keywords, backlinks
  • Playwright: Scraping, screenshots
  • Brave Search: Company research
  • Reddit: Community sentiment
  • X/Twitter: Social signals
  • QMD: Long-term context

MCP is what makes deterministic tool calls possible. The playbook names specific MCP tools. The agent calls them. No guessing, no freestyle. MCP is an open standard — the same servers work with Hermes, Claude Code, and any MCP-compatible agent. The tool integrations you build are portable across frameworks.

Decision #3

Should your agents be persistent or disposable?

The biggest cost driver in any multi-step agent system is whether your agents remember what they did last time.

A persistent agent carries every prior message forward. Every call hauls the whole history. Costs compound. Crashes lose everything. Parallelism is impossible.

An ephemeral agent starts fresh each step and writes its output to a file. Costs stay flat. Crashes recover. You can run four at once.

The 86 million token incident

On March 28, 2026, one of our agents spent the afternoon editing an article — small changes, a word here, a sentence there. By the end of the day it had burned 86 million tokens. Roughly $400 of work that should have cost $20.

The cause: every edit shipped the agent’s entire conversation history to the API. The first edit of the day cost ~100K tokens. The last cost 2.5M. Same operation, 25× more expensive, because the context window kept accumulating. A fresh ephemeral agent would have used ~500K tokens total for those 30 edits. The persistent agent used 86M — a 170× overhead.
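The shape of that cost curve is easy to see with a toy model. The sketch below assumes one API call per edit and a history that grows by a fixed amount each call; both are simplifications, and the example numbers are illustrative rather than a reconstruction of the incident.

```python
def persistent_tokens(calls: int, first_call: int, growth_per_call: int) -> int:
    """Total input tokens when every call resends the whole, growing history."""
    return sum(first_call + i * growth_per_call for i in range(calls))


def ephemeral_tokens(calls: int, per_call: int) -> int:
    """Total input tokens when every call starts from a fresh, fixed-size context."""
    return calls * per_call


# Illustrative: 10 calls, 10K tokens each when fresh, history grows 10K per call.
persistent = persistent_tokens(10, 10_000, 10_000)   # 550_000
ephemeral = ephemeral_tokens(10, 10_000)              # 100_000
overhead = persistent / ephemeral                     # 5.5

# The ratio reported above: 86M tokens used vs ~500K for fresh agents.
incident_overhead = 86_000_000 / 500_000              # 172.0
```

The persistent total grows with the square of the number of calls while the ephemeral total grows linearly, which is why the gap widens the longer a session runs.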

We switched to ephemeral agents the same week, and daily token usage dropped 10× overnight. The same failure mode shows up on any framework that keeps an agent alive across steps — Hermes, IronClaw, a LangGraph chain. It’s a property of how LLMs handle context, not a bug in any one tool.

Token accumulation is one of three structural reasons persistent agents don’t survive production:

Every Action Gets More Expensive

First API call: 50K tokens. By afternoon: 750K tokens. Same call. Persistent agents have a compounding cost curve that makes them economically irrational for production.

One Crash Kills Everything

Persistent agent doing steps 1-4 crashes at step 3. Steps 1-2 exist only in conversation context, gone. With ephemeral agents, steps 1-2 are files on disk. Retry step 3 with a fresh agent.

You Can’t Parallelize

Persistent agent is a single thread. 29 competitors queue sequentially. Ephemeral agents: 4 concurrent = 7x throughput on batch work. This is the difference between 5 days and 5 weeks.
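Crash recovery and parallelism both fall out of the same property: each unit of work ends as a file on disk. A minimal sketch, assuming a hypothetical `build_profile` callable that runs one competitor's workflow and returns the profile text:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def run_batch(competitors: list[str], out_dir: Path,
              build_profile: Callable[[str], str],
              workers: int = 4) -> dict[str, str]:
    """Build one profile per competitor, skipping any whose file already exists."""
    out_dir.mkdir(parents=True, exist_ok=True)

    def one(name: str) -> str:
        target = out_dir / f"{name}.md"
        if target.exists():
            # Finished in an earlier run: nothing to redo after a crash.
            return "cached"
        target.write_text(build_profile(name), encoding="utf-8")
        return "built"

    # Up to `workers` competitors are processed at the same time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(competitors, pool.map(one, competitors)))
```

If one competitor fails, the files already written stay on disk, so re-running the batch only repeats the unfinished work.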

Why 9 distinct agent roles, not 1 or 3?

A single general-purpose agent produces mediocre results at everything. A hyper-specialized agent per step produces excellent results at one thing. The trade-off is prompt engineering effort: each role needs a template, each step needs a playbook. We settled on 9 roles across 13 workflows. The roles are reusable; the playbooks aren’t.

OpenClaw skills: composable role + playbook combinations

The system uses a composable markdown skill architecture. Each agent gets two documents: a role template (who you are, your quality standards, your tools) and a step playbook (what to do right now, in what order, with what output format). From 9 roles and 46 playbooks, the system can assemble over 400 distinct agent configurations.

When you improve a role template, every workflow using that role gets better. When you improve a step playbook, only that step improves. Different rates of change, different scopes. This separation is what makes the system scale without requiring a rewrite every time you add a new workflow.
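The composition itself can be very small. The sketch below assumes a role template and a step playbook are plain markdown strings that get concatenated into one prompt; `compose_prompt` is a hypothetical name, and the count assumes any role may be paired with any playbook, which is an upper bound rather than a claim that every pairing is useful.

```python
from itertools import product


def compose_prompt(role_template: str, step_playbook: str) -> str:
    """Join a role template (who you are) with a step playbook (what to do now)."""
    return f"{role_template.strip()}\n\n{step_playbook.strip()}\n"


# Placeholder ids standing in for the 9 roles and 46 playbooks.
roles = [f"role-{i}" for i in range(1, 10)]
playbooks = [f"playbook-{i}" for i in range(1, 47)]

# 9 roles x 46 playbooks = 414 possible pairings ("over 400").
configurations = len(list(product(roles, playbooks)))
```

Keeping the two documents separate until prompt-assembly time is what lets a role improvement reach every workflow while a playbook fix stays local to one step.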

| Role | Character | Model | Used In |
|------|-----------|-------|---------|
| Lead Engineer | Richard Hendricks | Claude Opus / Sonnet | Dev workflows, bug fixes |
| Frontend Engineer | Dinesh Chugtai | Claude Sonnet | Dev workflows, bug fixes |
| Backend Engineer | Bertram Gilfoyle | Claude Sonnet | Dev workflows, bug fixes |
| QA Reviewer | Erlich Bachman | Claude Opus | All workflows (verify steps) |
| Marketing Director | Russ Hanneman | Claude Sonnet | Content, strategy |
| Marketing Researcher | Nelson Bighetti | Claude Sonnet / Opus | Competitive analysis, SEO, content |
| Business Strategist | Laurie Bream | Claude Opus | Strategy, opportunity research |
| UI Designer | Monica Hall | Claude Sonnet | Web design |
| Web Copywriter | Jared Dunn | Claude Sonnet | Web design |

What each agent doesn’t know

Ephemeral agents lose institutional context within a workflow. The Asana agent doesn’t know Basecamp exists. It can’t say “unlike Basecamp, Asana takes a different approach” because it’s never seen the Basecamp profile. Cross-competitor awareness only arrives at the synthesis layer. That limitation is the price of not burning 86M tokens.

Decision #4

How do you give agents enough context without drowning them?

Every token of context costs money. A persistent agent carrying 500K tokens of history turns a $0.50 task into a $2 task.

The entire architecture is shaped by one question: how do you give each agent exactly the context it needs and nothing more?

Within-Run: Files on Disk

Web-research writes 5KB. SEO writes 5KB. Deep-scrape writes 15KB. Profile-consolidate reads all three (~25KB). At roughly 4 characters per token, that’s only a few thousand tokens of relevant context. A persistent agent would be carrying 500K.
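Assembling within-run context can be as simple as reading the files a step declares as inputs and nothing else. A minimal sketch, with hypothetical helper names and a rough heuristic of about 4 characters per token (an assumption; real tokenizers vary by model and language):

```python
from pathlib import Path


def build_context(run_dir: Path, inputs: list[str]) -> str:
    """Concatenate only the prior-step files this step declares as inputs."""
    parts = []
    for name in inputs:
        text = (run_dir / name).read_text(encoding="utf-8")
        parts.append(f"## Input: {name}\n\n{text.strip()}\n")
    return "\n".join(parts)


def rough_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough size estimate, good enough for budgeting and alerts."""
    return round(len(text) / chars_per_token)
```

For the consolidation step, `inputs` would be the three research files named above; logging `rough_tokens(build_context(...))` per step makes context growth visible before it shows up on the bill.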

Across-Runs: Indexed Markdown

Orchestrator only. Markdown files indexed by hybrid search: BM25 + vector embeddings + LLM reranking. Workers get zero long-term memory. The format matters less than the discipline.

What to remember, what to look up again

| If it’s... | Then... | Because... |
|------------|---------|------------|
| Fact that changes (traffic, pricing) | Re-derive from source | Yesterday’s data is already stale |
| Decision (“EasyAsk is closed”) | Persist in memory | Expensive to reconstruct reasoning |
| Procedure (“use global volume”) | Persist in memory | Avoid repeating the same mistake |
| Intermediate computation | Pass as file to next step | Only needed within this run |
| Cross-run pattern | Persist in database | Prevents duplicate work |

In practice, teams persist everything (too expensive, memory becomes noisy) or persist nothing (agents repeat mistakes). The discipline is deciding for each piece of information which category it falls into.

Decision #5

How do you validate output when the worker is an LLM?

LLMs hallucinate confidently. Traffic numbers off by 10×, pricing two years out of date, sources that don’t exist — all rendered in clean prose. Nothing in the output tells you it’s wrong. That signal has to come from somewhere outside the LLM.

A missing data section looks the same as a complete one if you’re not checking against a schema. A hallucinated pricing tier is indistinguishable from a real one unless you compare it to the scraped source.

You can’t eyeball LLM output at scale. You need automated QA. But automated QA has its own limits. We score every output against a weighted rubric across four dimensions:

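| Dimension | Weight | What it checks |
|-----------|--------|----------------|
| Completeness | 30% | Did the agent cover everything the playbook required? |
| Accuracy | 25% | Are the facts correct? Do the numbers match the sources? |
| Clarity | 20% | Is it well-structured and readable? |
| Domain | 25% | Voice, source integration, data support |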
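The weighted score is a straightforward weighted average. A minimal sketch, assuming each dimension is scored 0 to 100 by the QA agent; the pass threshold of 75 is a hypothetical value, since this guide only shows that 58 fails and 81 passes.

```python
# Weights in percent, as listed in the rubric above.
WEIGHTS = {"completeness": 30, "accuracy": 25, "clarity": 20, "domain": 25}
PASS_THRESHOLD = 75  # hypothetical cutoff, not stated in this guide


def rubric_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-100) into one weighted 0-100 score."""
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected dimensions {sorted(WEIGHTS)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS) / 100


def passes(scores: dict[str, float]) -> bool:
    return rubric_score(scores) >= PASS_THRESHOLD


# Made-up per-dimension scores that happen to total 58.
example = {"completeness": 40, "accuracy": 60, "clarity": 80, "domain": 60}
total = rubric_score(example)  # (1200 + 1500 + 1600 + 1500) / 100 = 58.0
```

Weighting completeness highest means a missing section costs more than awkward prose, which matches how the rubric above is prioritized.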

What a QA failure looks like in practice

qa-reviewer: erlich (opus)
FAIL | Score: 58/100
[HIGH] Pricing section incomplete: only 2 of 4 tiers listed. deep-scrape.md lines 23-41 contain the full pricing table with Personal (free), Starter ($10.99/user/mo), Advanced ($24.99/user/mo), and Enterprise (custom). Include all 4 tiers.
[MED] SWOT Opportunities section thin: 2 items where research supports 4-5 concrete opportunities.
[LOW] Traffic value format inconsistent: use “$X,XXX,XXX” format.
retry 1 → PASS | Score: 81/100

Above: the first pass scored 58/100 with three severity-tagged findings. Those specific findings — not a generic “improve quality” note — get handed to the retry agent, which passes on retry 1 at 81/100. “Lines 23-41 have the data you missed, include all 4 tiers” is useful. “Improve quality” is useless. Retries score 15-20 points higher. Max 3 retries, ~95% pass by retry 2 — a reliability rate on work that used to require a $150/hr analyst.
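The retry loop described above can be sketched in a few lines. This assumes two hypothetical callables: `run_agent`, which takes a prompt and returns the agent's output, and `review`, which returns a pass flag plus the findings text. Neither is an OpenClaw API.

```python
from typing import Callable

MAX_RETRIES = 3  # from this guide: max 3 retries


def run_with_qa(prompt: str,
                run_agent: Callable[[str], str],
                review: Callable[[str], tuple[bool, str]]) -> tuple[str, int]:
    """Return (output, retries_used); raise if QA still fails after MAX_RETRIES."""
    current = prompt
    for attempt in range(MAX_RETRIES + 1):  # first attempt plus up to 3 retries
        output = run_agent(current)
        passed, findings = review(output)
        if passed:
            return output, attempt
        # Retry with the specific findings appended, never the bare prompt.
        current = f"{prompt}\n\n## Previous Attempt Failed\n{findings}"
    raise RuntimeError(f"QA still failing after {MAX_RETRIES} retries")
```

Each retry is built from the original prompt plus the latest findings only, so failure reports do not pile up across attempts and inflate the context.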

QA catches what’s wrong. Humans catch what’s missing.

Automated QA finds missing sections, stale data, AI writing patterns. It will never catch whether the analysis is interesting, whether strategic recommendations are right, or whether the content will resonate. Human gates exist alongside QA because they catch a different category of failure.

Concrete Examples

How to structure your prompts

Every architectural decision from the previous sections manifests in the prompt. Four rules cover most of what we learned — each paired with a before/after from our real prompts.

1. Name the exact tools and counts

Agents will guess at specifics if you let them. Name the tool by its exact ID. Set the count as a number. The vaguer the instruction, the more decisions the LLM makes on your behalf — and the more those decisions drift between runs.

Before: vague research brief

# Task
Research the SEO presence of Basecamp.
Pull keyword data and traffic numbers.
Report your findings.

After: tool-by-tool execution plan

# Step: seo-research

## Your Job
0. First, pull Toyo's own baseline — run against toyo.ai:
   a. site-explorer-domain-rating → DR + Ahrefs rank
   b. site-explorer-metrics → organic traffic, keyword count
   c. site-explorer-organic-keywords → top 20 keywords
   d. site-explorer-pages-by-traffic → top 10 pages
   e. site-explorer-referring-domains → count

1. For each competitor domain, pull:
   a. Domain overview: DR, monthly traffic, referring domains
   b. Top organic keywords (top 20 by traffic)
   c. Top pages by organic traffic (top 10)
   d. Content gap analysis vs toyo.ai

2. Template the output, don’t describe it

Asking for “a comprehensive profile” gets you five different shapes across five runs. Hand the agent a markdown table with headers and empty cells, and every run fills the same slots. Structure on the way out is what lets you compare runs at all.

Before: vague output instruction

Write a comprehensive competitor profile covering
all relevant aspects of the company.

After: rigid schema

## Company Overview
| Field | Value |
|-------|-------|
| Company Name | |
| Domain | |
| Founded | |
| Funding/Stage | |
| Team Size | |

## SEO Metrics
| Metric | Value |
|--------|-------|
| Domain Rating (DR) | |
| Monthly Organic Traffic | |
| Referring Domains | |

Every field MUST be populated.
"Unknown" for missing data — never omit.

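The "every field MUST be populated" rule is cheap to check in code before an LLM reviewer ever sees the output. A minimal sketch, assuming profiles use two-column markdown tables with `Field` or `Metric` header rows as in the template above; `empty_fields` is a hypothetical helper name and requires Python 3.9 or later.

```python
def empty_fields(profile_md: str) -> list[str]:
    """List table fields whose value cell is blank in a filled-in profile."""
    missing = []
    for line in profile_md.splitlines():
        row = line.strip()
        if not row.startswith("|"):
            continue  # headings and prose are not table rows
        cells = [c.strip() for c in row.removeprefix("|").removesuffix("|").split("|")]
        if len(cells) < 2:
            continue
        name, value = cells[0], cells[1]
        if set(name) <= set("-: ") or name in ("Field", "Metric"):
            continue  # separator row or header row
        if not value:
            missing.append(name)
    return missing
```

A value of "Unknown" counts as populated, which matches the template's rule; only truly blank cells are reported.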
3. Every constraint is a bug fix for a behavior you’ve seen

Most lines in a production system prompt exist because we watched the agent fail without them. “Execute one step, then terminate.” “Do not redo prior work.” “Never omit fields — use Unknown.” Each sentence has a story.

Before: no system context

You are a helpful AI assistant.
Please complete the following task.

After: the Elvis preamble (our agent architecture)

You are an ephemeral agent in the Elvis architecture.
- Execute ONE step, then terminate
- Prior step outputs provided below — use them, do not redo prior work
- Follow the output format exactly
- Do not coordinate with other agents or delegate work

4. Feed failures back into the retry

When QA rejects an output, don’t re-run the same prompt. Append the failure report — what scored low, why, and what specifically to fix — and let the agent retry with that context. A retry without feedback is just a hopeful second dice roll.

Before: first attempt failed, retried blind

[Original prompt resubmitted with no specific feedback]

After: retry with failure context

[Original prompt]

## Previous Attempt Failed
Score: 58/100 — FAIL

[HIGH] Pricing section incomplete — only 2 of 4 tiers.
deep-scrape.md lines 23-41 contain full pricing.
Include all 4 tiers with exact pricing.

[MED] SWOT Opportunities thin — research supports
4-5 concrete opportunities, not 2.

[LOW] Traffic value format: use "$X,XXX,XXX".
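Building that failure section from structured QA findings might look like the sketch below. The `Finding` type and `retry_prompt` helper are hypothetical; the layout loosely follows the example above, with the most severe findings listed first.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"HIGH": 0, "MED": 1, "LOW": 2}


@dataclass(frozen=True)
class Finding:
    severity: str  # "HIGH", "MED", or "LOW"
    message: str   # what is wrong and exactly what to fix


def retry_prompt(original_prompt: str, score: int, findings: list[Finding]) -> str:
    """Append a severity-sorted failure report to the original prompt."""
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
    report = "\n\n".join(f"[{f.severity}] {f.message}" for f in ordered)
    return (
        f"{original_prompt.rstrip()}\n\n"
        f"## Previous Attempt Failed\n"
        f"Score: {score}/100 - FAIL\n\n"
        f"{report}\n"
    )
```

Keeping findings structured until the last moment also lets you count them by severity over time, which is a useful signal for which playbook lines need tightening.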

One more test. Would two different LLMs produce structurally similar output from your prompt? If not, the prompt is carrying too much of the decision — tighten it until the answer is yes.

From guide to product

Where do you go from here?

If this page was interesting, what you do next depends on how technical you are and how you value your time.

Build it yourself

This guide is the spec.

This guide is the full spec. Nine roles, thirteen workflows, forty-six playbooks. Expect 200+ engineering hours before it runs reliably.

Use what we built

Toyo is this system, productized.

Toyo productizes the whole system. Competitive intelligence, SEO research, content, and growth experiments — ready to run, no setup.

Read the whole guide? Skip the line.

We’re granting access to Toyo in small batches. Sign up now, and mention this guide in your onboarding call — we’ll move you to the top of the next batch.