Multi-Agent System Design Patterns and Orchestration

"Build an agent" is the wrong unit of design. The right unit is the orchestration: the system that decides which agent runs, with what context, under whose authority, and with what reversal cost if it gets it wrong. Single-agent systems are a special case of multi-agent systems where the orchestration is trivial. Once you have two or more agents — or one agent that calls tools — orchestration becomes the load-bearing component, and the failure modes you should care about are concentrated there.

This essay covers the four design patterns that recur in production multi-agent systems, the arbitration problem at the center of all of them, and the safety layers that keep humans on the hook for outcomes the agents are not qualified to own.

The four orchestration patterns

Pattern 1: Router

One model classifies each incoming request and dispatches it to a specialist agent. Router patterns are the simplest to build and the most common to deploy. The router is usually a smaller, cheaper model; the specialists handle the actual work. The win is cost: you don't run an expensive reasoning model on every request, only on requests that need one.

The hidden failure mode of routers is misclassification accumulation. Routers that are 95% accurate sound great until you stack three of them in a pipeline, at which point end-to-end accuracy is about 86% (0.95 × 0.95 × 0.95 ≈ 0.857, assuming the stages fail independently). In production, router error compounds across stages and is invisible at any single stage. Mitigation: log every routing decision with the input that produced it, and audit a stratified sample weekly. Without this, you will find the misclassification only after a customer escalation.
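The compounding arithmetic and the weekly audit can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the independence assumption, the log shape (a dict with "input" and "route" keys), and the function names are all hypothetical.

```python
import random
from collections import defaultdict


def end_to_end_accuracy(stage_accuracies):
    """Chance that every stage routes correctly, assuming independent errors."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


def stratified_sample(decisions, per_route, seed=0):
    """Pick up to `per_route` logged decisions from each route for manual audit."""
    by_route = defaultdict(list)
    for decision in decisions:
        by_route[decision["route"]].append(decision)
    rng = random.Random(seed)  # fixed seed so an audit sample can be regenerated
    sample = []
    for route in sorted(by_route):
        entries = by_route[route]
        sample.extend(rng.sample(entries, min(per_route, len(entries))))
    return sample


# Hypothetical routing log: each entry keeps the input that produced the decision.
routing_log = [
    {"input": "refund for order 1", "route": "billing"},
    {"input": "charged twice", "route": "billing"},
    {"input": "update my card", "route": "billing"},
    {"input": "is this clause enforceable", "route": "legal"},
]

print(round(end_to_end_accuracy([0.95, 0.95, 0.95]), 4))  # 0.8574
print(len(stratified_sample(routing_log, per_route=2)))   # 3
```

Sampling per route rather than uniformly matters because rare routes are where misclassifications hide; a uniform sample would be dominated by the busiest specialist.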

Pattern 2: Pipeline

Sequential agents, each doing a step. Agent A retrieves, Agent B summarizes, Agent C drafts, Agent D verifies. Pipelines are the workhorse of production agentic systems because they map naturally onto the way humans already think about workflows.

The pipeline's defining vulnerability is context degradation. Each stage compresses or transforms the artifact — and information lost early cannot be recovered late. The verifier in stage D is verifying the output of stage C, not the original input. If stage A made a retrieval mistake, stage D will confidently approve a polished version of the wrong answer. Mitigation: every stage carries forward not just its output, but a provenance trail that the verifier can inspect. Verifiers that only see the last stage's output are decorative.
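One way to carry a provenance trail is to make the artifact itself record what each stage saw and produced. The sketch below is an assumption about how that could look; the `Artifact` and `Step` types, the stage names, and the toy substring check are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    stage: str
    input_text: str
    output_text: str


@dataclass
class Artifact:
    original_input: str
    content: str
    trail: list = field(default_factory=list)

    def advance(self, stage, new_content):
        """Return a new artifact whose trail records what this stage saw and produced."""
        step = Step(stage, self.content, new_content)
        return Artifact(self.original_input, new_content, self.trail + [step])


def verify(artifact, claim):
    """Toy verifier: the claim must appear in the draft AND in what was retrieved."""
    retrieved = [s.output_text for s in artifact.trail if s.stage == "retrieve"]
    if not retrieved:
        return False  # nothing to check the draft against
    return claim in artifact.content and all(claim in text for text in retrieved)


question = "How long is the refund window?"
good = (
    Artifact(question, question)
    .advance("retrieve", "Policy: 30 day refund window on all orders.")
    .advance("draft", "You have a 30 day refund window.")
)
bad = (
    Artifact(question, question)
    .advance("retrieve", "Policy: 14 day return window on all orders.")
    .advance("draft", "You have a 30 day refund window.")
)
print(verify(good, "30 day"), verify(bad, "30 day"))  # True False
```

A verifier that saw only the draft would pass both artifacts; the trail is what lets it notice that the second draft is not supported by what stage A retrieved.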

Pattern 3: Debate / Arbitration

Two or more agents independently produce candidates. A third agent (or a human, or a deterministic check) selects between them. Debate patterns improve accuracy on hard problems by exploiting the fact that different agents fail in different ways — but only if the agents are actually diverse. Two instances of the same model with different prompts are not diverse enough; they share too many failure modes.

The arbiter is the most important component in this pattern. A weak arbiter reduces the system to little better than a coin flip: it picks the more confident-sounding answer regardless of correctness. A strong arbiter has access to grounding signals: examples of correct answers, deterministic checks, or a different inductive bias than either candidate. For high-stakes domains, the arbiter should be a human, not a model. Models cannot reliably arbitrate problems that they would have answered wrong themselves.
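As a rough sketch of an arbiter grounded in a deterministic check rather than in how confident a candidate sounds, consider the following. The `arbitrate` function and the square-root example are hypothetical; the point is only that the arbiter declines to choose when the check cannot separate the candidates.

```python
from typing import Callable, Optional


def arbitrate(candidates: list[str], check: Callable[[str], bool]) -> Optional[str]:
    """Return the one answer that passes `check`, or None to escalate to a human."""
    passing = {c for c in candidates if check(c)}  # set: agreeing candidates collapse
    if len(passing) == 1:
        return passing.pop()
    return None  # nothing passed, or distinct answers passed: do not guess


def is_sqrt_of_144(answer: str) -> bool:
    """Deterministic check for the toy question 'what is the square root of 144?'."""
    return answer.isdigit() and int(answer) ** 2 == 144


print(arbitrate(["12", "14"], is_sqrt_of_144))  # 12
print(arbitrate(["13", "14"], is_sqrt_of_144))  # None
```

Returning `None` instead of a best guess is the design choice that keeps a human in the loop for exactly the cases the check cannot settle.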

Pattern 4: Plan-and-Execute

A planner agent decomposes a goal into steps. An executor agent (or set of agents) carries them out. A monitor checks progress against the plan and re-plans when reality deviates. This is the pattern most people mean when they say "agentic" — it is also the pattern with the highest variance in production reliability.

Plan-and-execute systems work brilliantly when the planning surface is well-defined (coding tasks with clear acceptance tests, scoped research with explicit deliverables) and fail badly when the planner has to guess about the operating environment. Production deployments of this pattern almost always include a plan budget — a hard cap on the number of steps, tools, or tokens the system can spend before requiring human re-approval. Without a budget, plan-and-execute systems run away.
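A plan budget can be as small as a counter that refuses further charges. The sketch below assumes a plan is a list of (step name, token cost) pairs and that exceeding either cap should pause for human re-approval; `PlanBudget`, `BudgetExceeded`, and `run_plan` are illustrative names, not an established API.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when the next step would push the plan past its cap."""


@dataclass
class PlanBudget:
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        """Reserve one step and `tokens` tokens, or raise before anything runs."""
        if self.steps_used + 1 > self.max_steps:
            raise BudgetExceeded("step cap reached")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token cap reached")
        self.steps_used += 1
        self.tokens_used += tokens


def run_plan(steps, budget):
    """Run steps in order; return (completed step names, needs_human_approval)."""
    completed = []
    for name, token_cost in steps:
        try:
            budget.charge(token_cost)
        except BudgetExceeded:
            return completed, True  # stop and hand control back to a human
        completed.append(name)  # a real executor would perform the step here
    return completed, False


plan = [("search", 400), ("read", 400), ("summarize", 400)]
print(run_plan(plan, PlanBudget(max_steps=5, max_tokens=1000)))
# (['search', 'read'], True)
```

Charging before the step runs, rather than after, means the cap is never exceeded even by one step.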

The arbitration problem

All four patterns reduce to the same question at runtime: when two components disagree, who wins? This is the arbitration problem, and most multi-agent systems get it wrong by treating arbitration as an implementation detail. It isn't. It is the governance of the system, expressed in code.

Three rules I apply when designing arbitration:

Arbitration must be observable. Every disagreement that triggered arbitration is logged — including which side won, why, and what the alternative was. If you can't reconstruct an arbitration decision a month later, it doesn't exist.
Arbitration must be reversible at the boundary. The first arbitration that touches a customer should be made by a human, or should come with an obvious "this looks wrong" path. Automated arbitration deep in the pipeline is fine; automated arbitration at the customer-facing edge is not.
Arbitration logic is policy, not code. The rules for who wins must be written in language a non-engineer can read and a compliance team can sign off on. If the only people who understand your arbitration logic are the engineers who wrote it, your system has no governance — it has a single point of human failure.
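To make the first and third rules concrete, here is a minimal sketch in which the precedence rules live in a plain-language table and every decision is appended to a log with the losing proposals intact. The component kinds, the rule wording, and the record fields are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical policy: earlier entries outrank later ones. The wording is meant
# to be readable by a compliance reviewer, not only by engineers.
POLICY = [
    ("human_reviewer", "A human reviewer's decision is final."),
    ("deterministic_check", "A deterministic check outranks any model output."),
    ("specialist_model", "A specialist model outranks the general model."),
    ("general_model", "The general model wins only when nothing else has spoken."),
]


@dataclass(frozen=True)
class ArbitrationRecord:
    decision_id: str
    proposals: dict  # component kind -> what it proposed (winner and losers)
    winner: str
    rule: str        # the plain-language rule that decided it
    timestamp: str


def arbitrate_and_log(decision_id, proposals, timestamp, log):
    """Pick the highest-ranked component that proposed something, and log why."""
    for kind, rule in POLICY:
        if kind in proposals:
            record = ArbitrationRecord(decision_id, dict(proposals), kind, rule, timestamp)
            log.append(json.dumps(asdict(record), sort_keys=True))
            return record
    raise ValueError("no recognized component proposed an answer")


audit_log = []
record = arbitrate_and_log(
    "d-1",
    {"general_model": "approve", "deterministic_check": "reject"},
    "2024-01-01T00:00:00Z",  # illustrative timestamp
    audit_log,
)
print(record.winner)  # deterministic_check
```

Because the log line stores the rule text alongside the winner, a reviewer reading it a month later does not need the source code to see why one side won.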

Safety layers that survive contact with reality

The safety architecture for any multi-agent system has four layers. Most production failures I've reviewed happened because one of these layers was missing, or was assumed to exist and never checked.

Layer 1: Input filtering

Reject inputs that the system is not qualified to handle. This is unsexy and essential. The most common production failure isn't the agent producing wrong output — it is the agent producing plausible output for an input it should have refused. Maintain an explicit allowlist of input types, and reject everything else with a clear message. "We don't handle this" is a feature.
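An allowlist filter is short enough to show in full. In this sketch the input types and the refusal wording are hypothetical, and the classification of a request into an input type is assumed to happen upstream.

```python
# Hypothetical allowlist: anything not named here is refused.
ALLOWED_INPUT_TYPES = {"billing_question", "password_reset", "order_status"}


def admit(input_type):
    """Return (accepted, message). Refusals carry a clear, user-facing reason."""
    if input_type in ALLOWED_INPUT_TYPES:
        return True, ""
    return False, f"We don't handle '{input_type}' requests here."


print(admit("order_status"))  # (True, '')
print(admit("legal_advice"))  # (False, "We don't handle 'legal_advice' requests here.")
```

The default is refusal: a new input type has to be added to the set deliberately before any agent sees it.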

Layer 2: Action gating

Distinguish actions by reversibility. Reading a file changes nothing, so there is nothing to undo; sending an email cannot be recalled. Drafting a contract is reversible; signing it is not. Action gating means the agent can take low-reversal-cost actions autonomously, but high-reversal-cost actions require explicit sign-off from a human or a strongly typed deterministic check. The bar for "high reversal cost" is lower than most teams set it.
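A gate along these lines might look like the sketch below. The action names mirror the examples in the text; the `Reversal` enum, the `gate` function, and the choice to treat unknown actions as high cost are assumptions.

```python
from enum import Enum
from typing import Optional


class Reversal(Enum):
    LOW = "low"
    HIGH = "high"


# Hypothetical classification of actions by how costly they are to undo.
ACTION_COST = {
    "read_file": Reversal.LOW,
    "draft_contract": Reversal.LOW,
    "send_email": Reversal.HIGH,
    "sign_contract": Reversal.HIGH,
}


def gate(action: str, approved_by: Optional[str] = None) -> bool:
    """Allow low-cost actions; require a named approver for everything else."""
    cost = ACTION_COST.get(action, Reversal.HIGH)  # unknown actions are gated
    if cost is Reversal.LOW:
        return True
    return approved_by is not None


print(gate("draft_contract"))                      # True
print(gate("sign_contract"))                       # False
print(gate("sign_contract", approved_by="j.doe"))  # True
```

Defaulting unlisted actions to high cost is one way to keep the bar low: a new tool is gated until someone classifies it.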

Layer 3: Output verification

Before output reaches a customer, an independent check runs against it. Not the same model that produced it — that is self-verification, and self-verification on hard problems is theatre. Use a different model, a deterministic check, or a structured rule. The verifier doesn't have to catch everything; it has to catch the cheap mistakes the producer agent makes when distracted by the hard ones.
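As one illustration of a cheap deterministic check, the sketch below flags dollar amounts that appear in a drafted reply but not in the source record it was written from. The function name, the amount pattern, and the example strings are hypothetical, and a real check would depend on the domain.

```python
import re

AMOUNT = re.compile(r"\$\d+(?:\.\d{2})?")  # matches $120 or $120.00


def unsupported_amounts(draft: str, source: str) -> list:
    """Return dollar amounts that appear in the draft but not in the source."""
    in_source = set(AMOUNT.findall(source))
    return [amount for amount in AMOUNT.findall(draft) if amount not in in_source]


source_record = "Invoice total $120.00, of which $20.00 has been paid."
print(unsupported_amounts("Your invoice total is $120.00.", source_record))  # []
print(unsupported_amounts("You still owe $110.00.", source_record))          # ['$110.00']
```

A check like this knows nothing about whether the reply is well reasoned; it only catches a figure the producer invented, which is the kind of cheap mistake this layer exists for.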

Layer 4: Incident replay

When something goes wrong in production, a human needs to be able to reproduce the agent's exact behavior — same context, same tool calls, same model version, same intermediate outputs. Without replay, postmortems become guessing. Build replay infrastructure on day one. The first incident you can't replay is the moment leadership stops trusting the system.
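A replay record needs to capture the four things listed above and make a mismatch easy to locate. The following is a minimal sketch under assumed names (`RunRecord`, `first_divergence`) and an assumed shape for tool calls; it does not re-run a model, it only compares a recorded run with a replayed one.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field


@dataclass
class RunRecord:
    model_version: str
    context: str
    tool_calls: list = field(default_factory=list)            # dicts: tool, args, result
    intermediate_outputs: list = field(default_factory=list)  # one string per stage

    def fingerprint(self) -> str:
        """Stable hash of the whole run, for a quick 'did the replay match?' check."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def first_divergence(recorded: RunRecord, replayed: RunRecord):
    """Index of the first intermediate output that differs, or None if all match."""
    a, b = recorded.intermediate_outputs, replayed.intermediate_outputs
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index
    return None if len(a) == len(b) else min(len(a), len(b))


calls = [{"tool": "search", "args": {"q": "refund policy"}, "result": "doc-7"}]
recorded = RunRecord("model-v1", "user asked about refunds", calls, ["doc-7", "summary", "draft"])
replayed = RunRecord("model-v1", "user asked about refunds", calls, ["doc-7", "other", "draft"])
print(first_divergence(recorded, replayed))  # 1
```

Knowing the index of the first divergence narrows a postmortem to one stage instead of the whole run.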

What to skip

A few patterns appear frequently in agent literature and rarely survive production. Be wary of:

Self-reflection loops. Agents asking themselves "are you sure?" produce little additional accuracy and a lot of additional latency. Use external verification.
Free-form agent-to-agent chat. Two agents negotiating in natural language is fragile and unobservable. Constrain inter-agent communication to typed messages with explicit schemas.
Plan caching across sessions. Plans look reusable until you realize each one was anchored on the specific context of one user's request. Cache primitives, not plans.
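For the second item, a typed message can be as simple as a frozen dataclass with a strict parser at the receiving agent's boundary. The `RetrievalRequest` message and its fields are hypothetical; the sketch only shows the shape of the constraint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalRequest:
    query: str
    max_documents: int


def parse_retrieval_request(payload: dict) -> RetrievalRequest:
    """Accept exactly the declared fields with the declared types, or raise."""
    expected = {"query", "max_documents"}
    if set(payload) != expected:
        raise ValueError(f"expected fields {sorted(expected)}, got {sorted(payload)}")
    if not isinstance(payload["query"], str):
        raise ValueError("query must be a string")
    if type(payload["max_documents"]) is not int or payload["max_documents"] < 1:
        raise ValueError("max_documents must be a positive integer")
    return RetrievalRequest(payload["query"], payload["max_documents"])


print(parse_retrieval_request({"query": "refund policy", "max_documents": 3}))
# RetrievalRequest(query='refund policy', max_documents=3)
```

Rejecting unexpected fields, not just missing ones, is what stops free-form text from leaking back in through an extra "notes" key.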

The default architecture

For most production agent systems below the foundation-model frontier, the default I reach for is: a router into a small set of specialist pipelines, with explicit human review at any customer-facing boundary, deterministic verifiers wherever the answer can be checked cheaply, and a single observable arbitration point per pipeline. Plan-and-execute is added only when the task genuinely requires open-ended reasoning, and always with a step budget. Debate patterns are reserved for high-stakes decisions where the cost of being wrong dominates the cost of running the comparison.

None of this is exotic. It is workmanlike. The teams that ship reliable agent systems are the ones treating orchestration as the actual product, not as the wiring around the model.

Closing

"Agent" is a useful word at the level of a marketing page. It is a misleading word at the level of system design. What you are actually building is an orchestration with arbitration and safety layers, and the agents are interchangeable components inside it. Design the orchestration first. Pick the agents to fit it. The systems that survive contact with production are the ones built in this order.