Beyond the Prompt: How an Adversarial AI Council Solved the Hallucination Problem

I asked Claude about European food and it told me about pasta.

That’s not an exaggeration. I typed a careful prompt about building a culinary taxonomy for 18 countries, and the response read like the back of a tourist brochure. France makes baguettes. Spain has tapas. Italy does pizza. The kind of answer you’d get from someone who learned about food from an airport gift shop.

I was treating a single AI like a magic box. Put a question in, get an answer out. So I tried something different. I built a council.

Five agents arguing about food

Claude Code recently shipped agent teams, an experimental feature where multiple Claude instances work as a coordinated team instead of a single session. Each teammate gets its own context window, its own tools, and the ability to message other teammates directly. There’s a lead who coordinates, a shared task list, and a messaging system that lets agents challenge each other’s work in real time.

That last part is what matters here. Traditional subagents report results back to a parent and never talk to each other. Agent teams talk to each other. They argue.

I set up five teammates, all running Claude 3 Opus. Dr. Marchand covered Romance-language countries. Dr. Finch handled the British Isles. Dr. Solberg took the Nordics. Dr. Lehner covered the Germanic bloc. And Prof. Costas, the moderator, kept them honest. The main Claude Code session acted as team lead and spawned all five.

Each agent got a full academic identity, not “you are a food expert” but a backstory with real weight. Dr. Solberg grew up on a farm above the Arctic Circle, studied microbiology at Oslo, and did research at the Nordic Food Lab with René Redzepi. Dr. Lehner was raised in Basel at the intersection of French, German, and Swiss culinary traditions. These histories steer the model into deeper regions of its latent knowledge, the part where Nordic fermentation science or Alpine cross-border cuisine actually lives. Without that grounding, five agents produce five versions of the same Wikipedia summary. With it, they bring different bodies of knowledge to the table, and that’s what makes the arguments real.

The first round was research: each specialist claimed tasks from the shared list, investigated their countries, and posted their assessments. The second round was cross-examination, where Costas directed the teammates to challenge each other’s findings through direct messages.

This is where the agent teams feature earned its keep. Dr. Marchand messaged Finch directly to challenge his rating of the UK as “Very Large” in dish depth. Her argument was specific: British food has regional charm, but it doesn’t have the sub-categories that define the top tier. Where’s the patisserie tradition? The charcuterie? The cheese taxonomy? France and Spain have those layers. The UK doesn’t. Finch had to go back and find better evidence or accept the downgrade.

With subagents, that exchange can’t happen. Each agent does its work and reports back to the parent. There’s no mechanism for one specialist to confront another. Agent teams gave me a room where five experts could argue. The argument is where the real accuracy came from.

The session retrospective captured it: “The most productive dynamic was the tension between specialists… each specialist has deep knowledge of their bloc but limited ability to calibrate against others without structured comparison.”

The Lake Wobegon round

Every specialist came in hot. In the first round, they all rated their countries as “Substantial” or “Enormous” in depth. This is the Lake Wobegon problem: every expert thinks their domain is above average.

Prof. Costas fixed this in Round 3 with anchor examples. She forced the agents to agree on what the top tier actually looks like before anyone could claim to be in it. France, Italy, and Spain (countries with more than 1,000 genuinely distinct documented dishes) became the anchors for “Very High.” Once that baseline existed, the UK and Germany dropped to “High.” The Nordic countries settled at “Medium.”

Without a moderator agent acting as a calibration layer, every specialist would have graded their own homework and given themselves an A.
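The anchoring step is easy to picture in code. What follows is a minimal sketch, not the council's actual logic: only the 1,000-dish anchor for "Very High" comes from the session. The lower cutoffs, the `calibrated_tier` and `recalibrate` helpers, and the dish counts are hypothetical placeholders.

```python
# Tier cutoffs, checked from the top down. Only the 1,000-dish anchor for
# "Very High" comes from the session; the lower cutoffs are assumed.
TIERS = [
    ("Very High", 1000),
    ("High", 400),    # hypothetical cutoff
    ("Medium", 150),  # hypothetical cutoff
]


def calibrated_tier(distinct_dishes: int) -> str:
    """Return the highest tier whose cutoff the dish count exceeds."""
    for tier, cutoff in TIERS:
        if distinct_dishes > cutoff:
            return tier
    return "Low"


def recalibrate(claimed: dict, dish_counts: dict) -> dict:
    """Pair each specialist's claimed tier with the anchored one."""
    return {
        country: (claimed[country], calibrated_tier(dish_counts[country]))
        for country in claimed
    }


# Placeholder numbers, not council data.
claimed = {"Country A": "Enormous", "Country B": "Enormous"}
dish_counts = {"Country A": 1200, "Country B": 500}
print(recalibrate(claimed, dish_counts))
```

The point of the sketch is the ordering: the cutoffs are fixed first, and only then does anyone's self-rating get compared against them.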

Germany’s 3,200 breads

The agents turned up a gap I hadn’t expected.

I had the agents evaluate each country on two axes: culinary depth and search demand. Most countries lined up roughly where you’d expect. France scores high on both, for instance. But Germany and Belgium broke the pattern.

Germany has a UNESCO-recognized bread culture with over 3,200 registered varieties. More than 1,500 types of sausage. Belgium’s Michelin stars per capita (11.3 per million) actually exceed France’s (9.4 per million). But both countries scored a 2 out of 5 on search demand. English-language cookbook sales, restaurant coverage, food media attention. None of it reflects what’s actually there. The stereotype of “beer and bratwurst” is doing real damage to the data.

A single researcher might miss this because the gap exists between two different kinds of evidence. One agent knows the culinary depth. Another tracks the market signals. The mismatch only becomes visible when you force them to look at the same country from different angles.
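Once both axes sit in the same record, the mismatch check itself is trivial. Here is a rough sketch under stated assumptions: the 2-out-of-5 demand scores for Germany and Belgium come from the session, while the 1-to-5 depth scale, the depth values, France's exact scores, and the `undervalued` helper are my own illustration.

```python
from dataclasses import dataclass


@dataclass
class CountryScore:
    name: str
    depth: int   # culinary depth on an assumed 1-5 scale
    demand: int  # search demand, 1-5


def undervalued(scores, min_gap=2):
    """Return countries whose depth exceeds demand by at least min_gap."""
    return [s.name for s in scores if s.depth - s.demand >= min_gap]


scores = [
    CountryScore("France", depth=5, demand=5),   # both values assumed ("high on both")
    CountryScore("Germany", depth=4, demand=2),  # demand from the session, depth assumed
    CountryScore("Belgium", depth=4, demand=2),  # demand from the session, depth assumed
]
print(undervalued(scores))
```

The hard part is not the subtraction. It is getting two independent sources of evidence to score the same country at all.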

The Julia Child argument

The council hit a deadlock on France’s “culinary authority,” the voice that would represent French cuisine on the platform. Dr. Marchand wanted Jacques Pépin. He’s a native. He cooked for three French heads of state. His technical credentials are untouchable.

But Costas pushed back. This is an English-language platform. Who translates French technique into something an English-speaking home cook can actually use?

They picked Julia Child. Pépin became the secondary authority.

This one stuck with me. It wasn’t really a debate about chefs. It was about what you’re optimizing for. If you want technical purity, you pick the native expert. If you want someone your users will learn from, you pick the translator. The council chose the teacher over the purist, and the reasoning was sound. Child’s gift was making complex French methods feel possible in an American kitchen. That’s a UX decision, not a culinary one.

Finch’s dissent on Ireland

In the final calibration, Dr. Finch argued that Ireland’s search demand score was too low. The council voted him down. Moderate demand, they said.

Costas recorded his dissent anyway. The system was designed to preserve minority opinions as a fail-safe: not to overrule the majority, but to flag the dissent for additional research.

Minutes later, a follow-up research round surfaced the evidence. 3.32% of all U.S. restaurants now feature Irish dishes. The Wexford in Savannah was named USA Today’s 2025 Best Restaurant. Nathan Anthony’s “Bored of Lunch” cookbooks were selling across the English-speaking world.

Costas flipped the score to “Strong.”

Majority-rules would have buried this. The average score would have been wrong about Ireland, and we’d never have known until the market data made it obvious. Finch’s dissent was the early signal, and the system was built to keep it on the record.
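Keeping dissent on the record takes very little machinery. The sketch below is illustrative only: the `Decision` class and `decide` function are hypothetical names, and the individual votes are assumed (the session tells us only that Finch was outvoted and that the council settled on moderate demand).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Decision:
    topic: str
    score: str
    # agent name -> the score that agent argued for instead
    dissents: dict = field(default_factory=dict)

    @property
    def needs_follow_up(self) -> bool:
        """Any recorded dissent flags the topic for more research."""
        return bool(self.dissents)


def decide(topic: str, votes: dict) -> Decision:
    """Adopt the most common score, but keep every minority vote."""
    winner, _ = Counter(votes.values()).most_common(1)[0]
    dissents = {agent: s for agent, s in votes.items() if s != winner}
    return Decision(topic, winner, dissents)


# Assumed votes for illustration.
votes = {
    "dr-elodie-marchand": "Moderate",
    "dr-alistair-finch": "Strong",
    "dr-ingrid-solberg": "Moderate",
    "dr-matthias-lehner": "Moderate",
}
decision = decide("Ireland search demand", votes)
print(decision.score, decision.dissents, decision.needs_follow_up)
```

A plain average or majority vote would return only the winning score. Here the losing argument travels with the decision, so a follow-up round has something to check.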

I went in trying to build a food taxonomy. What I got was a different way to think about AI altogether.

A single prompt gets you a single answer and no way to know if it’s the real thing or the airport gift shop version. The council gave me something different: five competing arguments, forced calibration, and a record of every dissent. I can actually check that work.

Which version would you trust?

The prompt

Here’s the orchestration prompt I used to run the council. It references a 500-line design document that contains the real engineering: full academic backstories for each agent (university affiliations, published works, known biases), a structured three-round deliberation protocol, self-calibrating scoring tiers where the council defines its own boundaries rather than using pre-set scales, consensus rules (3-of-4 agreement passes, moderator breaks ties), cross-border rules (culinary affinity overrides geography, so Alsace goes to the Germanic specialist, not the French one), and a mandatory session retrospective where the moderator reflects on what worked and what didn’t.
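Two of those rules are compact enough to sketch. This is my reading of them, not the design document's text: I am assuming that anything short of 3-of-4 agreement counts as a tie for the moderator to break, and `resolve` and `assign_specialist` are hypothetical helper names.

```python
from collections import Counter
from typing import Optional


def resolve(specialist_votes: dict, moderator_choice: str, needed: int = 3) -> str:
    """Pass the leading option at 3-of-4; otherwise the moderator decides."""
    option, count = Counter(specialist_votes.values()).most_common(1)[0]
    return option if count >= needed else moderator_choice


def assign_specialist(home_bloc: str, affinity_bloc: Optional[str] = None) -> str:
    """Culinary affinity, when present, overrides geography."""
    return affinity_bloc or home_bloc


# Three of four specialists agree, so the moderator's view is not needed.
print(resolve({"a": "High", "b": "High", "c": "High", "d": "Very High"}, "Very High"))

# A 2-2 split falls to the moderator.
print(resolve({"a": "High", "b": "High", "c": "Medium", "d": "Medium"}, "Medium"))

# Alsace sits in France but is routed to the Germanic specialist.
print(assign_specialist("Romance & Mediterranean", "Germanic & Low Countries"))
```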

The prompt below is the execution instruction that kicks off the session. The design document is where the orchestration logic lives.

# Execute: European Culinary Council — Taxonomy Deliberation

## Architecture: Agent TEAM (not sub-agents)

> **CRITICAL**: This is a **deliberative agent team**. All 5 agents are spawned
> as **standalone teammates** by the main session (you, the team lead). No agent
> spawns other agents. Teammates communicate with each other via **SendMessage**.
> Work phases are coordinated via **TaskCreate/TaskUpdate**. This is a flat peer
> architecture — NOT a hierarchy where one agent spawns sub-agents.

## Your Task

You are executing a multi-agent deliberation to produce a European cuisine
taxonomy. The complete design document is at:

**`docs/plans/2026-03-18-european-cuisine-taxonomy-design.md`**

Read it in full before doing anything. It is the single source of truth for
this session.

## Execution Steps

### Step 1: Read the Design Document

Read `docs/plans/2026-03-18-european-cuisine-taxonomy-design.md` completely.

### Step 2: Create the Working Directory

`mkdir -p docs/plans/european-cuisine-council`

### Step 3: Create the Team

Create the team `european-culinary-council` using TeamCreate.

### Step 4: Create Tasks for Each Round

Use TaskCreate to define the deliberation phases:

- Task 1: "Round 1 — Research & Present" (no dependencies)
- Task 2: "Round 2 — Cross-Examination" (blocked by Task 1)
- Task 3: "Round 3 — Calibration & Consensus" (blocked by Task 2)
- Task 4: "Write final YAML" (blocked by Task 3)
- Task 5: "Write session retrospective" (blocked by Task 4)

### Step 5: Spawn ALL 5 Agents as Standalone Teammates

YOU are the team lead. You spawn all 5 agents directly using the Agent tool
with team_name: "european-culinary-council". Each agent is a standalone
teammate with its own context window. No agent spawns other agents.

Spawn all 5 in parallel:

1. Prof. Helena Costas (Moderator) — name: prof-helena-costas
2. Dr. Élodie Marchand (Romance & Mediterranean) — name: dr-elodie-marchand
3. Dr. Alistair Finch (British Isles) — name: dr-alistair-finch
4. Dr. Ingrid Solberg (Nordic) — name: dr-ingrid-solberg
5. Dr. Matthias Lehner (Germanic & Low Countries) — name: dr-matthias-lehner

Each specialist's prompt must include:
- Their full persona from the design doc
- Their assigned countries
- Instructions to use search-hub:search skill for research
- Instructions to send findings to prof-helena-costas via SendMessage when done
- Their teammate names so they know who else is on the council

The moderator's prompt must include:
- Her full persona from the design doc
- The complete deliberation protocol (Rounds 1-3, consensus rules, cross-border
  rule)
- The YAML schema and field definitions
- Instructions that she will RECEIVE research via SendMessage from the 4
  specialists
- She does NOT spawn anyone — she waits for messages, synthesizes, and
  coordinates
- She writes all intermediate artifacts and the final YAML
- She uses SendMessage to send cross-examination questions back to specialists
- She uses SendMessage to drive Round 3 calibration with the full council

### Step 6: Round 1 — Research & Present

The 4 specialists research their countries using search-hub and send findings
to Prof. Helena Costas via SendMessage.

Helena compiles -> writes docs/plans/european-cuisine-council/round-1-research.md

### Step 7: Round 2 — Cross-Examination

Helena sends the compiled Round 1 research to all 4 specialists via
SendMessage and asks them to cross-examine each other's findings.

Specialists send challenges/responses back to Helena (and directly to each
other).

Helena compiles -> writes
docs/plans/european-cuisine-council/round-2-cross-examination.md

### Step 8: Round 3 — Calibration & Consensus

Helena sends the side-by-side country list to all specialists and drives
consensus on scoring tiers via SendMessage exchanges.

Helena compiles -> writes
docs/plans/european-cuisine-council/round-3-calibration.md

### Step 9: Final Output

Helena writes:
- docs/taxonomy/european-cuisine-taxonomy.yaml
- docs/plans/european-cuisine-council/session-retrospective.md

## Communication Pattern

Team Lead (you)
  ├── spawns -> prof-helena-costas (moderator)
  ├── spawns -> dr-elodie-marchand (Romance & Med)
  ├── spawns -> dr-alistair-finch (British Isles)
  ├── spawns -> dr-ingrid-solberg (Nordic)
  └── spawns -> dr-matthias-lehner (Germanic & Low Countries)

Specialists --SendMessage--> prof-helena-costas (deliver research)
prof-helena-costas --SendMessage--> specialists (cross-examination questions)
Specialists --SendMessage--> each other (direct debate)
prof-helena-costas --SendMessage--> specialists (calibration requests)

NO agent spawns other agents. All communication is via SendMessage.

## Rules

- Do not summarize the design doc back to me. Read it and execute.
- Do not ask for approval between rounds. The moderator has full authority.
- All search-hub usage is unrestricted.
- Italy and Portugal entries are locked. Pre-fill exactly as specified in the
  design doc.
- The session retrospective is mandatory.
- Cross-border regions follow culinary affinity, not geography.
- Intermediate artifacts -> docs/plans/european-cuisine-council/
- Final YAML -> docs/taxonomy/

## When You're Done

Tell me:
1. How many cuisine entries are in the final YAML
2. Any countries the council added beyond the original ~18
3. The most contested debate
4. Where to find the files

Go.
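A note outside the prompt itself: the "blocked by" chain in Step 4 is a small dependency graph, and it can be sanity-checked before a session starts. This sketch uses Python's standard `graphlib` module and has nothing to do with how Claude Code tracks tasks internally. The short task keys are my own abbreviations of the five task names.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it is blocked by, as listed in Step 4.
# The keys are abbreviated labels, not the literal task titles.
blocked_by = {
    "round-1-research": set(),
    "round-2-cross-examination": {"round-1-research"},
    "round-3-calibration": {"round-2-cross-examination"},
    "final-yaml": {"round-3-calibration"},
    "retrospective": {"final-yaml"},
}

# static_order() raises graphlib.CycleError if the dependencies loop.
order = list(TopologicalSorter(blocked_by).static_order())
print(order)
```

Because the chain is strictly linear, there is only one valid order, which is why the rounds cannot overlap.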