Multi-Model Consensus

4 LLM providers × 2-stage deliberation: a ~28% disagreement rate surfaces errors that a single model misses

Problem

What doesn't work

When a single LLM makes critical business decisions, errors are inevitable: hallucinations, bias toward its own patterns, and no self-verification. In an A/B test on 150 decisions, the single-model setup produced 22% false approvals, so roughly one decision in five was approved in error.

Solution

Architectural approach

A panel of 4 independent providers (OpenAI o4-mini, Claude Opus + thinking, Gemini 2.5 Pro + thinking, DeepSeek Reasoner) evaluates each decision in parallel. A decision is approved only when at least 3 of the 4 models vote for it (quorum ≥3/4). In round 2, the models see each other's arguments and can refine their positions.
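The quorum rule can be sketched in a few lines. This is a minimal illustration, not the production code: the function name `decide`, the provider keys, and the choice to treat anything short of quorum as REJECTED are assumptions.

```python
QUORUM = 3  # approvals required from the 4-provider panel (>=3/4, per the case study)


def decide(approvals: dict[str, bool], quorum: int = QUORUM) -> str:
    """Return "APPROVED" when at least `quorum` providers voted to approve.

    Assumption: anything short of quorum counts as "REJECTED".
    """
    yes_votes = sum(approvals.values())  # True counts as 1, False as 0
    return "APPROVED" if yes_votes >= quorum else "REJECTED"


# Hypothetical panel result: three approvals and one rejection.
panel = {"openai": True, "anthropic": True, "gemini": False, "deepseek": True}
outcome = decide(panel)
```

A 2-2 split falls below the quorum, so under this assumption it resolves to REJECTED rather than to a tie.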

Challenges

What made it hard

Each LLM provider returns responses in its own format, so unified parsing took more time than the consensus algorithm itself. The cost of 4 parallel calls required careful optimization: cheap models for volume, expensive ones for critical decisions. The latency of 4 parallel API calls is unpredictable, so the system needs graceful degradation when a provider times out.
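One way to approach unified parsing is to reduce every provider's answer to plain text first and then extract a single JSON verdict from it. The sketch below assumes a hypothetical schema with "verdict" and "reasoning" keys; the `Vote` type and `parse_vote` function are illustrative names, not the project's actual code.

```python
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    provider: str
    approve: bool
    reasoning: str


# Greedy match from the first "{" to the last "}", so prose or code fences
# wrapped around the JSON object are ignored.
_JSON_OBJECT = re.compile(r"\{.*\}", re.DOTALL)


def parse_vote(provider: str, raw_text: str) -> Vote:
    """Normalize one provider's text answer into a Vote (hypothetical schema)."""
    match = _JSON_OBJECT.search(raw_text)
    if match is None:
        raise ValueError(f"{provider}: no JSON object found in response")
    payload = json.loads(match.group(0))
    verdict = str(payload.get("verdict", "")).strip().lower()
    if verdict not in {"approve", "reject"}:
        raise ValueError(f"{provider}: unexpected verdict {verdict!r}")
    return Vote(provider, verdict == "approve", str(payload.get("reasoning", "")))


# Example: a response with chatter around the JSON object.
raw = 'Sure. {"verdict": "REJECT", "reasoning": "missing tests"} Hope this helps.'
vote = parse_vote("deepseek", raw)
```

Raising `ValueError` on malformed output lets the caller treat an unparseable answer the same way as a failed provider.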

Role

My role & contribution

Architect & sole developer

Designed and implemented the entire architecture: selection of 4 providers with different strengths (reasoning tokens, thinking blocks), the ≥3/4 quorum protocol, 2-stage deliberation with argument exchange, and MIN_PROVIDERS=2 fault tolerance. Ran an A/B test of single-model vs consensus on 150 decisions: false approvals dropped from 22% to 13%.

Architecture

System architecture

[Diagram: Multi-Model Consensus Protocol. OpenAI o4-mini, Claude Opus, Gemini 2.5 Pro, and DeepSeek R1 are called in parallel via asyncio.gather(). Each returns a vote to the Aggregator, which counts votes. Quorum 3/4 decides APPROVED or REJECTED. Round 2: models see each other's arguments.]

Implementation

How it works

The 4 APIs are called asynchronously via asyncio.gather(). Each model leverages its own strengths (reasoning tokens, thinking blocks). Results are aggregated as long as at least MIN_PROVIDERS=2 providers respond, which gives fault tolerance when a provider fails or times out. Deliberation has 2 stages: independent votes in round 1, argument exchange in round 2.
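A minimal sketch of the parallel call with graceful degradation, using stub coroutines in place of real API clients. Only `asyncio.gather()` and `MIN_PROVIDERS=2` come from the case study; `collect_votes`, the per-provider timeout, and the stub providers are assumptions for illustration. How the minimum-provider threshold interacts with the quorum rule is not specified here, so the sketch leaves that to the caller.

```python
import asyncio
from typing import Awaitable, Callable

MIN_PROVIDERS = 2  # from the case study: fewer responses than this is a failure

ProviderCall = Callable[[str], Awaitable[bool]]  # hypothetical adapter: prompt -> approve?


async def collect_votes(
    prompt: str, providers: dict[str, ProviderCall], timeout: float
) -> dict[str, bool]:
    """Call all providers in parallel and keep only the ones that answered in time."""
    names = list(providers)
    results = await asyncio.gather(
        *(asyncio.wait_for(providers[name](prompt), timeout) for name in names),
        return_exceptions=True,  # a timeout or API error must not cancel the others
    )
    votes = {
        name: result
        for name, result in zip(names, results)
        if not isinstance(result, BaseException)
    }
    if len(votes) < MIN_PROVIDERS:
        raise RuntimeError(f"only {len(votes)} provider(s) responded")
    return votes


# Stub providers standing in for real API clients.
async def fast_approve(prompt: str) -> bool:
    await asyncio.sleep(0.01)
    return True


async def too_slow(prompt: str) -> bool:
    await asyncio.sleep(0.5)  # exceeds the timeout used below
    return True


async def broken(prompt: str) -> bool:
    raise ConnectionError("provider unavailable")


stubs = {"openai": fast_approve, "anthropic": fast_approve,
         "gemini": too_slow, "deepseek": broken}
votes = asyncio.run(collect_votes("Ship release?", stubs, timeout=0.05))
```

With `return_exceptions=True`, a timed-out or failing provider shows up as an exception object in the results list instead of aborting the whole round, so the two healthy stubs still produce a usable result.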

Architecture Decision

Why this way

2-stage deliberation instead of 1-pass vote

Alternative

Parallel voting without argument exchange (as in v1)

Why it didn't fit

1-pass: models vote in a vacuum. If Gemini, for example, rejects because of a false positive, it has no chance to reconsider. 2-stage: models see each other's arguments and can refine their votes.
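The argument exchange can be sketched as a second prompt that shows each model its peers' round-1 votes. Everything here is illustrative: the prompt wording, the `ask` adapter, and the stub reviewer are assumptions, and the real system runs its calls asynchronously rather than in a loop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Vote:
    provider: str
    approve: bool
    reasoning: str


Ask = Callable[[str, str], Vote]  # hypothetical adapter: (provider, prompt) -> Vote


def build_round2_prompt(question: str, me: str, round1: list[Vote]) -> str:
    """Show one provider the round-1 votes of every other provider."""
    peer_lines = [
        f"- {v.provider}: {'APPROVE' if v.approve else 'REJECT'} ({v.reasoning})"
        for v in round1
        if v.provider != me
    ]
    return (
        f"Decision under review:\n{question}\n\n"
        "Other reviewers voted as follows:\n" + "\n".join(peer_lines) + "\n\n"
        "Reconsider your position and give your final verdict."
    )


def deliberate(question: str, providers: list[str], ask: Ask) -> list[Vote]:
    """Round 1: independent votes. Round 2: re-vote after seeing peers' arguments."""
    round1 = [ask(p, question) for p in providers]
    return [ask(p, build_round2_prompt(question, p, round1)) for p in providers]


# Stub reviewer: "gemini" rejects on its own but approves once it sees peers.
def stub_ask(provider: str, prompt: str) -> Vote:
    saw_peers = "Other reviewers" in prompt
    if provider == "gemini" and not saw_peers:
        return Vote(provider, False, "possible data leak")
    return Vote(provider, True, "change is covered by tests")


final_votes = deliberate(
    "Ship release 1.2?", ["openai", "anthropic", "gemini", "deepseek"], stub_ask
)
```

Each provider's own round-1 vote is left out of its round-2 prompt in this sketch, so the model weighs its peers' arguments rather than restating its own.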

Result

Fewer false rejections, higher decision quality. Cost: +1 API round (~$0.02)

Metrics

Results

01. A/B test: 150 decisions, false approvals 22% → 13%
02. Disagreement rate between models: ~28%
03. Quorum ≥3/4 for approval
04. $0.01–0.05 per decision (4 models in parallel)
05. Used in board-svc (go/no-go) and review-svc (code review)

Business Impact

Impact on business

In the A/B test on 150 decisions, false approvals dropped from 22% to 13%. Models disagreed on ~28% of decisions: these are the cases where at least one model flags a problem that the others do not, which a single-model setup would never surface. Consensus costs $0.01–0.05 per decision.
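For reference, both headline numbers reduce to simple arithmetic. The sketch below assumes that "disagreement" means the round-1 votes on a decision were not unanimous; that definition and the toy vote data are illustrative, not the study's dataset.

```python
def disagreement_rate(decisions: list[list[bool]]) -> float:
    """Share of decisions whose votes were not unanimous (assumed definition)."""
    if not decisions:
        raise ValueError("no decisions to evaluate")
    split = sum(1 for votes in decisions if len(set(votes)) > 1)
    return split / len(decisions)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop of a rate, e.g. false approvals before vs after."""
    return (before - after) / before


# Toy data: four decisions, one of them with a dissenting vote.
toy = [
    [True, True, True, True],
    [True, True, False, True],
    [False, False, False, False],
    [True, True, True, True],
]
toy_rate = disagreement_rate(toy)  # 1 split decision out of 4

# Using the reported figures: 22% -> 13% false approvals.
reduction = relative_reduction(0.22, 0.13)  # 9/22, about 0.41
```

On the reported figures, the drop from 22% to 13% is 9 percentage points, or about a 41% relative reduction in false approvals.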

Methods

Algorithms & patterns

  • Multi-annotator voting
  • LLM-as-Judge
  • Structured JSON output
  • Multi-provider consensus
  • A/B Testing (single vs consensus)
  • asyncio.gather()

Stack

Technologies

  • Python
  • asyncio
  • OpenAI API
  • Anthropic API
  • Google GenAI
  • DeepSeek API

Ready to discuss?

If you need an architect who builds autonomous AI systems — reach out.

Serbia-based · CET/CEST timezone · EU-aligned working hours · International contracts experience