Multi-Model Consensus
4 LLM providers × 2-stage deliberation: a ~28% disagreement rate flags errors that a single model misses
What doesn't work
When a single LLM makes critical business decisions, errors are inevitable: hallucinations, bias toward its own patterns, and no self-verification. In an A/B test on 150 decisions, the single-model setup produced 22% false approvals, roughly one decision in five.
Architectural approach
A panel of 4 independent providers (OpenAI o4-mini, Claude Opus + thinking, Gemini 2.5 Pro + thinking, DeepSeek Reasoner) evaluates each decision in parallel. Approval requires a quorum of ≥3/4. In round 2, models see each other's arguments and refine their positions.
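The quorum rule can be illustrated with a minimal sketch. The function name `decide`, the provider keys, and the `"approve"` / `"reject"` labels are assumptions made for illustration; the text only states that approval needs at least 3 of 4 votes.

```python
QUORUM = 3  # approvals required, per the ">=3/4" rule described above


def decide(votes: dict[str, str], quorum: int = QUORUM) -> str:
    """Return "approve" only if at least `quorum` providers voted approve.

    Assumption: anything short of the quorum counts as a rejection.
    """
    approvals = sum(1 for verdict in votes.values() if verdict == "approve")
    return "approve" if approvals >= quorum else "reject"


def is_split(votes: dict[str, str]) -> bool:
    """True when the providers did not all return the same verdict."""
    return len(set(votes.values())) > 1


votes = {
    "openai": "approve",
    "claude": "approve",
    "gemini": "reject",
    "deepseek": "approve",
}
print(decide(votes))    # approve
print(is_split(votes))  # True
```

A split vote like this one is the kind of case a disagreement-rate metric would count.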
What made it hard
Each LLM provider returns responses in its own format, so unified parsing took more time than the consensus algorithm itself. The cost of 4 parallel calls required careful optimization: cheap models for volume, expensive ones for critical decisions. The latency of 4 parallel APIs is unpredictable, so the system needs graceful degradation when one provider times out.
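One way to approach unified parsing is to normalize every provider's raw text into a single record type. The sketch below is hypothetical: the `Vote` dataclass, the `verdict` / `argument` field names, and the JSON-then-keyword strategy are assumptions, not the project's actual parser.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class Vote:
    provider: str
    verdict: str   # "approve" or "reject"
    argument: str


def parse_vote(provider: str, raw: str) -> Vote:
    """Normalize one provider's raw text into a Vote (illustrative only)."""
    # First attempt: a JSON object embedded somewhere in the reply.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            verdict = str(data.get("verdict", "")).strip().lower()
            if verdict in ("approve", "reject"):
                return Vote(provider, verdict, str(data.get("argument", "")))
        except json.JSONDecodeError:
            pass  # not valid JSON, fall through to the keyword search
    # Fallback: a bare keyword in free-form text (naive, ignores negation).
    keyword = re.search(r"\b(approve|reject)\b", raw, flags=re.IGNORECASE)
    if keyword:
        return Vote(provider, keyword.group(1).lower(), raw.strip())
    raise ValueError(f"{provider}: no verdict found")


print(parse_vote("a", 'Sure! {"verdict": "Approve", "argument": "tests pass"}'))
print(parse_vote("b", "I would REJECT this change."))
```

Raising on an unparseable reply lets the caller treat it like any other provider failure.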
My role & contribution
Architect & sole developer
Designed and implemented the entire architecture: selection of 4 providers with different strengths (reasoning tokens, thinking blocks), the ≥3/4 quorum protocol, 2-stage deliberation with argument exchange, and MIN_PROVIDERS=2 fault tolerance. Conducted an A/B test on 150 decisions, single-model vs consensus: false approvals dropped from 22% to 13%.
How it looks
Real screenshots
System architecture
How it works
The 4 APIs are called concurrently via asyncio.gather(). Each model leverages its strengths (reasoning tokens, thinking blocks). Results are aggregated with a MIN_PROVIDERS=2 threshold for fault tolerance. Deliberation has 2 stages: independent votes in round 1, argument exchange in round 2.
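The parallel call pattern with fault tolerance might look roughly like the sketch below. Only `asyncio.gather()` and `MIN_PROVIDERS=2` come from the text; `fake_provider`, `collect_votes`, and the timeout value are hypothetical stand-ins, and real API clients would replace the sleep-based stub.

```python
import asyncio

MIN_PROVIDERS = 2   # from the text: minimum number of responses to proceed
TIMEOUT_S = 0.2     # hypothetical per-provider timeout, kept short for the demo


async def fake_provider(name: str, delay: float, verdict: str) -> tuple[str, str]:
    """Stand-in for a real API call; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    return name, verdict


async def collect_votes(calls) -> dict[str, str]:
    """Run all calls concurrently and drop any that fail or time out."""
    guarded = [asyncio.wait_for(call, timeout=TIMEOUT_S) for call in calls]
    # return_exceptions=True keeps one failure from cancelling the others.
    results = await asyncio.gather(*guarded, return_exceptions=True)
    votes = dict(r for r in results if not isinstance(r, BaseException))
    if len(votes) < MIN_PROVIDERS:
        raise RuntimeError(f"only {len(votes)} providers responded")
    return votes


async def main() -> dict[str, str]:
    calls = [
        fake_provider("openai", 0.01, "approve"),
        fake_provider("claude", 0.01, "approve"),
        fake_provider("gemini", 0.01, "reject"),
        fake_provider("deepseek", 5.0, "approve"),  # simulated slow provider
    ]
    return await collect_votes(calls)


votes = asyncio.run(main())
print(votes)
```

In this run the simulated slow provider exceeds the timeout and is dropped, so the remaining three votes are returned. How the ≥3/4 quorum is applied when fewer than 4 providers respond is not specified in the text, so the sketch leaves that decision to the caller.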
Why this way
2-stage deliberation instead of 1-pass vote
Alternative considered: parallel voting without argument exchange (as in v1)
1-pass: models vote in a vacuum. If Gemini rejects because of a false positive, it has no chance to reconsider. 2-stage: models see each other's arguments and can refine their positions.
Result: fewer false rejections and higher decision quality. Cost: +1 API round (~$0.02).
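The difference between the two approaches can be sketched with toy models. Everything here is illustrative: `deliberate`, the prompt wording, and the stub models are assumptions, and the loop is synchronous for brevity where the real system is described as running calls in parallel.

```python
from typing import Callable

# A "model" is any function: prompt -> (verdict, argument).
Model = Callable[[str], tuple[str, str]]


def deliberate(question: str, models: dict[str, Model]) -> dict[str, str]:
    """Two rounds: independent votes, then a revision after seeing peers."""
    # Round 1: each model answers without seeing anyone else.
    round1 = {name: model(question) for name, model in models.items()}
    final: dict[str, str] = {}
    for name, model in models.items():
        # Round 2: show this model the other models' verdicts and arguments.
        peers = "\n".join(
            f"- {other}: {verdict} ({argument})"
            for other, (verdict, argument) in round1.items()
            if other != name
        )
        prompt = (
            f"{question}\n\nOther reviewers said:\n{peers}\n\n"
            "Revise or keep your verdict."
        )
        verdict, _argument = model(prompt)
        final[name] = verdict
    return final


def steady(verdict: str) -> Model:
    """Toy model that always returns the same verdict."""
    return lambda prompt: (verdict, "no blocking issues found")


def swayable(prompt: str) -> tuple[str, str]:
    """Toy model: rejects on its own, approves once it sees peer arguments."""
    if "Other reviewers said" in prompt:
        return "approve", "peers addressed my concern"
    return "reject", "possible false positive"


models = {"a": steady("approve"), "b": steady("approve"),
          "c": steady("approve"), "d": swayable}
final = deliberate("Ship release?", models)
print(final["d"])  # approve
```

With a 1-pass vote, model `d` would have stayed at its initial rejection; the second round gives it a chance to reconsider.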
Results
- 01. A/B test: 150 decisions, false approvals 22% → 13%
- 02. Disagreement rate between models: ~28%
- 03. Quorum ≥3/4 for approval
- 04. $0.01–0.05 per decision (4 models in parallel)
- 05. Used in board-svc (go/no-go) and review-svc (code review)
Impact on business
In an A/B test on 150 decisions, false approvals dropped from 22% to 13%. Models disagreed in ~28% of cases, which are the cases where at least one model catches a problem the others miss. Consensus cost: $0.01–0.05 per decision.
Algorithms & patterns
Technologies
- Python
- asyncio
- OpenAI API
- Anthropic API
- Google GenAI
- DeepSeek API