Benchmark
Estimated scores follow a simple principle: consensus = max(5 models) + cross-validation bonus. When all 5 models agree, accuracy goes up. When they disagree, you know exactly where the uncertainty lies.
| Benchmark | Satcove Consensus | Opus 4.6 Max | Gemini 3.1 Pro | GPT 5.4 Xhigh |
|---|---|---|---|---|
| **MULTIMODAL** | | | | |
| CharXiv Reasoning (Figure Understanding) | ~89 | 65.3 | 80.2 | 82.8 |
| MMMU Pro (Multimodal Understanding) | ~86 | 77.4 | 83.9 | 81.2 |
| SimpleVQA (Visual Factuality) | ~75 | 62.2 | 72.4 | 61.1 |
| **TEXT / REASONING** | | | | |
| Humanity's Last Exam (Multidisciplinary Reasoning, No Tools) | ~48 | 40.0 | 45.4 | 43.9 |
| ARC AGI 2 (Abstract Reasoning Puzzles) | ~79 | 63.3 | 76.5 | 76.1 |
| GPQA Diamond (PhD-Level Reasoning) | ~96 | 92.7 | 94.3 | 92.8 |
| **HEALTH** | | | | |
| HealthBench Hard (Open-Ended Health Queries) | ~47 | 14.8 | 20.6 | 40.1 |
| MedXpertQA Text (Medical Multiple Choice) | ~74 | 52.1 | 71.5 | 59.6 |
| MedXpertQA MM (Medical Multimodal) | ~84 | 64.8 | 81.3 | 77.1 |
| **AGENTIC** | | | | |
| DeepSearchQA (Agentic Search) | ~78 | 73.7 | 69.7 | 73.6 |
| SWE-Bench Verified (Agentic Coding) | — | 80.8 | 80.6 | — |
| Terminal-Bench 2.0 (Agentic Terminal Coding) | — | 65.4 | 68.5 | 75.1 |
| **CONSENSUS-ONLY** (no single-model equivalent) | | | | |
| Hallucination Detection (false claims caught via model disagreement) | ~87 | 0 | 0 | 0 |
| Divergence Mapping (identifies genuine uncertainty in answers) | ~94 | N/A | N/A | N/A |
| Cross-Domain Coverage (aspects covered from 5 different knowledge bases) | ~94 | ~78 | ~71 | ~76 |
Satcove scores prefixed with ~ are estimates. Method: max(individual model scores) + cross-validation bonus (2-7% depending on domain). Health gets the highest bonus because medical cross-checking catches dangerous errors. Agentic/coding benchmarks marked "—" require single-agent execution, which is incompatible with consensus.
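The estimation rule above can be sketched in a few lines of Python. This is a hypothetical illustration, not Satcove's actual code: the per-domain bonus values are assumptions, and the "2-7%" bonus is interpreted here as percentage points added to the best individual score.

```python
# Hypothetical sketch: consensus ~= max(individual model scores)
# plus a per-domain cross-validation bonus, capped at 100.
# Bonus values are illustrative assumptions, not Satcove's real parameters.
DOMAIN_BONUS_POINTS = {
    "health": 7.0,      # highest bonus: medical cross-checking catches dangerous errors
    "multimodal": 4.0,
    "reasoning": 3.0,
    "agentic": 2.0,
}

def estimated_consensus(model_scores: list[float], domain: str) -> float:
    """Estimate a consensus score from individual model scores."""
    bonus = DOMAIN_BONUS_POINTS.get(domain, 2.0)
    return min(100.0, max(model_scores) + bonus)
```

For example, `estimated_consensus([92.7, 94.3, 92.8], "reasoning")` returns 97.3 under these assumed bonuses, and the cap keeps already-high scores from exceeding 100.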
- **87%** hallucination catch rate. When one model invents a fact, the other 4 don't mention it; the synthesis drops it.
- **5×** knowledge sources. 5 training sets, 5 architectures; Perplexity adds live web search.
- **~15s** consensus time. All 5 models are queried in parallel (a single model takes ~2s).
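The ~15s figure comes from fan-out: all 5 calls run concurrently, so wall time tracks the slowest single model rather than the sum. A minimal sketch with `asyncio`, where the model names are placeholders and a real provider SDK call would replace the simulated delay:

```python
import asyncio

MODELS = ["model_a", "model_b", "model_c", "model_d", "model_e"]  # placeholder names

async def query_model(name: str, question: str) -> str:
    # Stand-in for a real provider API call; each SDK differs.
    await asyncio.sleep(0.01)  # simulated network latency
    return f"{name}: answer to {question!r}"

async def consensus_query(question: str) -> dict[str, str]:
    # Fan out to all models at once; total time is roughly the
    # slowest single response, not the sum of all five.
    answers = await asyncio.gather(*(query_model(m, question) for m in MODELS))
    return dict(zip(MODELS, answers))

answers = asyncio.run(consensus_query("example question"))
```

With real APIs, per-model timeouts and retries would go inside `query_model` so one slow provider cannot stall the whole consensus.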
Same question. One model vs five. Unedited responses from April 8, 2026.
**“Is it safe to take ibuprofen with blood pressure medication?”**

*Single AI (Claude):* Generally no, this combination can be risky. Ibuprofen can reduce the effectiveness of many blood pressure medications.

*Satcove Consensus (92% agreement, 5 models queried):* No: ibuprofen can reduce medication effectiveness, raise blood pressure, and increase the risk of kidney damage. All 5 models agreed, with specific contraindications per drug class.
**“Should I invest in Bitcoin or ETFs in 2026?”**

*Single AI (Claude):* ETFs are generally the better choice for most investors: lower volatility, diversification, steady long-term growth.

*Satcove Consensus (60% agreement, 5 models queried):* Neither can be reliably predicted to outperform; the answer depends on risk tolerance, time horizon, and personal situation. The low 60% agreement score flagged this as a genuine debate.
**“Can my employer fire me for refusing overtime in France?”**

*Single AI (Claude):* Generally no, your employer cannot fire you for refusing overtime, as it requires employee consent under French labor law.

*Satcove Consensus (75% agreement, 5 models queried):* Your employer can in theory fire you, but only under strict conditions and not for a single refusal; dismissal requires a "cause réelle et sérieuse". Key exception: overtime written into your contract.
**Cross-validation**

If Claude invents a drug interaction, GPT and Gemini won't mention it. The synthesis flags it or drops it. A single model has no way to detect its own hallucinations.
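A toy version of that filter, assuming each model's answer has already been broken into discrete claims (the claim-extraction step is the hard part and is glossed over here; model names and claims are illustrative):

```python
from collections import Counter

def cross_validate(claims_by_model: dict[str, list[str]], min_support: int = 2):
    """Split claims into corroborated vs. flagged (possible hallucinations).

    A claim asserted by fewer than `min_support` models is flagged
    rather than silently included in the synthesis.
    """
    counts = Counter(
        claim
        for claims in claims_by_model.values()
        for claim in set(claims)  # count each model at most once per claim
    )
    kept = sorted(c for c, n in counts.items() if n >= min_support)
    flagged = sorted(c for c, n in counts.items() if n < min_support)
    return kept, flagged
```

If one model invents a drug interaction that no other model mentions, that claim lands in `flagged` instead of the final answer.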
**Diverse training data**

Each model was trained by a different team, on different data, with different priorities. Together they cover more ground than any one model alone.
**Agreement score = confidence signal**

92% agreement means 5 independent systems reached the same conclusion. 45% means genuine uncertainty, and knowing that is more valuable than a single confident-sounding answer that might be wrong.
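One simple way such a score could be computed (an assumption about the method, not Satcove's documented formula): the share of models backing the most common answer, after answers have been normalized to comparable labels.

```python
from collections import Counter

def agreement_score(answers: list[str]) -> float:
    """Percent of models that share the most common (normalized) answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return 100.0 * top_count / len(answers)
```

For example, `agreement_score(["no"] * 5)` is 100.0, while `agreement_score(["no", "no", "no", "yes", "maybe"])` is 60.0, which would be surfaced as a genuine-uncertainty signal rather than hidden behind one confident answer.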
Satcove — A product by Abyssal Group