Benchmark
Academic benchmark scores for individual models, sourced from published results, compared with Satcove consensus estimates. Satcove scores (marked ~) are calculated as max(individual model scores) + cross-validation bonus. Real session data appears below the table.
| Benchmark | Satcove Consensus | Claude Best | Gemini Pro | GPT Best |
|---|---|---|---|---|
| MULTIMODAL | | | | |
| CharXiv Reasoning: Figure Understanding | ~89 | 65.3 | 80.2 | 82.8 |
| MMMU Pro: Multimodal Understanding | ~86 | 77.4 | 83.9 | 81.2 |
| SimpleVQA: Visual Factuality | ~75 | 62.2 | 72.4 | 61.1 |
| TEXT / REASONING | | | | |
| Humanity's Last Exam: Multidisciplinary Reasoning | ~48 | 40.0 | 45.4 | 43.9 |
| ARC AGI 2: Abstract Reasoning Puzzles | ~79 | 63.3 | 76.5 | 76.1 |
| GPQA Diamond: PhD Level Reasoning | ~96 | 92.7 | 94.3 | 92.8 |
| HEALTH | | | | |
| HealthBench Hard: Open-Ended Health Queries | ~47 | 14.8 | 20.6 | 40.1 |
| MedXpertQA (Text): Medical Multiple Choice | ~74 | 52.1 | 71.5 | 59.6 |
| MedXpertQA (MM): Medical Multimodal | ~84 | 64.8 | 81.3 | 77.1 |
| AGENTIC | | | | |
| DeepSearchQA: Agentic Search | ~78 | 73.7 | 69.7 | 73.6 |
| SWE-Bench Verified: Agentic Coding | — | 80.8 | 80.6 | — |
| Terminal-Bench 2.0: Agentic Terminal Coding | — | 65.4 | 68.5 | 75.1 |
| CONSENSUS-ONLY | | | | |
| Hallucination Detection: False claims caught via model disagreement | ~87 | 0 | 0 | 0 |
| Divergence Mapping: Identifies genuine uncertainty in answers | ~94 | N/A | N/A | N/A |
| Cross-Domain Coverage: Aspects covered from multiple knowledge bases | ~94 | ~78 | ~71 | ~76 |
Satcove scores (~) are estimates: max(individual model scores) + cross-validation bonus (2–7% depending on domain). Health gets the highest bonus because medical cross-checking catches dangerous errors. Agentic/coding benchmarks marked “—” require single-agent execution incompatible with consensus. Individual model scores sourced from published benchmarks.
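The estimate formula can be sketched in a few lines of Python. This is an illustration only, not Satcove's actual implementation: the function names are hypothetical, and treating the bonus as additive percentage points is an assumption (the text says only "2–7% depending on domain"). The scores are copied from the table; the per-row bonus is not published, so the sketch derives the bonus each row implies.

```python
# Scores copied from the benchmark table.
# Tuple layout: (Satcove estimate, Claude best, Gemini Pro, GPT best).
ROWS = {
    "CharXiv Reasoning": (89, 65.3, 80.2, 82.8),
    "GPQA Diamond": (96, 92.7, 94.3, 92.8),
    "HealthBench Hard": (47, 14.8, 20.6, 40.1),
    "DeepSearchQA": (78, 73.7, 69.7, 73.6),
}


def consensus_estimate(model_scores, bonus_points):
    """Best individual score plus a cross-validation bonus.

    Assumption: the bonus is additive, in percentage points.
    """
    return max(model_scores) + bonus_points


def implied_bonus(satcove_estimate, model_scores):
    """Bonus implied by a table row: the estimate minus the best model score."""
    return round(satcove_estimate - max(model_scores), 1)


for name, (estimate, *models) in ROWS.items():
    bonus = implied_bonus(estimate, models)
    print(f"{name}: best model = {max(models)}, implied bonus = {bonus}")
```

Because the published estimates are rounded, the implied bonuses are approximate. For the rows shown, the largest implied bonus belongs to HealthBench Hard, which matches the statement that health receives the highest bonus.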
The harder and more nuanced the question, the lower the agreement — and the higher the risk of acting on a wrong single-model answer.
| Metric | Value | Scope |
|---|---|---|
| Average agreement | 59% | Across 15 sessions, all question types |
| Strong disagreement | 40% | Questions where agreement fell below 50% |
| Strong consensus | 20% | Questions with 80%+ agreement |
| Highest recorded | 95% | Basic medical fact (bowel frequency) |
| Question type | Average agreement |
|---|---|
| Basic medical facts (normal ranges, common symptoms) | 90% |
| Mathematics & logic (calculations, logical deductions) | 88% |
| Historical events (dates, figures, documented facts) | 82% |
| General science (physics, chemistry, biology) | 78% |
| Financial data (market principles, economic claims) | 65% |
| Current events (news, recent company updates) | 58% |
| General legal principles (contract basics, consumer rights) | 52% |
| Jurisdiction-specific law (French inheritance, EU regulations) | 34% |
Actual questions from Satcove sessions. What a single AI answered — and what the consensus revealed.
“Can a PEL savings account be transferred to an heir after the owner's death?”
Agreement: 30%
Single AI: One AI answered yes, with full confidence, that the PEL could be transferred with family agreement. A user handling their father's estate could have made incorrect financial decisions.
Satcove Consensus: Two models gave directly opposite legal positions. The 30% agreement score flagged this as a critical uncertainty before anyone acted on a wrong answer.
“Is it safe to take ibuprofen with blood pressure medication?”
Agreement: 92%
Single AI: A single AI gave a generic 'generally no' without naming specific drug classes, dosage nuance, or emergency warning signs.
Satcove Consensus: Strong consensus. All models agreed that ibuprofen can reduce medication effectiveness and increase kidney damage risk. Each model added specific drug class contraindications.
“Why did a specific hotel change its branding?”
Agreement: 56%
Single AI: One AI provided specific ownership names, brand affiliations, and exact dates, all invented and presented as verified fact.
Satcove Consensus: Another model flagged the entire account as fabricated. The disagreement caught the hallucination; a single model would have delivered the false narrative as fact.
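The thresholds quoted in the summary figures (below 50% for strong disagreement, 80% and above for strong consensus) can be applied to the three example questions as a rough sketch. The function name and the "mixed" label for the middle band are hypothetical; the source does not name that band, and it does not describe how the agreement score itself is computed, so the scores are taken as given.

```python
# Thresholds taken from the summary figures in this section.
STRONG_CONSENSUS_MIN = 80    # "Questions with 80%+ agreement"
STRONG_DISAGREEMENT_MAX = 50  # "agreement fell below 50%"


def agreement_band(agreement_pct):
    """Map an agreement percentage to a band.

    The "mixed" label is an assumption; the source names only the two outer bands.
    """
    if agreement_pct >= STRONG_CONSENSUS_MIN:
        return "strong consensus"
    if agreement_pct < STRONG_DISAGREEMENT_MAX:
        return "strong disagreement"
    return "mixed"


# Agreement scores reported for the three example questions.
EXAMPLES = {
    "PEL transfer to an heir": 30,
    "Ibuprofen with blood pressure medication": 92,
    "Hotel rebranding": 56,
}

for question, pct in EXAMPLES.items():
    print(f"{question}: {pct}% -> {agreement_band(pct)}")
```

Under these thresholds the PEL question falls in the strong disagreement band, the ibuprofen question in strong consensus, and the hotel question in between, which is where a fabricated answer can pass unnoticed unless the dissenting model is surfaced.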
Test the agreement score on your own questions
Paste a medical, legal, or financial question. See where 6 models agree — and where they don't.
Satcove: a product by Abyssal Group