Benchmark

Satcove Consensus vs single-model AI

Published academic benchmark scores for individual models, compared with Satcove consensus estimates. Satcove scores (marked ~) are estimates, calculated as max(individual model scores) + a cross-validation bonus. Real session data follows below.

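MULTIMODAL

| Benchmark | Focus | Satcove Consensus | Claude Best | Gemini Pro | GPT Best |
| --- | --- | --- | --- | --- | --- |
| CharXiv Reasoning | Figure Understanding | ~89 | 65.3 | 80.2 | 82.8 |
| MMMU Pro | Multimodal Understanding | ~86 | 77.4 | 83.9 | 81.2 |
| SimpleVQA | Visual Factuality | ~75 | 62.2 | 72.4 | 61.1 |
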
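TEXT / REASONING

| Benchmark | Focus | Satcove Consensus | Claude Best | Gemini Pro | GPT Best |
| --- | --- | --- | --- | --- | --- |
| Humanity's Last Exam | Multidisciplinary Reasoning | ~48 | 40.0 | 45.4 | 43.9 |
| ARC AGI 2 | Abstract Reasoning Puzzles | ~79 | 63.3 | 76.5 | 76.1 |
| GPQA Diamond | PhD Level Reasoning | ~96 | 92.7 | 94.3 | 92.8 |
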
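HEALTH

| Benchmark | Focus | Satcove Consensus | Claude Best | Gemini Pro | GPT Best |
| --- | --- | --- | --- | --- | --- |
| HealthBench Hard | Open-Ended Health Queries | ~47 | 14.8 | 20.6 | 40.1 |
| MedXpertQA (Text) | Medical Multiple Choice | ~74 | 52.1 | 71.5 | 59.6 |
| MedXpertQA (MM) | Medical Multimodal | ~84 | 64.8 | 81.3 | 77.1 |
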
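AGENTIC

| Benchmark | Focus | Satcove Consensus | Claude Best | Gemini Pro | GPT Best |
| --- | --- | --- | --- | --- | --- |
| DeepSearchQA | Agentic Search | ~78 | 73.7 | 69.7 | 73.6 |
| SWE-Bench Verified | Agentic Coding | — | 80.8 | 80.6 | |
| Terminal-Bench 2.0 | Agentic Terminal Coding | — | 65.4 | 68.5 | 75.1 |
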
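CONSENSUS-ONLY

| Benchmark | Focus | Satcove Consensus | Claude Best | Gemini Pro | GPT Best |
| --- | --- | --- | --- | --- | --- |
| Hallucination Detection | False claims caught via model disagreement | ~87 | 0 | 0 | 0 |
| Divergence Mapping | Identifies genuine uncertainty in answers | ~94 | N/A | N/A | N/A |
| Cross-Domain Coverage | Aspects covered from multiple knowledge bases | ~94 | ~78 | ~71 | ~76 |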

Satcove scores (~) are estimates: max(individual model scores) + cross-validation bonus (2–7% depending on domain). Health gets the highest bonus because medical cross-checking catches dangerous errors. Agentic/coding benchmarks marked “—” require single-agent execution incompatible with consensus. Individual model scores sourced from published benchmarks.
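The estimate formula can be sketched in a few lines of Python. This is a minimal illustration, not Satcove's actual implementation: the function names are hypothetical, and the page does not say whether the bonus is added as percentage points or applied multiplicatively, so the sketch assumes additive points. It uses three rows from the benchmark table and works backwards to the bonus each row implies.

```python
# Three rows from the benchmark table:
# (Claude, Gemini, GPT) published scores and the page's Satcove estimate.
ROWS = {
    "CharXiv Reasoning": ((65.3, 80.2, 82.8), 89),
    "GPQA Diamond": ((92.7, 94.3, 92.8), 96),
    "HealthBench Hard": ((14.8, 20.6, 40.1), 47),
}


def consensus_estimate(model_scores, bonus_points):
    """max(individual model scores) + bonus (assumed to be additive points)."""
    return max(model_scores) + bonus_points


def implied_bonus(model_scores, satcove_estimate):
    """Bonus implied by a table row: Satcove estimate minus the best single model."""
    return round(satcove_estimate - max(model_scores), 1)


for name, (scores, estimate) in ROWS.items():
    bonus = implied_bonus(scores, estimate)
    print(f"{name}: best single model {max(scores)}, implied bonus {bonus} points")
```

Under the additive assumption, the implied bonus for these three rows is 6.2, 1.7, and 6.9 points respectively.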

Real agreement scores — 15 Satcove sessions

The harder and more nuanced the question, the lower the agreement — and the higher the risk of acting on a wrong single-model answer.

| Metric | Value | Scope |
| --- | --- | --- |
| Average agreement | 59% | Across 15 sessions, all question types |
| Strong disagreement | 40% | Questions where agreement fell below 50% |
| Strong consensus | 20% | Questions with 80%+ agreement |
| Highest recorded | 95% | Basic medical fact (bowel frequency) |

| Question type | Examples | Average agreement |
| --- | --- | --- |
| Basic medical facts | normal ranges, common symptoms | 90% |
| Mathematics & logic | calculations, logical deductions | 88% |
| Historical events | dates, figures, documented facts | 82% |
| General science | physics, chemistry, biology | 78% |
| Financial data | market principles, economic claims | 65% |
| Current events | news, recent company updates | 58% |
| General legal principles | contract basics, consumer rights | 52% |
| Jurisdiction-specific law | French inheritance, EU regulations | 34% |
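The two thresholds stated with the session statistics (below 50% for strong disagreement, 80% or more for strong consensus) can be applied to the per-topic averages as a simple banding rule. The sketch below is illustrative only: the function name and the "mixed" label for the middle band are assumptions, and the page does not describe how the agreement score itself is computed.

```python
STRONG_DISAGREEMENT_BELOW = 50  # "agreement fell below 50%"
STRONG_CONSENSUS_FROM = 80      # "80%+ agreement"

# Average agreement per question type, as listed on the page.
AVERAGE_AGREEMENT = {
    "Basic medical facts": 90,
    "Mathematics & logic": 88,
    "Historical events": 82,
    "General science": 78,
    "Financial data": 65,
    "Current events": 58,
    "General legal principles": 52,
    "Jurisdiction-specific law": 34,
}


def agreement_band(score_percent):
    """Map an agreement percentage to a band using the page's two thresholds."""
    if score_percent < STRONG_DISAGREEMENT_BELOW:
        return "strong disagreement"
    if score_percent >= STRONG_CONSENSUS_FROM:
        return "strong consensus"
    return "mixed"  # hypothetical label; the page does not name the middle band


bands = {topic: agreement_band(score) for topic, score in AVERAGE_AGREEMENT.items()}
for topic, band in bands.items():
    print(f"{topic}: {AVERAGE_AGREEMENT[topic]}% -> {band}")
```

With these thresholds, only jurisdiction-specific law falls into strong disagreement, while basic medical facts, mathematics & logic, and historical events reach strong consensus.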

Real questions. Real disagreements.

Actual questions from Satcove sessions. What a single AI answered — and what the consensus revealed.

Legal

Can a PEL savings account be transferred to an heir after the owner's death?

Agreement: 30%

Single AI: One AI answered yes, with full confidence, that the PEL (a French home-savings plan) could be transferred with family agreement. A user handling their father's estate could have made incorrect financial decisions based on that answer.

Satcove Consensus: Two models gave directly opposite legal positions. The 30% agreement score flagged this as a critical uncertainty before anyone acted on a wrong answer.

Health

Is it safe to take ibuprofen with blood pressure medication?

Agreement: 92%

Single AI: A single AI gave a generic "generally no" without naming specific drug classes, dosage nuances, or emergency warning signs.

Satcove Consensus: Strong consensus. All models agreed that ibuprofen can reduce the effectiveness of blood pressure medication and increase the risk of kidney damage. Each model added contraindications for specific drug classes.

Hallucination

Why did a specific hotel change its branding?

Agreement: 56%

Single AI: One AI provided specific ownership names, brand affiliations, and exact dates, all invented and presented as verified fact.

Satcove Consensus: Another model flagged the entire account as fabricated. The disagreement caught the hallucination; a single model would have delivered the false narrative as fact.

Test the agreement score on your own questions

Paste a medical, legal, or financial question. See where 6 models agree — and where they don't.


Satcove — A product by Abyssal Group