First publication scheduled · Data collection in progress

Satcove Accuracy Index 2026

A monthly-updated public dataset tracking how often the six leading consumer AI models — Claude, GPT, Gemini, Mistral, Perplexity, Grok — agree, diverge, and hallucinate across ten question categories. Built from anonymized Satcove production data on real user queries.

Categories tracked by the Index

The first full publication of the Index will land once enough anonymized consensus queries have accumulated for statistically meaningful per-category numbers. Until then this page documents the methodology — what gets measured, how the data is collected, and how the scores are computed — so when the first issue ships the methodology is already public and reviewable.

Category | What gets tracked
Quick factual | Agreement score, hallucination rate, satisfaction; measured on real anonymized Satcove consensus queries.
Long-form reasoning | Agreement score, hallucination rate, satisfaction; measured on real anonymized Satcove consensus queries.
Code review | Agreement score, hallucination rate, satisfaction; measured on real anonymized Satcove consensus queries.
Creative writing | Agreement score, hallucination rate, satisfaction; measured on real anonymized Satcove consensus queries.
Legal interpretation | Agreement score, hallucination rate, satisfaction; measured on real anonymized Satcove consensus queries.
Medical context | Agreement score, hallucination rate, satisfaction; measured on real anonymized Satcove consensus queries.
Financial analysis | Agreement score, hallucination rate, satisfaction; measured on real anonymized Satcove consensus queries.
Technical architecture | Agreement score, hallucination rate, satisfaction; measured on real anonymized Satcove consensus queries.
Sensitive ethical | Agreement score, hallucination rate, satisfaction; measured on real anonymized Satcove consensus queries.
Current events | Agreement score, hallucination rate, satisfaction; measured on real anonymized Satcove consensus queries.

Target sample for the first publication: at least 500 anonymized consensus queries per category. We will only publish when the sample is large enough to be honest about the numbers — no preliminary low-confidence figures masquerading as a benchmark.

What the first Index will show

Where the six AIs converge vs diverge by category

Per-category agreement scores will quantify how often Claude, GPT, Gemini, Mistral, Perplexity and Grok reach the same conclusion. Early observation from internal data: agreement is highest on quick factual questions and lowest on legal interpretation, medical context, and sensitive ethical topics — the first Index will publish the statistically meaningful numbers.

Hallucination rate by category, not a single average

The hallucination rate is the proportion of responses containing at least one verifiably false specific. The category breakdown is the useful part: a single-number hallucination average hides the fact that error rates vary several-fold across question types.
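
As a rough illustration of why the breakdown matters, the sketch below computes the rate per category and the pooled single average from the same records. The record shape (a category label plus a count of verifiably false specifics per response) and the toy values are assumptions made for this example; they are not Index data or the Satcove pipeline.

```python
from collections import defaultdict


def hallucination_rates(records):
    """Per-category share of responses with at least one false specific.

    `records` is an iterable of (category, false_specific_count) pairs.
    This shape is hypothetical, chosen only for the illustration.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for category, false_count in records:
        totals[category] += 1
        if false_count >= 1:  # one false specific is enough to flag a response
            flagged[category] += 1
    return {category: flagged[category] / totals[category] for category in totals}


def pooled_rate(records):
    """Single-number average across all responses, ignoring category."""
    records = list(records)
    flagged = sum(1 for _, false_count in records if false_count >= 1)
    return flagged / len(records)


# Toy records with invented counts: category "A" has 1 flagged response
# out of 10, category "B" has 4 flagged out of 10.
toy_records = (
    [("A", 0)] * 9 + [("A", 2)] * 1 +
    [("B", 0)] * 6 + [("B", 1)] * 4
)

per_category = hallucination_rates(toy_records)
overall = pooled_rate(toy_records)
print(per_category, overall)
```

In this toy case the pooled figure sits between the two category rates and describes neither of them well, which is the effect the per-category view is meant to expose.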

Per-model category leadership

The Index will report which model performs best in each category. We expect no single model to dominate across all of them, and the Index is intended to provide structured evidence that "just use the one best model" is the wrong strategy in 2026.

How to use the Index

  • For end users: match your question to a category. If the category has high agreement (above 75%), one AI answer is probably enough. If it has low agreement (below 60%), use a consensus engine like Satcove.
  • For researchers: download the CSV/JSON, cite the dataset, build derivative analyses. The license is CC-BY 4.0 — attribution required, derivatives permitted.
  • For journalists: the Index gives you a citable, monthly-fresh source on AI accuracy for stories. Embedding the widget in your article displays the latest data automatically.
  • For developers: the JSON export is structured for programmatic ingestion. Use it to inform model-routing logic in your own multi-AI systems.
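
The end-user thresholds and the developer use case above can be combined into a small routing sketch. The JSON shape, the field names (`categories`, `name`, `agreement`), and the sample values are all hypothetical: this page does not document the export schema, and the numbers are placeholders rather than Index results. The label for the 60 to 75 band is also an assumption, since the guidance above only covers the two ends.

```python
import json

# Hypothetical export shape; the real JSON schema is not documented on this page.
# The agreement values are placeholders, not Index results.
SAMPLE_EXPORT = """
{
  "categories": [
    {"name": "example_high", "agreement": 82.0},
    {"name": "example_mid", "agreement": 68.0},
    {"name": "example_low", "agreement": 54.0}
  ]
}
"""

HIGH_AGREEMENT = 75.0  # above this, one AI answer is probably enough
LOW_AGREEMENT = 60.0   # below this, use a consensus engine


def recommend(agreement: float) -> str:
    """Map a category agreement score to a routing recommendation."""
    if agreement > HIGH_AGREEMENT:
        return "single_model"
    if agreement < LOW_AGREEMENT:
        return "consensus"
    # The page gives no guidance for the 60-75 band; this label is an assumption.
    return "unspecified"


def build_routing_table(export_text: str) -> dict[str, str]:
    """Parse the (assumed) export and return {category name: recommendation}."""
    data = json.loads(export_text)
    return {row["name"]: recommend(row["agreement"]) for row in data["categories"]}


routing = build_routing_table(SAMPLE_EXPORT)
print(routing)
```

A real integration would replace `SAMPLE_EXPORT` with the downloaded file and decide explicitly how to treat the middle band.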

Frequently asked questions

What is the Satcove Accuracy Index?

The Satcove Accuracy Index is a monthly-updated public dataset tracking how often the six leading consumer AI models (Claude, GPT, Gemini, Mistral, Perplexity, Grok) agree, diverge, and hallucinate across ten question categories. It is built from anonymized Satcove production data on actual user queries.

How is the agreement score calculated?

The score blends semantic similarity (embedding-based pairwise comparison of the six answers) with structural-direction agreement (do the models reach the same conclusion?). The blend is 40% semantic + 60% structural, clamped to [15, 95]. Full methodology is on the benchmark page.
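
A minimal sketch of the blend, for readers who want to see the arithmetic. Only the 40/60 weighting and the [15, 95] clamp come from the description above. The 0 to 100 scale for both components, the mean pairwise cosine similarity for the semantic part, and the majority-share measure for the structural part are assumptions made for illustration, not the documented Satcove method.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_component(embeddings):
    """Mean pairwise cosine similarity of the answers, scaled to 0-100.

    Assumed scaling: negative mean similarity is floored at 0.
    """
    pairs = list(combinations(embeddings, 2))
    mean_similarity = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return 100.0 * max(0.0, mean_similarity)


def structural_component(conclusions):
    """Share of models giving the most common conclusion, scaled to 0-100.

    Assumed measure; the page only says 'do the models reach the same conclusion?'.
    """
    top_count = Counter(conclusions).most_common(1)[0][1]
    return 100.0 * top_count / len(conclusions)


def agreement_score(semantic, structural):
    """40% semantic + 60% structural, clamped to [15, 95] as stated above."""
    blended = 0.4 * semantic + 0.6 * structural
    return max(15.0, min(95.0, blended))


# Components of 50 and 70 blend to 0.4 * 50 + 0.6 * 70 = 62.
print(agreement_score(50.0, 70.0))
```

One consequence of the clamp is that the score never reports total agreement or total disagreement: two components of 100 still yield 95, and two components of 0 still yield 15.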

Is the data downloadable?

Yes. The CSV and JSON exports are available at the bottom of this page. The data is the same data Satcove publishes monthly, with all user content anonymized.

Can I cite this index in a paper or article?

Yes — citation is encouraged. The recommended format: 'Satcove Accuracy Index, [month] 2026 — https://satcove.com/accuracy-index'. If you embed our data in your own analysis, the methodology must remain attributable.

Which AI model is most accurate in 2026?

There is no single most-accurate AI in 2026. Accuracy depends sharply on question category. Early internal data points to Claude leading on long-form reasoning and ethical questions, GPT on creative writing, and Perplexity on current events, with no clear single winner in most categories; the first published Index will report the statistically meaningful numbers. The practical answer is to use multiple models and read the agreement score.

How is hallucination measured?

Hallucination rate is the proportion of model responses containing at least one verifiably false specific (a fabricated citation, an invented statistic, or a made-up name). Each response is fact-checked manually by the Satcove research team. Published percentages will be reported with confidence intervals across the test corpus.
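
This page does not say which interval method is used. As one common option for a proportion, the sketch below computes a Wilson score interval at roughly 95% confidence (z = 1.96). The method choice, the function name, and the example counts are assumptions, not the documented Satcove procedure.

```python
from math import sqrt


def wilson_interval(flagged, total, z=1.96):
    """Wilson score interval for a proportion.

    `flagged` is the number of responses with at least one false specific,
    `total` is the number of responses checked. Returns (lower, upper).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = flagged / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Invented example: 50 flagged responses out of 500 checked (observed rate 10%).
lower, upper = wilson_interval(50, 500)
print(f"{lower:.3f} to {upper:.3f}")
```

With a sample of 500, the interval around an observed 10% rate spans a few percentage points on each side, which gives a sense of the precision a per-category target of that size can support.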

Is this peer-reviewed?

Not formally — the Index is a product-research artifact, not an academic paper. The Stanford HAI 2026 AI Index, the MIT AI Index, and the LMSYS Arena are the formal academic alternatives. We view our Index as complementary: smaller corpus, more frequent updates, real-user prompt distribution.

How often is the Index updated?

Monthly, on the first Monday. The update includes the previous month's data and any methodology revisions in changelog form.

Can I embed this index on my site?

Yes, an embeddable widget is available. Contact us via the contact page for the embed code and attribution requirements.

What is the sample size?

Approximately 5,000 anonymized consensus queries per month, spread across the ten categories. The corpus rotates — we do not republish the same prompts. Each monthly Index reflects fresh data.

Embed the Index

An embeddable widget displays the live monthly Index on third-party sites. Useful for journalists, educators, and tech blogs covering AI accuracy. Reach out via the contact page for the embed code and attribution requirements.

Want the agreement score on your own question?

The same engine that builds this Index runs on your queries when you use Satcove.

Try Satcove free

Satcove Accuracy Index — A product by Abyssal Group. Data licensed CC-BY 4.0.