A monthly-updated public dataset tracking how often the six leading consumer AI models — Claude, GPT, Gemini, Mistral, Perplexity, Grok — agree, diverge, and hallucinate across ten question categories. Built from anonymized Satcove production data on real user queries.
The first full publication of the Index will land once enough anonymized consensus queries have accumulated for statistically meaningful per-category numbers. Until then this page documents the methodology — what gets measured, how the data is collected, and how the scores are computed — so when the first issue ships the methodology is already public and reviewable.
| Category | What gets tracked |
|---|---|
| Quick factual | Agreement score, hallucination rate, satisfaction — measured on real anonymized Satcove consensus queries. |
| Long-form reasoning | Agreement score, hallucination rate, satisfaction — measured on real anonymized Satcove consensus queries. |
| Code review | Agreement score, hallucination rate, satisfaction — measured on real anonymized Satcove consensus queries. |
| Creative writing | Agreement score, hallucination rate, satisfaction — measured on real anonymized Satcove consensus queries. |
| Legal interpretation | Agreement score, hallucination rate, satisfaction — measured on real anonymized Satcove consensus queries. |
| Medical context | Agreement score, hallucination rate, satisfaction — measured on real anonymized Satcove consensus queries. |
| Financial analysis | Agreement score, hallucination rate, satisfaction — measured on real anonymized Satcove consensus queries. |
| Technical architecture | Agreement score, hallucination rate, satisfaction — measured on real anonymized Satcove consensus queries. |
| Sensitive ethical | Agreement score, hallucination rate, satisfaction — measured on real anonymized Satcove consensus queries. |
| Current events | Agreement score, hallucination rate, satisfaction — measured on real anonymized Satcove consensus queries. |
Target sample for the first publication: at least 500 anonymized consensus queries per category. We will only publish when the sample is large enough to be honest about the numbers — no preliminary low-confidence figures masquerading as a benchmark.
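The publication rule above amounts to a simple gate on per-category counts. The following is a minimal sketch, assuming the first issue waits until every category reaches the target (the text does not say whether categories could be released individually). The function names and the example counts are hypothetical.

```python
MIN_QUERIES_PER_CATEGORY = 500  # target stated for the first publication


def categories_below_target(counts, minimum=MIN_QUERIES_PER_CATEGORY):
    """Return the categories whose query count is still under the target."""
    return sorted(c for c, n in counts.items() if n < minimum)


def ready_to_publish(counts, minimum=MIN_QUERIES_PER_CATEGORY):
    """Assumed rule: publish only when no category is below the target."""
    return bool(counts) and not categories_below_target(counts, minimum)


# Made-up counts for illustration only; these are not Satcove figures.
example_counts = {"Quick factual": 812, "Legal interpretation": 347}
print(categories_below_target(example_counts))
print(ready_to_publish(example_counts))
```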
Where the six AIs converge vs diverge by category
Per-category agreement scores will quantify how often Claude, GPT, Gemini, Mistral, Perplexity and Grok reach the same conclusion. Early observation from internal data: agreement is highest on quick factual questions and lowest on legal interpretation, medical context, and sensitive ethical topics — the first Index will publish the statistically meaningful numbers.
Hallucination rate by category, not a single average
Proportion of responses containing at least one verifiably false specific. The category breakdown is the useful part — single-number hallucination averages hide the fact that error rates vary several-fold across question types.
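To illustrate why the breakdown matters, here is a small sketch that computes the rate per category next to the pooled average. The records are invented toy data, not Index results, and the record layout and function names are assumptions made for this example.

```python
from collections import defaultdict


def hallucination_rates(records):
    """Per-category share of responses flagged with a false specific.

    records: iterable of (category, has_false_specific) pairs.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for category, has_false_specific in records:
        totals[category] += 1
        flagged[category] += int(has_false_specific)
    return {c: flagged[c] / totals[c] for c in totals}


def pooled_rate(records):
    """Single average over all responses, ignoring category."""
    records = list(records)
    return sum(int(flag) for _, flag in records) / len(records)


# Toy data for illustration only: 1 of 10 flagged vs 4 of 10 flagged.
toy_records = (
    [("Quick factual", False)] * 9
    + [("Quick factual", True)]
    + [("Legal interpretation", False)] * 6
    + [("Legal interpretation", True)] * 4
)
per_category = hallucination_rates(toy_records)
overall = pooled_rate(toy_records)
```

In this toy case the pooled figure is 0.25, while the two categories sit at 0.1 and 0.4, so the single average describes neither of them.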
Per-model category leadership
Which model performs best on which category. We expect no single model to dominate across all of them; the Index is intended to provide structured evidence that "just use the one best model" is the wrong strategy in 2026.
What is the Satcove Accuracy Index?
The Satcove Accuracy Index is a monthly-updated public dataset tracking how often the six leading consumer AI models (Claude, GPT, Gemini, Mistral, Perplexity, Grok) agree, diverge, and hallucinate across ten question categories. It is built from anonymized Satcove production data on actual user queries.
How is the agreement score calculated?
The score blends semantic similarity (embedding-based pairwise comparison of the six answers) with structural-direction agreement (do the models reach the same conclusion?). The blend is 40% semantic + 60% structural, clamped to [15, 95]. Full methodology is on the benchmark page.
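The blend can be sketched in a few lines. This is an illustration under stated assumptions, not Satcove's implementation: it assumes both components are expressed on a 0 to 100 scale (the text does not state the scale), that semantic similarity is the mean cosine similarity over all answer pairs, and that the structural score is supplied by some other step. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over every pair of answers (15 pairs for six)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def agreement_score(semantic, structural):
    """40% semantic + 60% structural, clamped to [15, 95].

    Both inputs are assumed to be on a 0-100 scale.
    """
    blended = 0.4 * semantic + 0.6 * structural
    return max(15.0, min(95.0, blended))


def semantic_component(embeddings):
    """Assumed mapping: mean cosine, floored at 0, rescaled to 0-100."""
    return 100.0 * max(0.0, mean_pairwise_similarity(embeddings))
```

One consequence of the clamp is that the score never reads as total agreement or total disagreement: even six identical answers top out at 95.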
Is the data downloadable?
Yes. The CSV and JSON exports are available at the bottom of this page. The data is the same data Satcove publishes monthly, with all user content anonymized.
Can I cite this index in a paper or article?
Yes — citation is encouraged. The recommended format: 'Satcove Accuracy Index, [month] 2026 — https://satcove.com/accuracy-index'. If you embed our data in your own analysis, the methodology must remain attributable.
Which AI model is most accurate in 2026?
There is no single most-accurate AI in 2026. Early internal data, pending the first published Index, suggests that accuracy depends sharply on question category: Claude leads on long-form reasoning and ethical questions, GPT leads on creative writing, Perplexity leads on current events, and most categories show no clear single winner. The structural answer is to use multiple models and read the agreement score.
How is hallucination measured?
Hallucination rate is the proportion of model responses containing at least one verifiably false specific (fabricated citation, invented statistic, made-up name). Each response is fact-checked manually by the Satcove research team. Published percentages will be reported with confidence intervals across the test corpus.
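The text does not say how the confidence intervals are computed. As one common choice for a proportion, the sketch below uses the Wilson score interval; treating it as the Index's method is an assumption, and the function name and default `z` value (1.96, roughly a 95% interval) are illustrative.

```python
from math import sqrt


def wilson_interval(flagged, total, z=1.96):
    """Wilson score interval for a proportion (flagged out of total responses).

    Assumption: this is one standard interval, not necessarily the one
    the Index uses.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= flagged <= total:
        raise ValueError("flagged must be between 0 and total")
    p = flagged / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: 50 flagged responses out of 500 fact-checked.
low, high = wilson_interval(50, 500)
print(f"rate 10.0%, interval {low:.1%} to {high:.1%}")
```

The interval narrows as the number of fact-checked responses grows, which is the practical reason for waiting on a minimum per-category sample before publishing.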
Is this peer-reviewed?
Not formally — the Index is a product-research artifact, not an academic paper. The Stanford HAI 2026 AI Index, the MIT AI Index, and the LMSYS Arena are the formal academic alternatives. We view our Index as complementary: smaller corpus, more frequent updates, real-user prompt distribution.
How often is the Index updated?
Monthly, on the first Monday. The update includes the previous month's data and any methodology revisions in changelog form.
Can I embed this index on my site?
Yes, an embeddable widget is available. Contact us via the contact page for the embed code and attribution requirements.
What is the sample size?
Approximately 5,000 anonymized consensus queries per month, spread across the ten categories. The corpus rotates — we do not republish the same prompts. Each monthly Index reflects fresh data.
An embeddable widget displays the live monthly Index on third-party sites. Useful for journalists, educators, and tech blogs covering AI accuracy. Reach out via the contact page for the embed code and attribution requirements.
Want the agreement score on your own question?
The same engine that builds this Index runs on your queries when you use Satcove.
Try Satcove free

Satcove Accuracy Index, a product by Abyssal Group. Data licensed CC-BY 4.0.