The short version: we asked six leading AI models the same 75 real-world, high-stakes questions. On 40% of them, the models gave materially different recommendations — and on several, flatly opposite advice. The average agreement score across all 75 questions was just 79/100. Most striking: the higher the stakes, the more the models disagreed.
If you ask one AI a question that actually matters — a health decision, a legal risk, a money move — you have no way of knowing whether you landed in the 60% where the models agree, or the 40% where they don't. A single model never tells you "the other five would disagree with me." That blind spot is the whole reason this study exists.
The finding nobody expects: stakes up, agreement down
You might assume AI models, trained on overlapping data, mostly converge. They do — on low-stakes questions. But the disagreement rate climbs exactly where it hurts most:
| Domain | Questions where models disagreed |
|---|---|
| Life decisions | 59% |
| Health | 50% |
| Legal | 46% |
| Finance | 23% |
| Predictions | 20% |
| Consumer choices | 17% |
Read that again. On health and legal questions — the ones where being wrong is most costly — the models disagreed roughly half the time. The domains where you'd most want a second opinion are precisely where one AI is least reliable.
Five cases where the AIs gave opposite advice
These aren't edge cases. They're ordinary questions millions of people ask:
-
"Is it safe to take ibuprofen and acetaminophen at the same time?" Gemini said no — space them out. Claude, GPT-4o, Mistral and Perplexity all said yes, it's generally safe to take them together. One model out of six would have changed how you medicate.
-
"Should I withdraw from my retirement account to pay off €15k of credit-card debt at 20% APR?" Gemini recommended doing it. Claude, GPT-4o and Perplexity recommended against it, treating early withdrawal as a last resort. Opposite money advice, stated with equal confidence.
-
"Is it safe to drink alcohol while taking metronidazole?" All six agreed you must avoid alcohol — but on the post-treatment waiting period they split: 48 hours (Claude, GPT-4o, Mistral), 72 hours (Gemini), "2–3 days" (Perplexity). A materially different safety caveat depending on which AI you happened to open.
-
"A coworker took credit for my work — confront them or go to HR?" Gemini said go straight to HR. Every other model said talk to the coworker first.
-
"Is it safe to take ibuprofen if I'm on lisinopril for blood pressure?" GPT-4o framed occasional use as "typically fine"; Claude, Gemini and Perplexity framed it as generally not recommended — a different default for a real drug interaction.
When models disagree like this, a single-AI answer isn't an answer — it's a coin flip you can't see.
How we ran it (method)
Transparency is the point, so here is exactly what we did:
- 75 questions across six domains: health, legal, finance, life decisions, predictions, and consumer choices — all phrased as real decisions a person would act on.
- Six models, one per major vendor: Claude (Anthropic), GPT-4o (OpenAI), Gemini (Google), Mistral, Perplexity, and Grok (xAI). Each got the same prompt, no system steering beyond "answer directly and give a clear bottom line."
- A vendor-disjoint judge. A separate model read all six answers per question and classified them as Agree (same bottom-line recommendation), Partial (same direction, materially different caveats a user would act on), or Split (opposing actionable recommendations), plus a 0–100 agreement score. The judge is never the same vendor as the answers it grades — no model grades its own homework.
- "Disagreement" in the headline = Split + Partial (40%). Pure opposite-advice Splits alone were 5%. Average agreement score: 79/100.
The full result set (every question, every model's stance, every verdict) is reproducible — this is a snapshot, not a one-off anecdote.
What this means if you use AI for real decisions
One model gives you one confident answer and hides the disagreement. That's fine for "write me an email." It's dangerous for "should I take these two drugs together" or "should I touch my retirement account."
The fix isn't to find the "best" AI — our data shows no single model was consistently right, and the "best" model flips by domain. The fix is to see the disagreement: ask several models, surface where they diverge, and treat a low agreement score as a flashing light that says slow down, get a human expert. That cross-vendor, contradiction-first approach is exactly what a consensus engine does, and why one AI isn't enough for decisions that matter.
Honest limitations
This is a 75-question snapshot with one model per vendor and an LLM-based judge — not a peer-reviewed clinical trial. Different phrasings, model versions, or a human panel of judges would shift the exact percentages. What we're confident about is the direction: meaningful cross-model disagreement is common, it's concentrated in high-stakes domains, and a single model never warns you when you're in it.
Methodology questions or want the raw data? The study was run by the team behind Satcove, which asks your question to six AI models at once and returns one synthesized verdict with an agreement score — so you always see where the models agree, and where they don't.