What is an AI Agreement Score?

A 60-second answer

An AI agreement score is the quantitative summary of how much a multi-model panel converged on the same answer. It is a single number — typically expressed as a percentage or on a labelled scale — that compresses the panel's collective behaviour into a calibrated confidence signal. High score: the models agreed; the user has strong reason to trust the convergent claims. Low score: the models split; the user has explicit information that the topic is contested or under-supported.

The score is not a "probability that the answer is true". It measures how strong the multi-model signal was. A high score correlates with a higher likelihood of correctness, but that relationship holds only as far as the score has been calibrated for a given panel; it is not a guarantee of truth. The score is valuable precisely because it is honest about that distinction.

What the score measures

A meaningful agreement score combines three measurements.

Coverage. What fraction of the panel produced the convergent claim. Five out of six models agreeing is different from three out of six. Coverage is the simplest dimension and the easiest to communicate.

Intensity. How tightly the agreeing models matched each other. Two models agreeing word-for-word on a specific fact provide stronger evidence than two models loosely concurring on a general direction. Intensity captures the semantic tightness of the agreement.

Diversity-adjusted weight. Whether the agreement comes from genuinely independent models (high weight) or from models within the same family (lower weight, because their agreement is correlated by construction). Two Claude variants agreeing is not equivalent to a Claude and a Gemini agreeing.

A serious score folds all three dimensions into one number. A naïve score uses only coverage and treats every model equally, which inflates the score whenever the panel is internally redundant. The difference shows up in calibration: well-calibrated scores track actual correctness rates, while naïve scores overstate confidence.
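As a rough illustration of how the three measurements might be combined, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Vote` fields, the 0-to-1 `similarity` value standing in for intensity, and the `same_family_discount` of 0.5 applied to each additional model in a family. It is one possible formula, not the scoring method of any particular system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Vote:
    model: str         # hypothetical model identifier
    family: str        # hypothetical family label (models sharing a lineage)
    agrees: bool       # coverage: did this model produce the convergent claim?
    similarity: float  # intensity: 0..1 closeness to the convergent claim


def naive_score(votes):
    """Coverage only: every model counts as one equal vote."""
    if not votes:
        raise ValueError("empty panel")
    return sum(v.agrees for v in votes) / len(votes)


def agreement_score(votes, same_family_discount=0.5):
    """Coverage x intensity, with correlated models down-weighted.

    A family of n models carries a total weight of
    1 + same_family_discount * (n - 1), split equally among its members,
    so each extra sibling adds less than a full independent vote.
    """
    if not votes:
        raise ValueError("empty panel")
    family_size = Counter(v.family for v in votes)

    def weight(v):
        n = family_size[v.family]
        return (1 + same_family_discount * (n - 1)) / n

    total = sum(weight(v) for v in votes)
    agreeing = sum(weight(v) * v.similarity for v in votes if v.agrees)
    return agreeing / total


# Illustrative panel: three siblings from family "A" plus three independents.
panel = [
    Vote("a1", "A", True, 0.9),
    Vote("a2", "A", True, 0.9),
    Vote("a3", "A", True, 0.9),
    Vote("b1", "B", True, 0.9),
    Vote("c1", "C", False, 0.0),
    Vote("d1", "D", False, 0.0),
]
print(round(naive_score(panel), 2))      # about 0.67
print(round(agreement_score(panel), 2))  # about 0.54
```

On this made-up panel the coverage-only score is higher because three of the four agreeing models belong to one family; the weighted score discounts that redundancy and also reflects that the agreement was close but not exact.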

What the score is not

The agreement score is not a probability that the answer is true. It is a reading of the multi-model signal strength. The distinction matters because a high score across a panel that shares a training-data blind spot can be confidently wrong: convergence is high while accuracy is low. The score claims only what it measures, which is agreement, not truth.

The score is also not an aggregate quality score for the models. A panel that includes a weaker model alongside several strong ones can still produce a high agreement score on questions where the weaker model gets the same easy claim right. The score reads the situation, not the participants.

Finally, the score is not a substitute for reading the actual output. A score of 92% with one model dissenting on a key claim is worth a careful read of what that one model said. The score points to the right place; the user does the reading.

How the score is calibrated

A well-calibrated agreement score is built and tested against a holdout of questions with known correct answers. The system measures: at score X%, what fraction of the panel's convergent claims were actually correct in retrospect? This produces a calibration curve that ties scores to real-world correctness rates.
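The measurement described above can be sketched as a simple binning exercise. The function below assumes the holdout is available as pairs of (score in percent, whether the convergent claim turned out correct); the name `calibration_curve`, the 10-point bin width, and the sample data are all illustrative assumptions.

```python
def calibration_curve(records, bin_width=10):
    """Group holdout results into score bins and report observed correctness.

    records: iterable of (score_percent, was_correct) pairs.
    Returns {bin_start: (count, observed_correct_rate)}.
    """
    bins = {}
    for score, correct in records:
        # A score of exactly 100 is folded into the top bin.
        start = min(int(score // bin_width) * bin_width, 100 - bin_width)
        count, hits = bins.get(start, (0, 0))
        bins[start] = (count + 1, hits + (1 if correct else 0))
    return {
        start: (count, hits / count)
        for start, (count, hits) in sorted(bins.items())
    }


# Made-up holdout results, for illustration only.
holdout = [
    (92, True), (95, True), (91, False), (97, True),
    (72, True), (75, False),
]
curve = calibration_curve(holdout)
for start, (count, rate) in curve.items():
    print(f"{start}-{start + 10}%: n={count}, correct={rate:.0%}")
```

Each bin pairs a reported score range with the correctness rate actually observed in it. In this toy data the 90s bin is correct only three times out of four, which is the kind of gap calibration is meant to expose. A real holdout would need far more questions per bin for the rates to be meaningful.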

Calibration matters because uncalibrated scores invite over-reliance. A 90% score that actually corresponds to 75% correctness will be trusted more than it deserves; a 90% score that corresponds to 92% correctness can be trusted at face value. Honest systems calibrate explicitly and re-calibrate as the panel evolves.

Calibration is also domain-sensitive. The score that means "highly reliable" on factual claims about widely-documented topics may mean less on questions in narrow specialties. Serious systems calibrate per domain where the data supports it, and otherwise communicate the limit honestly.
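One way to picture "per domain where the data supports it" is a lookup that refuses to report a rate when the holdout is too thin. This is a sketch under stated assumptions: the triple format, the `min_samples` cut-off of 30, the helper names, and the domain labels are all hypothetical.

```python
def bin_start(score, bin_width=10):
    """Lower edge of the score bin; a score of 100 falls in the top bin."""
    return min(int(score // bin_width) * bin_width, 100 - bin_width)


def domain_rate(records, domain, score, min_samples=30, bin_width=10):
    """Observed correctness for one domain and score bin.

    records: iterable of (domain, score_percent, was_correct) triples.
    Returns None when fewer than `min_samples` holdout questions match,
    so the caller can say the score is uncalibrated for that domain
    instead of reporting a rate the data cannot support.
    """
    target = bin_start(score, bin_width)
    outcomes = [
        ok for d, s, ok in records
        if d == domain and bin_start(s, bin_width) == target
    ]
    if len(outcomes) < min_samples:
        return None
    return sum(outcomes) / len(outcomes)


# Made-up holdout: plenty of history questions, very few on tax law.
records = (
    [("history", 93, True)] * 9
    + [("history", 91, False)]
    + [("tax law", 92, True)] * 2
)
print(domain_rate(records, "history", 95, min_samples=10))
print(domain_rate(records, "tax law", 95, min_samples=10))
```

The first call returns a rate because ten history questions fall in the same score bin; the second returns `None` because two tax-law questions are not enough to calibrate against.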

How a user should read the score

A user encountering an agreement score should treat it as one input among several.

At very high scores (typically 90%+), the convergent claims can be trusted at the level appropriate to the underlying question. Still read the divergent claims, which usually exist even at high scores and often contain the most decision-useful detail.

At medium scores (roughly 60% up to 90%), the panel produced useful signal but the topic is partially contested. The convergent claims are likely reliable; the divergent claims deserve direct attention. This is the range where the user does the most reading.

At low scores (under 60%), the panel did not converge in any meaningful way. The output is more a map of disagreement than an answer. The user should treat it as raw material — useful for understanding the question, not for resolving it without further investigation.

The exact thresholds depend on the system's calibration. The general principle is that the score is a guide to how to read the output, not a verdict that bypasses reading it.
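The reading guidance above amounts to a small lookup from score to reading mode. The sketch below is illustrative only: the function name and the default cut-offs of 90 and 60 are assumptions, anything between the two cut-offs is treated as the medium band, and a real system would take its thresholds from its own calibration.

```python
def reading_guidance(score, high=90.0, low=60.0):
    """Map a score in percent to a (band, advice) pair.

    `high` and `low` are placeholder thresholds, not calibrated values.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= high:
        return ("high", "Trust the convergent claims; skim the dissents.")
    if score >= low:
        return ("medium", "Read convergent and divergent claims carefully.")
    return ("low", "Treat the output as a map of disagreement, not an answer.")


for s in (96, 71, 48):
    band, advice = reading_guidance(s)
    print(f"{s}% -> {band}: {advice}")
```

Note that every band still tells the user to read something; the score changes how much reading is needed, not whether it is needed.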

Practical examples

A user asks about a well-documented historical fact. The panel produces a score of 96%. The convergent claims include dates, names, and basic context. The user reads the answer with confidence, and notices that one model cited a source the others omitted. The high score made the read efficient.

A user asks about a recent regulation. The panel produces a score of 71%. The convergent claims cover the regulation's general framework; the divergent claims cover its specific application to common cases. The user reads carefully and brings the open questions to a professional. The score told them where to focus.

A user asks about a topic the panel knows poorly. The panel produces a score of 48%. The divergent claims sprawl across multiple framings. The user treats the output as an introduction to the topic's contested landscape, not as an answer to act on. The low score did its job — it kept the user from over-relying on weak collective evidence.

Common misconceptions

"A high score means the answer is true." It means the panel converged. Convergence raises confidence in correctness; it does not guarantee it.

"A low score means the system is bad." It usually means the underlying question is contested, the topic is narrow, or the panel has uneven coverage. The low score is honest reporting.

"All scores are comparable across questions." Not necessarily. A score on a factual question can be compared to other scores on factual questions. Cross-domain comparison requires per-domain calibration.

"The user should always pick the high-score answers." The user should always read the divergent claims even when the score is high — they often contain the marginal information that the convergence missed.

Related concepts

AI consensus is the broader practice from which the score is derived. Multi-model verification is the engineering that produces the score. AI disagreement is the qualitative shape of the score's lower end. AI trust is the broader framing the score contributes to. AI truth-finding is the epistemic question the score helps answer.

Frequently asked questions

Is the score the probability that the answer is correct? No. It is the strength of the multi-model agreement signal. Calibration ties it to correctness rates, but it is not a direct truth probability.

Can the score be wrong? The score is a measurement; it cannot be "wrong" in isolation. It can be miscalibrated: a system that reports a score of 90% on outputs that are correct 75% of the time is miscalibrated and should be corrected.

Should I act on a 95% score the same way as on a 70% score? No. A 95% score warrants reading the dissents quickly and acting on the convergence. A 70% score warrants reading both convergence and dissents carefully before acting.

Does the score replace reading the output? No. It is a guide to how to read it, not a substitute for reading it.