Quick answer: We tested eight AI image detectors in May 2026 on a 50-image corpus mixing five 2026-era generators (DALL-E 4, Midjourney v7, SDXL Turbo, Flux 1.1 Pro, Google Imagen 3) with authentic photographs. The best single detector hit 78% F1. Multi-model consensus (Satcove Photo Consensus) hit 92%. Flux defeated four of the seven single detectors. No single product is reliable across the full 2026 generator landscape; cross-model verification is the only currently robust answer.
Why This Test Was Necessary
The market for AI image detection in 2026 is full of confident accuracy claims that turn out to mean very specific things. "99.5% accuracy" on a vendor's landing page typically means accuracy on the distribution the detector was tested against — usually one or two generators, often older than the current state of the art. Pass a fresh Flux or Imagen 3 image through the same detector and accuracy can drop below 60%.
Users of these tools — journalists, content moderators, fact-checkers, social platforms — need to know whether the tool they pay for actually works on the images they will face. Vendor marketing does not answer this question honestly. So we ran the test.
The 50-image corpus mixes the five generators in equal proportion, pairs them with images from photographers we know personally (so provenance is guaranteed), and runs each detector at its default settings, the way most users would encounter it.
Methodology
Corpus: 50 images.
- 25 AI-generated — five each from DALL-E 4, Midjourney v7, SDXL Turbo, Flux 1.1 Pro, and Google Imagen 3. Topic distribution: portraits (10), landscapes (5), product shots (5), architectural (5).
- 25 authentic — sourced from photographers in our network with explicit consent. Mix of smartphone shots (10), DSLR studio (5), journalistic candids (5), product photography (5). EXIF metadata preserved.
Tools tested: Sightengine, AI or Not, Hive Moderation, TruthScan, WasItAI, DeepAI Image Detector, Decopy, Satcove Photo Consensus.
Metric: F1 score (harmonic mean of precision and recall) on the binary classification task. Per-generator detection rate also recorded.
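For readers who want to reproduce the metric, here is a minimal sketch. It assumes "AI-generated" is the positive class and that each tool's verdict has been reduced to a boolean. The function name and the sample labels are illustrative and are not taken from the benchmark data.

```python
def precision_recall_f1(truth, predicted):
    """Return (precision, recall, f1) for boolean labels; True means AI-generated."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)        # AI image flagged as AI
    fp = sum(1 for t, p in pairs if not t and p)    # authentic photo flagged as AI
    fn = sum(1 for t, p in pairs if t and not p)    # AI image passed as authentic
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for eight images: four AI-generated, four authentic.
truth = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]
precision, recall, f1 = precision_recall_f1(truth, predicted)
```

In this toy input, one missed AI image and one false alarm pull precision and recall down by the same amount, so F1 moves with both.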
Process: Each image was passed through all eight tools using their default settings. We recorded the verdict, confidence score, and per-generator performance.
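The per-generator bookkeeping can be sketched as follows. This assumes each recorded verdict is stored as a (generator, flagged-as-AI) pair for the AI-generated half of the corpus. The record layout and the sample values are hypothetical, not the benchmark's actual log format.

```python
from collections import defaultdict

def detection_rate_by_generator(records):
    """records: iterable of (generator_name, flagged_as_ai) for AI-generated images only."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for generator, flagged_as_ai in records:
        total[generator] += 1
        flagged[generator] += int(flagged_as_ai)
    # Share of each generator's images that the tool correctly flagged.
    return {name: flagged[name] / total[name] for name in total}

# Made-up verdicts from one hypothetical tool.
sample_records = [
    ("Flux 1.1 Pro", True), ("Flux 1.1 Pro", False),
    ("Flux 1.1 Pro", False), ("Flux 1.1 Pro", True),
    ("DALL-E 4", True), ("DALL-E 4", True),
]
rates = detection_rate_by_generator(sample_records)
```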
Overall Results
| Tool | F1 Score | Best on / notes |
|---|---|---|
| Satcove Photo Consensus | 92% | All generators (multi-model) |
| Hive Moderation | 78% | Portraits, deepfakes |
| Sightengine | 76% | Content-moderation contexts |
| AI or Not | 74% | Older-generator outputs |
| WasItAI | 72% | Older generators |
| TruthScan | 71% | Editorial workflows |
| DeepAI Image Detector | 66% | Cheapest, weakest accuracy |
| Decopy | 63% | High false-positive rate |
The 14-point gap between Satcove Photo Consensus (92%) and the best single detector (Hive at 78%) is large enough to matter for any user making decisions based on the verdicts.
Per-Generator Breakdown
This is the table the vendor pages do not show:
| Generator | Best single detector | Worst single detector | Multi-model |
|---|---|---|---|
| DALL-E 4 | Hive (85%) | Decopy (50%) | 92% |
| Midjourney v7 | Sightengine (82%) | DeepAI (55%) | 95% |
| SDXL Turbo | AI or Not (78%) | Decopy (45%) | 90% |
| Flux 1.1 Pro | Hive (62%) | DeepAI (38%) | 88% |
| Imagen 3 | Sightengine (75%) | WasItAI (52%) | 92% |
Flux 1.1 Pro is the canary. It defeated four of the seven single detectors, and the other three still scored below 70% on Flux output, with Hive's 62% the best of them. Only the multi-model approach (88%) handled it competently. Flux is a 2026-current generator, which means its output is the practical worst case for any detector deployed today.
Detailed Reviews
1. Satcove Photo Consensus — 92% F1
Approach: Six vision-capable AI models (Claude, GPT, Gemini, Mistral, Perplexity, Grok) each assess authenticity independently. The synthesis layer combines their verdicts with an explicit agreement score.
Strengths: Cross-model coverage means no single generator's blind spot dominates. Multi-model disagreement is itself a useful signal — when models split, the image is in contested territory and warrants human review. Native iOS share extension makes verification a two-tap workflow.
Weaknesses: Slower than single detectors (~10s vs ~2s). Per-query cost is higher because six APIs are called instead of one.
Pricing: Included in Satcove Pro at €14.99/mo. Free tier covers three verifications per day.
Verdict: Best overall. The 14-point F1 lead over the best single detector is structural, not marginal.
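The review above describes a synthesis layer that combines independent verdicts with an agreement score. Satcove's actual synthesis logic is not documented here. The sketch below assumes a plain majority vote, with the agreement score defined as the share of models backing the leading verdict. The function name, the labels, and the "contested" outcome are all hypothetical.

```python
from collections import Counter

def synthesize(verdicts):
    """Combine per-model verdicts into one verdict plus an agreement score.

    verdicts: dict mapping a model name to "ai" or "authentic".
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    ranked = Counter(verdicts.values()).most_common()
    top_label, top_votes = ranked[0]
    # A tie between the two leading labels means no majority exists.
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    return {
        "verdict": "contested" if tied else top_label,
        "agreement": top_votes / len(verdicts),
    }

# Model names come from the review; the verdicts themselves are made up.
example = synthesize({
    "Claude": "ai", "GPT": "ai", "Gemini": "ai",
    "Mistral": "ai", "Perplexity": "authentic", "Grok": "authentic",
})
```

A low agreement score is the signal to escalate to human review, which mirrors the point made under Strengths about split verdicts.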
2. Hive Moderation — 78% F1
Approach: Proprietary classifier with deepfake-detection pedigree, multi-attribute scoring.
Strengths: Best single-detector performance on portraits and faces. Frame-by-frame video analysis available. Enterprise-grade API.
Weaknesses: Dropped on landscapes and product shots. Missed three of five Flux images. Enterprise-only pricing.
3. Sightengine — 76% F1
Approach: Multi-attribute model: AI-generation, deepfake detection, manipulation flags.
Strengths: Polished API documentation, fast. Best free-tier UX.
Weaknesses: Optimized for content-moderation use cases. Missed several stylized AI images and 60% of Imagen 3 outputs.
4. AI or Not — 74% F1
Approach: Single classifier with drag-and-drop web UI.
Strengths: Best UX for non-technical users. Free tier with daily quota.
Weaknesses: Generalizes poorly to newer generators. False-negative rate spiked on Flux and Imagen 3 (the two newest models tested).
5. WasItAI — 72% F1
Approach: Browser-based classifier, no signup required.
Strengths: Free, instant access, lightweight UX.
Weaknesses: Trained largely on earlier-generation outputs. Frequently misses images from newer models.
6. TruthScan — 71% F1
Approach: Multi-attribute scoring with provenance signals, aimed at newsroom use.
Strengths: Audit trail useful for journalistic workflows. Image-provenance integration with C2PA standard.
Weaknesses: Slower, mixed performance on Midjourney v7.
7. DeepAI Image Detector — 66% F1
Approach: Single open-source classifier.
Strengths: Free, easy API.
Weaknesses: Lowest accuracy in the test. Missed 60% of Flux images and 50% of SDXL Turbo. Useful only for casual triage.
8. Decopy — 63% F1
Approach: Classifier plus rule-based heuristics.
Strengths: Cheap.
Weaknesses: Lowest precision in the test — flagged six authentic photos as AI. Not recommended for any decision-grade use.
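Six false positives put a ceiling on precision no matter how many AI images a tool catches. Here is a quick sketch of that arithmetic, assuming the 25/25 corpus split from the Methodology section and treating AI-generated as the positive class. The function name is illustrative.

```python
def precision_upper_bound(false_positives, ai_images):
    """Best-case precision: assumes every AI image was flagged (true positives == ai_images)."""
    return ai_images / (ai_images + false_positives)

# Six of the 25 authentic photos were flagged as AI.
false_positive_rate = 6 / 25
# Even with perfect recall on the 25 AI images, precision cannot exceed 25 / 31.
ceiling = precision_upper_bound(6, 25)
```

That gives a false-positive rate of 24% and a best-case precision of about 81%. Decopy's real precision depends on how many AI images it caught, which the review does not list.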
What the Numbers Mean for Users
If you are a casual user checking the occasional suspicious image, any of the top four tools (Satcove Photo Consensus, Hive, Sightengine, AI or Not) will give you a usable verdict on common generators. The 14-point gap matters less when you are running one or two checks.
If you are a journalist, content moderator, or fact-checker running hundreds of verifications a month, the gap compounds. The 14-point F1 lead over the best single detector (92% vs 78%, roughly 18% in relative terms) works out, if per-image accuracy tracks F1, to something like 14 additional correct verdicts per 100 checks against the single-detector baseline. The per-query cost difference is rounding error against that.
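To make the compounding concrete, here is a back-of-envelope sketch. It treats the F1 scores as a stand-in for per-image accuracy, which is a simplification, and the monthly volume of 500 checks is a hypothetical figure rather than a measured one.

```python
def extra_correct_verdicts(monthly_checks, baseline_rate, improved_rate):
    """Expected additional correct verdicts per month when moving from baseline to improved."""
    return monthly_checks * (improved_rate - baseline_rate)

# 78% is the best single detector (Hive) and 92% is the multi-model result.
extra = round(extra_correct_verdicts(500, 0.78, 0.92))
```

At 500 checks a month, that is about 70 verdicts that would otherwise have been wrong.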
If you are running production content-moderation infrastructure (social platforms, dating sites, marketplaces), the conclusion is to layer detectors — never rely on one. Run two or three in parallel and flag images where they disagree for human review. This is essentially what a consensus engine does, automated.
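Here is a minimal sketch of that layering pattern. It assumes each detector is wrapped in a callable that takes image bytes and returns True for "AI-generated". The callables below are placeholder stubs, not real vendor clients, and the unanimous-or-review policy is one possible routing rule among many.

```python
from concurrent.futures import ThreadPoolExecutor

def triage(image_bytes, detectors):
    """Run every detector in parallel and route the image by unanimity.

    detectors: dict of name -> callable(bytes) -> bool (True means AI-generated).
    """
    if not detectors:
        raise ValueError("at least one detector is required")
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        futures = {name: pool.submit(fn, image_bytes) for name, fn in detectors.items()}
        results = {name: future.result() for name, future in futures.items()}
    if all(results.values()):
        decision = "ai"
    elif not any(results.values()):
        decision = "authentic"
    else:
        # Detectors disagree, so escalate to a human reviewer.
        decision = "human_review"
    return decision, results

# Placeholder stubs standing in for three hypothetical vendor clients.
stub_detectors = {
    "detector_a": lambda data: True,
    "detector_b": lambda data: True,
    "detector_c": lambda data: False,
}
decision, results = triage(b"fake image bytes", stub_detectors)
```

Threads suit this case because detector calls are typically network-bound. A production version would also need timeouts and per-detector error handling, which are omitted here.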
The Honest Limits
No tool in 2026, including Satcove's, is reliable enough to act as the sole evidence in a court proceeding or in evidentiary journalism. The current best is ~92% F1 on our corpus, which still leaves a meaningful share of verdicts wrong (roughly 8%, if accuracy tracks F1). For decisions where being wrong has serious cost, the verdict should inform, not decide. Forensic analysis by trained humans remains the standard for high-stakes contexts.
The right framing: AI image detection in 2026 is excellent for triage (filtering thousands of images down to the suspicious ones) and good for casual verification (is this viral image real?). It is not yet a substitute for human forensic analysis in critical contexts.
Try Multi-Model Verification
The simplest way to test the methodology is on an image you have already had verified by a single detector. Pass it through Satcove's free AI image detector and compare the verdicts. If the two agree, the single detector was probably right. If they disagree, the multi-model verdict is the better bet: it led by 14 points of F1 on the corpus we tested.
The deeper write-up of the methodology and the per-generator breakdown is in the AI photo verification 2026 guide.
Benchmark conducted in May 2026 on a 50-image corpus. F1 scores reflect the full mixed corpus. Per-generator breakdowns available on request.