Built on research from the Scoring Problem paper — where we discovered that LLM benchmark results change by 3.5× depending on unreported methodology choices.
Upload any dataset.
Get three-layer verified results.
Run multi-model evaluations with LLM judging, web search verification, and human adjudication — then detect scoring problems automatically.
Three-layer verification pipeline
What is the Universal AI Evaluation Suite?
A free research tool that lets you test any AI model on any dataset — and see whether the results change depending on how you score them. Most benchmarks report a single hallucination rate. We show you the full picture: what happens when you change the judge, the scoring method, or verify answers with a search engine.
How it works
Upload
Upload any JSON or CSV file with questions and ground truth answers. We auto-detect TruthfulQA, HaluEval, and common benchmark formats.
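For reference, a minimal loader sketch. The field names `question` and `ground_truth` here are illustrative only — the suite auto-detects common schemas, so your columns don't need these exact names:

```python
import csv
import io
import json

# Illustrative minimal schema: each row needs a question and a ground-truth answer.
REQUIRED_FIELDS = {"question", "ground_truth"}

def load_dataset(text, fmt):
    """Parse a JSON or CSV payload into a list of row dicts and check the schema."""
    if fmt == "json":
        rows = json.loads(text)
    else:  # assume CSV with a header row
        rows = list(csv.DictReader(io.StringIO(text)))
    missing = REQUIRED_FIELDS - set(rows[0])
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return rows

sample = '[{"question": "Capital of France?", "ground_truth": "Paris"}]'
rows = load_dataset(sample, "json")
print(rows[0]["ground_truth"])  # Paris
```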
Configure
Choose 1–6 response models (Claude, GPT-4o, Gemini, DeepSeek, Llama, Mistral). Select evaluation mode: single model, EBP governance, or multi-model consensus.
Evaluate
Models respond. Up to 3 LLM judges score each response. Ambiguous items are auto-verified via web search. Remaining items go to human review.
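The cascade above can be sketched roughly as follows. The function names and verdict labels are illustrative, not the suite's actual API:

```python
# Hypothetical cascade: LLM judges first, web search for unresolved items,
# human review for whatever remains.
def verify(item, judges, search_check):
    verdicts = [judge(item) for judge in judges]
    if len(set(verdicts)) == 1 and verdicts[0] != "ambiguous":
        return verdicts[0], "llm_judge"       # unanimous, non-ambiguous
    searched = search_check(item)
    if searched is not None:
        return searched, "web_search"         # search resolved it
    return None, "human_review"               # queue for human adjudication

# Toy judges and search check, for illustration only
judges = [lambda it: "truthful", lambda it: "hallucination"]
result = verify({"q": "..."}, judges, lambda it: "truthful")
print(result)  # judges disagree, so search resolves: ('truthful', 'web_search')
```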
Discover
See hallucination rates under 3 scoring regimes. Compare judges. Identify which indicators predict failure (PLS analysis). Export for your paper.
10 experiments you can run in 30 minutes
Each starts with your dataset and three clicks.
Scoring problem replication
Does your benchmark's hallucination rate change by 2–3× depending on scoring method?
Judge agreement study
How often do GPT-4o and Claude agree when judging the same responses?
Model size vs judge accuracy
Does a bigger judge model produce more reliable evaluations?
Multilingual scoring bias
Do LLM judges produce more ambiguous verdicts in non-English languages?
Domain-specific hallucination
Which domain — medical, legal, financial — hallucinates most?
Search verification at scale
Can Google resolve what the best LLM judge cannot?
Adversarial prompt detection
Does multi-model consensus catch attacks that single models miss?
Chain-of-thought effect
Does step-by-step reasoning reduce hallucination on factual questions?
Human disagreement taxonomy
When expert annotators disagree, what type of ambiguity causes it?
Epistemic routing hypothesis
How many benchmark questions could a search engine answer better than an LLM?
Three-layer verification
LLM Judge
1–3 independent LLM judges classify each response as truthful, hallucination, or ambiguous. Multi-judge mode auto-detects disagreements.
Web Search
Ambiguous items are automatically verified via Google. In our tests, search resolved 90–95% of items that LLM judges couldn't classify.
Human Review
Remaining items go to a labelling interface with shareable links. Supports 1–5 annotators with inter-annotator agreement (Cohen's κ).
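For reference, Cohen's κ compares observed annotator agreement against the agreement expected by chance. A minimal two-annotator sketch (not the suite's implementation):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: product of each annotator's marginal label frequencies
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["truthful", "truthful", "hallucination", "ambiguous"]
ann2 = ["truthful", "hallucination", "hallucination", "ambiguous"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.636
```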
Every run includes
Scoring Problem Detection
Auto-computes hallucination rates under Conservative, Aggressive, and Exclude regimes. Flags when model rankings change across conditions.
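A minimal sketch of the three regimes, under one plausible reading of the names (Conservative counts ambiguous verdicts as truthful, Aggressive counts them as hallucinations, Exclude drops them from the denominator — the suite's exact definitions may differ):

```python
def hallucination_rates(verdicts):
    """Hallucination rate under three scoring regimes (illustrative definitions)."""
    n = len(verdicts)
    h = verdicts.count("hallucination")
    a = verdicts.count("ambiguous")
    return {
        "conservative": h / n,                               # ambiguous -> truthful
        "aggressive": (h + a) / n,                           # ambiguous -> hallucination
        "exclude": h / (n - a) if n > a else float("nan"),   # ambiguous dropped
    }

v = ["truthful"] * 6 + ["hallucination"] * 2 + ["ambiguous"] * 2
print(hallucination_rates(v))
# {'conservative': 0.2, 'aggressive': 0.4, 'exclude': 0.25}
```

With just 20% of items ambiguous, the reported rate already doubles between regimes — the scoring problem in miniature.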
PLS Indicator Analysis
Partial Least Squares regression identifies which variables in your dataset actually predict hallucination. VIP scores show what matters.
Export for Publication
JSON with full methodology metadata, CSV for analysis, and auto-generated summary statistics ready for your paper's supplementary materials.
Powered by
Built on the EBP multi-model consensus engine at eptim.ai. All evaluations route through the eptim.ai API — one unified pipeline for model calls, judge evaluation, search verification, and statistical analysis.
Ready to evaluate?
Upload your dataset and get rigorous, three-layer verified results in minutes.