Built on research from the Scoring Problem paper, which found that LLM benchmark results change by 3.5× depending on unreported methodology choices.

Universal AI Evaluation Suite

Upload any dataset.
Get three-layer verified results.

Run multi-model evaluations with LLM judging, web search verification, and human adjudication — then detect scoring problems automatically.

LLM Judge
Web Search
Human Review

Three-layer verification pipeline

What is the Universal AI Evaluation Suite?

A free research tool that lets you test any AI model on any dataset — and see whether the results change depending on how you score them. Most benchmarks report a single hallucination rate. We show you the full picture: what happens when you change the judge, the scoring method, or verify answers with a search engine.

How it works

1

Upload

Upload any JSON or CSV file with questions and ground truth answers. We auto-detect TruthfulQA, HaluEval, and common benchmark formats; see the sketch after these steps for a minimal custom file.

2

Configure

Choose 1–6 response models (Claude, GPT-4o, Gemini, DeepSeek, Llama, Mistral). Select evaluation mode: single model, EBP governance, or multi-model consensus.

3

Evaluate

Models respond. Up to 3 LLM judges score each response. Ambiguous items are auto-verified via web search. Remaining items go to human review.

4

Discover

See hallucination rates under 3 scoring regimes. Compare judges. Identify which indicators predict failure (PLS analysis). Export for your paper.
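For reference, a minimal custom upload might look like the sketch below. The field names (question, ground_truth, category) are illustrative assumptions, not the importer's documented schema; TruthfulQA and HaluEval files can be uploaded in their native layouts.

```python
import json

# Illustrative upload format (field names are assumptions, not the tool's documented schema):
# one record per question, with the reference answer the judges will score against.
dataset = [
    {"question": "What is the capital of Australia?",
     "ground_truth": "Canberra",
     "category": "geography"},
    {"question": "Who wrote The Master and Margarita?",
     "ground_truth": "Mikhail Bulgakov",
     "category": "literature"},
]

with open("my_benchmark.json", "w") as f:
    json.dump(dataset, f, indent=2)
```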

10 experiments you can run in 30 minutes

Each one starts with your dataset and 3 clicks.

1

Scoring problem replication

Does your benchmark's hallucination rate change by 2–3× depending on scoring method?

2

Judge agreement study

How often do GPT-4o and Claude agree when judging the same responses?

3

Model size vs judge accuracy

Does a bigger judge model produce more reliable evaluations?

4

Multilingual scoring bias

Do LLM judges produce more ambiguous verdicts in non-English languages?

5

Domain-specific hallucination

In which domain (medical, legal, or financial) do models hallucinate most?

6

Search verification at scale

Can Google resolve what the best LLM judge cannot?

7

Adversarial prompt detection

Does multi-model consensus catch attacks that single models miss?

8

Chain-of-thought effect

Does step-by-step reasoning reduce hallucination on factual questions?

9

Human disagreement taxonomy

When expert annotators disagree, what type of ambiguity causes it?

10

Epistemic routing hypothesis

How many benchmark questions could a search engine answer better than an LLM?

Three-layer verification

LLM Judge

1–3 independent LLM judges classify each response as truthful, hallucination, or ambiguous. Multi-judge mode auto-detects disagreements.

Web Search

Ambiguous items are automatically verified via Google. In our tests, search resolved 90–95% of items that LLM judges couldn't classify.

Human Review

Remaining items go to a labelling interface with shareable links. Supports 1–5 annotators with inter-annotator agreement (Cohen's κ).
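As a quick illustration of the agreement statistic, here is a minimal sketch of Cohen's κ for two annotators, computed with scikit-learn; it is not the suite's internal code.

```python
from sklearn.metrics import cohen_kappa_score

# Verdicts from two human annotators on the items that reached human review.
annotator_a = ["truthful", "hallucination", "ambiguous", "truthful", "hallucination"]
annotator_b = ["truthful", "hallucination", "truthful",  "truthful", "hallucination"]

# kappa = (observed agreement - chance agreement) / (1 - chance agreement);
# 1.0 is perfect agreement, 0.0 is what random labelling would produce.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Cohen's κ is defined for a pair of annotators; with three or more, pairwise κ values or an extension such as Fleiss' κ are the usual choice.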

Every run includes

Scoring Problem Detection

Auto-computes hallucination rates under Conservative, Aggressive, and Exclude regimes. Flags when model rankings change across conditions.
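A minimal sketch of the arithmetic behind the three regimes, assuming Conservative counts ambiguous verdicts as non-hallucinations, Aggressive counts them as hallucinations, and Exclude drops them from the denominator (the suite's exact definitions may differ):

```python
from collections import Counter

# Per-item verdicts after judging: "truthful", "hallucination", or "ambiguous".
verdicts = ["truthful"] * 70 + ["hallucination"] * 20 + ["ambiguous"] * 10
counts = Counter(verdicts)
n = len(verdicts)
h, a = counts["hallucination"], counts["ambiguous"]

rates = {
    "conservative": h / n,        # ambiguous counted as not hallucinated -> 0.20
    "aggressive":   (h + a) / n,  # ambiguous counted as hallucinated     -> 0.30
    "exclude":      h / (n - a),  # ambiguous dropped from denominator    -> ~0.22
}
print(rates)
```

With the same 100-item run, the reported hallucination rate moves from 20% to 30% purely because of how ambiguous items are counted, which is exactly the divergence the detector flags.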

PLS Indicator Analysis

Partial Least Squares regression identifies which variables in your dataset actually predict hallucination. VIP scores show what matters.
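A rough sketch of the underlying technique: fit a PLS regression from your dataset's indicator columns to a per-item hallucination label, then rank indicators by Variable Importance in Projection (VIP). The code uses scikit-learn and a standard VIP formula; it illustrates the method rather than the suite's implementation.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls: PLSRegression) -> np.ndarray:
    """Variable Importance in Projection for a fitted PLS model."""
    t = pls.x_scores_    # latent scores,  shape (n_samples, n_components)
    w = pls.x_weights_   # X weights,      shape (n_features, n_components)
    q = pls.y_loadings_  # Y loadings,     shape (n_targets, n_components)
    p = w.shape[0]
    # Variance of y explained by each latent component.
    ss = np.sum(t ** 2, axis=0) * np.sum(q ** 2, axis=0)
    w_norm = w / np.linalg.norm(w, axis=0)
    return np.sqrt(p * (w_norm ** 2 @ ss) / ss.sum())

# Toy data: 200 items, 4 candidate indicators, binary hallucination label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # e.g. question length, ambiguity score, ...
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(float)

pls = PLSRegression(n_components=2).fit(X, y)
print(vip_scores(pls))  # indicators with VIP > 1 are conventionally "important"
```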

Export for Publication

JSON with full methodology metadata, CSV for analysis, and auto-generated summary statistics ready for your paper's supplementary materials.

Powered by

Every evaluation runs on the EBP multi-model consensus engine at eptim.ai. All calls route through the eptim.ai API: one unified pipeline for model calls, judge evaluation, search verification, and statistical analysis.

TruthfulQA · HaluEval · Custom JSON · CSV

Ready to evaluate?

Upload your dataset and get rigorous, three-layer verified results in minutes.