Methodology

Bentham Research runs a three-phase operational pipeline designed to set a new standard for evaluating humanistic reasoning in AI systems. Our approach pairs academic depth with technical scale to close the gap between what AI models can do and the rigour the humanities demand.

The Three Operational Phases

Phase 1: Benchmark Creation

We recruit and verify elite humanities scholars across eight priority domains. From these experts we commission peer-reviewed questions, each accompanied by a chain-of-thought rationale grounded in primary sources.

The result is the world's only humanities benchmark authored and quality-controlled exclusively by verified domain experts. While a public leaderboard establishes credibility, a private held-out dataset remains the core licensed commercial asset.

Phase 2: Frontier Model Evaluation

We run the benchmark against every major frontier model. Domain experts review each model's answers for reasoning quality, not just correctness.

This process produces a gap analysis at sub-domain precision, showing where each model fails and why. The results are published on the Bentham Humanities Leaderboard, providing a clear measure of AI humanistic capability.
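As a rough illustration of how expert reviews might roll up into a sub-domain gap analysis, the sketch below groups reviews by model and sub-domain and reports accuracy alongside a reasoning-quality gap. Everything here is an assumption made for illustration: the `ExpertReview` record, the 0.0 to 1.0 rubric scale, the `ceiling` parameter, and the definition of the gap as ceiling minus mean reasoning score are hypothetical and do not describe Bentham's actual scoring method.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ExpertReview:
    """One expert's review of one model answer (hypothetical schema)."""
    model: str
    sub_domain: str
    correct: bool            # was the final answer right?
    reasoning_score: float   # assumed rubric score between 0.0 and 1.0


def gap_analysis(reviews, ceiling=1.0):
    """Summarise reviews per (model, sub_domain).

    The reasoning gap is defined here, as an assumption, as the distance
    between the rubric ceiling and the mean expert reasoning score.
    """
    grouped = defaultdict(list)
    for review in reviews:
        grouped[(review.model, review.sub_domain)].append(review)

    report = {}
    for key, items in grouped.items():
        reasoning = mean(r.reasoning_score for r in items)
        report[key] = {
            "accuracy": mean(1.0 if r.correct else 0.0 for r in items),
            "reasoning": reasoning,
            "reasoning_gap": ceiling - reasoning,
        }
    return report


# Illustrative data only; these are not real evaluation results.
reviews = [
    ExpertReview("model-a", "Ethics", True, 0.5),
    ExpertReview("model-a", "Ethics", True, 0.75),
    ExpertReview("model-a", "Jurisprudence", False, 0.25),
]
report = gap_analysis(reviews)
```

Keeping accuracy and reasoning score as separate columns reflects the distinction drawn above: a model can answer correctly in a sub-domain while still showing a sizeable reasoning gap.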

Phase 3: Expert RLHF & Domain Training

The verified domain expert network serves as a premium supplier of expert human feedback to frontier labs. Domain experts progress through six stages:

Question Author
Peer Reviewer
AI Answer Evaluator
RLHF Preference Annotator
Adversarial Red-Teamer
Principle Author

Each stage builds directly on the last, concentrating expertise so that the next generation of AI models receives the highest-quality training signals.
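One way to picture the six-stage progression is as an ordered enumeration in which each stage unlocks only the next. This is a minimal sketch under that assumption; the `ExpertStage` and `next_stage` names are hypothetical, and the strictly linear, one-step-at-a-time rule is an illustrative reading of the list above rather than a documented policy.

```python
from enum import IntEnum
from typing import Optional


class ExpertStage(IntEnum):
    """The six expert stages, numbered in the order they are listed."""
    QUESTION_AUTHOR = 1
    PEER_REVIEWER = 2
    AI_ANSWER_EVALUATOR = 3
    RLHF_PREFERENCE_ANNOTATOR = 4
    ADVERSARIAL_RED_TEAMER = 5
    PRINCIPLE_AUTHOR = 6


def next_stage(current: ExpertStage) -> Optional[ExpertStage]:
    """Return the stage that follows `current`, or None at the final stage."""
    if current is ExpertStage.PRINCIPLE_AUTHOR:
        return None
    return ExpertStage(current + 1)


# Walk the full progression from the first stage to the last.
path = [ExpertStage.QUESTION_AUTHOR]
while (following := next_stage(path[-1])) is not None:
    path.append(following)
```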

The Rating Agency Thesis

Our methodology is built on the belief that AI requires an independent, standardised measure of quality, analogous to how credit rating agencies provide measures of debt quality.

By establishing the Bentham Humanities Score as the reference measure, we create a foundation where enterprise procurement, regulators, and labs can all reference a consistent, expert-validated standard of humanistic reasoning.

Domain Coverage

Our benchmark covers eight priority domains in the humanities, spanning moral, ethical, and philosophical reasoning as well as historical, literary, and legal interpretation:

Philosophy
Ethics
Political Theory
Theology
Historiography
Literary Analysis
Islamic Philosophical Tradition
Jurisprudence