University of Oxford: Who Should Develop Which AI Evaluations?
The memo proposes a framework for assigning AI evaluation development to various actors—government, contractors, third-party organizations, and AI companies—by using four approaches and nine criteria that balance risk, method requirements, and conflicts of interest, while advocating for a market-based ecosystem to support high-quality evaluations.
Summary
This research memo examines the optimal actors for developing AI model evaluations, considering conflicts of interest and expertise requirements. It proposes a taxonomy of four development approaches (government-led, government-contractor collaborations, third-party grants, and direct AI company development) and nine criteria for selecting developers.
The authors suggest a two-step sorting process to identify suitable developers and recommend measures for a market-based ecosystem fostering diverse, high-quality evaluations, emphasizing a balance between public accountability and private-sector efficiency.
The memo also explores challenges like information sensitivity, model access, and the blurred boundaries between evaluation development, execution, and interpretation. Finally, it proposes several strategies for creating a sustainable market for AI model evaluations.
The authors of this document are Lara Thurnherr, Robert Trager, Amin Oueslati, Christoph Winter, Cliodhna Ní Ghuidhir, Joe O'Brien, Jun Shern Chan, Lorenzo Pacchiardi, Anka Reuel, Merlin Stein, Oliver Guest, Oliver Sourbut, Renan Araujo, Seth Donoughe, and Yi Zeng.
Here are five key takeaways from the document:
- A variety of actors could develop AI evaluations, including government bodies, academics, third-party organizations, and AI companies themselves. Each of these actors has different characteristics, strengths, and weaknesses. The document outlines a framework for deciding which actor is best suited to develop a specific AI evaluation, based on risk-related and method-related criteria.
- There are four main approaches to developing AI evaluations: AI Safety Institutes (AISIs) developing evaluations independently, AISIs collaborating with contracted experts, funding third parties to develop evaluations independently, and AI companies developing their own evaluations. Each approach has its own advantages and disadvantages. For instance, while AI companies developing their own evaluations may be cost-effective and leverage in-house expertise, this approach creates a potential conflict of interest.
- Nine criteria can help determine who should develop specific evaluations. These criteria are divided into risk-related and method-related categories. Risk-related criteria include required risk-related skills and expertise, information sensitivity and security clearances, evaluation urgency, and risk prevention incentives. Method-related criteria include the level of model access required, evaluation development costs, required method-related skills and expertise, and verifiability and documentation.
- A market-based ecosystem for AI evaluations is crucial for long-term success. This ecosystem could be supported by measures such as developing and publishing tools, establishing standards and best practices, providing legal certainty and accreditation for third-party evaluators, brokering relationships between third parties and AI companies, and mandating information sharing on evaluation development. Public bodies could also offer funding and computational resources to academic researchers interested in developing evaluations.
- The decision of who should develop a given AI evaluation is complex and context-dependent. The document emphasizes weighing multiple factors: the risk being assessed, the methods required, the capabilities of the potential developers, and the potential for conflicts of interest. It argues that a systematic approach to this decision can improve the overall quality and effectiveness of AI evaluations.