Wout Willy M. Schellaert
Artificial intelligence (AI) is becoming increasingly relevant to our daily lives: countless products and services rely on some form of machine cognition to function. Despite this increased relevance, the evaluation of AI systems, i.e. measuring how well they perform or how intelligent they are, is not in the state we would like it to be. Both in practice and in theory, evaluation falls short in various ways. Performance estimates lose accuracy under distribution shift and test-set contamination. Given a new instance of a task, most evaluation procedures do not provide a granular performance estimate tailored to that instance. And with the advent of general-purpose AI (GPAI), the scope of evaluation has grown considerably, burdening common evaluation methodology with heavy data and logistical demands and leaving unknown the distribution of tasks over which performance must be measured.
Motivated by the importance of prediction in the philosophy of science, its relevance to evaluation practices in other fields such as animal cognition and psychometrics, and its role as a central language of machine learning itself, we frame the evaluation of artificial intelligence as a prediction problem. We develop a formal framework that conceptualises evaluation procedures as learning algorithms producing performance-predicting models from empirical data. This framework then helped us reframe the notion of evaluation in the language of machine learning, providing a solution to the refinement problem by conditioning evaluation results on the input variable, which also partially addresses the distribution-shift problem. For challenges that remain unsolved, e.g. out-of-distribution prediction and the large scope of evaluating GPAI systems, clear analogies to the machine learning literature can be drawn, providing a way forward through scaling up data, focusing on generalisation, or introducing meta-learning for evaluation.
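As an illustrative sketch of this framing (the notation below is ours, not necessarily the thesis's exact formalism), an evaluation procedure $E$ can be written as a learning algorithm mapping empirical evaluation data to an instance-conditional performance predictor:

\[
E : \{(x_i, s_i)\}_{i=1}^{n} \;\mapsto\; \hat{s}, \qquad \hat{s}(x) \approx \mathbb{E}[\, s \mid x \,],
\]

where $x \in \mathcal{X}$ is a task instance, $s \in [0,1]$ is the observed score of the evaluated system on that instance, and $\hat{s}(x)$ is the refined, input-conditioned estimate; the usual aggregate benchmark score corresponds to the special case of a constant predictor $\hat{s}(x) = \bar{s}$.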
Additionally, two empirical studies are presented. The first investigates "assessor models", a newly developed machine learning technique for granularly predicting AI system performance, in an experimental setting focused on large language models (LLMs) and the factors influencing the predictability of their performance. We find that refined instance-level score estimation is possible, that out-of-distribution score estimation is consistently hard, and that using multi-task and multi-system data can improve evaluation accuracy. The second study again investigates language model performance, but now from the perspective of human users predicting scores in their direct interaction with AI systems. Starting from human-derived notions of difficulty, we analyse performance, question avoidance, prompt sensitivity, and human supervision across five different tasks over varying difficulty levels. We also add two other dimensions: scaling language models, i.e. making them larger in parameters and ingested data, and shaping them, i.e. making them more instructable and easier to use. Across more than thirty models from three language model families, we find that while human difficulty correlates well with LLM performance, the difficulty-performance relationship is not step-like enough to support confident prediction. There is no region, however low the difficulty, where performance is perfect, as newer models improve by gaining ground on medium and hard instances rather than on easy ones.
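As a minimal sketch of the assessor-model idea (the features, model choice, and synthetic data below are assumptions for illustration, not the thesis's experimental setup), an assessor is itself a learned model, trained on logged per-instance evaluation results, that predicts the probability that a given system succeeds on a new instance:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical evaluation log: one row per (system, instance) pair, with
# illustrative features: an instance difficulty proxy, prompt length (scaled),
# and system size (log parameter count, scaled).
X = rng.normal(size=(5000, 3))
# Toy ground truth: success is more likely on easy instances and for larger systems.
p_success = 1.0 / (1.0 + np.exp(X[:, 0] - 0.5 * X[:, 2]))
y = rng.binomial(1, p_success)  # 1 = the system answered this instance correctly

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The assessor: maps instance/system features to a success probability,
# i.e. a refined, instance-level performance estimate.
assessor = LogisticRegression().fit(X_train, y_train)
instance_scores = assessor.predict_proba(X_test)[:, 1]

print("Predicted success probability, first 5 test instances:",
      np.round(instance_scores[:5], 2))
print("Aggregate (mean) predicted performance:",
      round(float(instance_scores.mean()), 3))

In this sketch the assessor is a simple logistic regression over made-up features; any probabilistic classifier could play the same role, and the per-instance predictions can still be averaged back into a conventional aggregate score when one is needed.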
We conclude that a predictive interpretation of evaluation brings together different techniques and applications of evaluation. By drawing from the rich literature of machine learning and statistics, we can resolve several fundamental issues in evaluation theory and make progress on others, allowing the science and practice of evaluation to co-evolve more naturally with advances in artificial intelligence.