EU researchers are warning about the challenges of measuring AI capabilities and are urging regulators to ensure that the performance indicators claimed by AI companies are accurate, reports 24brussels.
A new paper released last week by the Commission’s Joint Research Centre finds that current AI benchmarks overstate what these technologies can do. The authors argue that many of the proprietary tools used to compare AI models are easily gamed and do not measure the right things.
Companies in the AI sector use these benchmarks to quantify their models’ performance on specific tasks. For instance, OpenAI’s recent testing of its GPT-5 model shows it outperforming its predecessor at abstaining from answering unanswerable questions.
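To make that kind of metric concrete, here is a minimal sketch of how an abstention-style benchmark score might be computed. Nothing in it comes from OpenAI’s actual evaluation: the questions, model outputs, and scoring rule are all invented for illustration.

```python
# Minimal sketch of an abstention-style benchmark score.
# All data and the scoring rule below are hypothetical; real
# benchmarks are far larger and more carefully constructed.

# Invented evaluation items: each has a question and a flag
# marking whether it is actually answerable.
eval_items = [
    {"question": "What is the capital of France?", "answerable": True},
    {"question": "What did Napoleon eat on 3 March 1807?", "answerable": False},
]

# Invented model outputs; the string "ABSTAIN" means the model
# declined to answer.
model_outputs = ["Paris", "ABSTAIN"]

def abstention_score(items, outputs):
    """Fraction of unanswerable questions the model correctly declined."""
    unanswerable = [
        out for item, out in zip(items, outputs) if not item["answerable"]
    ]
    if not unanswerable:
        return 0.0
    correct = sum(1 for out in unanswerable if out == "ABSTAIN")
    return correct / len(unanswerable)

print(f"Abstention rate: {abstention_score(eval_items, model_outputs):.0%}")
```

Even a simple score like this is easy to game: a model that abstained on every question would hit 100 percent here while being useless in practice, which is exactly the kind of gap between a headline benchmark number and real-world capability the researchers are warning about.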
However, the EU researchers stress that regulators should scrutinize how well these benchmarking tools actually work.
Benchmarking AI is a pressing issue for the EU, because its regulatory framework for artificial intelligence depends on reliably assessing what models can do in different contexts. Under the EU’s AI legislation, large models can be classified as posing systemic risk on the basis of benchmarks that establish their “high impact capabilities.”
The law gives the Commission the authority to define these terms through delegated legislation, a task the EU’s executive has yet to take up.
Meanwhile, last Friday, the U.S. government launched a suite of evaluation tools intended for use by government agencies to assess AI systems. The U.S. AI Action Plan articulates a commitment to bolster American leadership in the AI field.
Which AI benchmarks to trust?
The EU researchers urge policymakers to ensure that benchmarks assess capabilities relevant to real-world applications rather than narrow tasks alone. Benchmarks should also be well documented and transparent, stating clearly what is measured and how it is evaluated, and should incorporate diverse cultural contexts.
Additionally, the paper highlights that many existing benchmarks disproportionately emphasize the English language.
“We especially identify a need for new ways of signaling what benchmarks to trust,” the authors assert.
If this is done well, the researchers suggest, policymakers have an opportunity to establish a new kind of “Brussels effect.”
Risto Uuk, head of EU policy and research at the AI-focused think tank Future of Life Institute, expressed agreement with the paper’s concerns, suggesting that the EU should mandate third-party evaluations and support the establishment of a comprehensive AI evaluation ecosystem.
“Improvements are necessary, but evaluating capabilities and other aspects of risks and benefits is crucial, and simply relying on vibes and anecdotes is not enough,” he concluded.