The AI Progress Narrative: Hype vs. Reality
Generative AI companies have masterfully crafted a narrative of rapid, unprecedented progress in artificial intelligence. Just last week, OpenAI unveiled GPT-4.5, touting it as its "largest and best model for chat yet." Earlier in February, Google proclaimed its latest Gemini model "the world’s best AI model." In January, the Chinese company DeepSeek claimed its R1 model was as powerful as OpenAI’s o1 model, which Sam Altman, the CEO of OpenAI, had previously described as "the smartest model in the world." These claims are made with such confidence that they have become a political and economic issue, with vast resources of land, power, and money being allocated to drive the technology forward. Yet there is growing evidence that progress is slowing down, and that the large language model (LLM)-powered chatbot may already be nearing its peak. This raises critical questions: How much is AI actually improving? How much better can it get? These are not idle musings; they are essential to understanding the direction of this technology and the massive investments being made in it. But answering them is nearly impossible, because the very tools used to measure AI progress are flawed.
The Promise of Generalization: A Tricky Metric
Unlike traditional computer programs, which are designed to provide precise answers to specific questions, generative AI is built to generalize. A chatbot should be able to answer questions it hasn’t been explicitly trained on, much like a human student who learns not just that 2 x 3 = 6, but how to multiply any two numbers. This ability to reason and generalize is what AI companies promise will revolutionize science, education, and countless other fields. However, generalization is incredibly difficult to measure, and even harder to prove. To demonstrate the success of their models, companies rely on industry-standard benchmark tests. These tests are designed to include questions the models haven’t seen before, supposedly proving that they’re not just memorizing facts but truly understanding and reasoning.
Benchmark Contamination: A Growing Problem
Over the past two years, however, researchers have published studies revealing a serious problem: many of the benchmark tests used to evaluate these models have been contaminated. Models such as ChatGPT, DeepSeek, Llama, Mistral, Gemini, Phi, and Qwen have all been trained on the text of popular benchmarks, undermining the legitimacy of their scores. It’s akin to a student stealing a math test and memorizing the answers, fooling their teacher into thinking they’ve mastered long division. This phenomenon, known as benchmark contamination, is so widespread that one industry newsletter concluded in October that "Benchmark Tests Are Meaningless." Despite this, AI companies continue to cite these tests as the primary indicators of progress. A spokesperson for Google DeepMind told The Atlantic that the company takes the issue seriously and is actively exploring new ways to evaluate its models. No other company mentioned in this article commented on the issue.
How Benchmark Contamination Happens
Benchmark contamination isn’t necessarily intentional. Most benchmarks are publicly available on the internet, and language models are trained on vast amounts of text harvested from the web. The training datasets are so large that finding and filtering out the benchmarks is extremely challenging. When Microsoft launched a new language model in December, a researcher on the team boasted about "aggressively" removing benchmarks from the training data. Yet, the model’s technical report admitted that their methods were "not effective against all scenarios." One of the most commonly cited benchmarks is the Massive Multitask Language Understanding (MMLU) test, which consists of roughly 16,000 multiple-choice questions covering 57 subjects, including anatomy, philosophy, marketing, nutrition, religion, math, and programming. Over the past year, OpenAI, Google, Microsoft, Meta, and DeepSeek have all advertised their models’ scores on MMLU, but researchers have shown that models from all these companies have been trained on its questions.
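To see why filtering is so hard in practice, here is a minimal sketch of the kind of n-gram overlap check that decontamination pipelines commonly rely on; the window size, function names, and tiny example data are illustrative assumptions, not any company's actual method. A check like this catches verbatim copies but misses paraphrased or reformatted versions of a question, which is one reason such methods can be "not effective against all scenarios."

```python
# A minimal sketch of n-gram-based decontamination (illustrative only, not
# any company's actual pipeline): drop training documents that share a long
# word sequence with a benchmark question.
def ngrams(text: str, n: int) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document: str, benchmark_items: list[str], n: int = 13) -> bool:
    """Return True if the document shares any n-word sequence with a benchmark item."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)

# Hypothetical data: a tiny "web corpus" and one benchmark-style question.
benchmark_items = ["Which of the following best describes the role of mitochondria in a cell?"]
corpus = [
    "Study guide: which of the following best describes the role of mitochondria in a cell?",
    "Mitochondria are organelles that generate most of a cell's chemical energy.",
]
clean_corpus = [doc for doc in corpus if not is_contaminated(doc, benchmark_items, n=8)]
print(len(clean_corpus))  # 1: the verbatim copy is filtered out, the paraphrase survives
```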
The Challenges of Measuring AI Progress
How do researchers know that even "closed" models, like OpenAI’s, have been trained on benchmarks? They use creative techniques to uncover the truth. One research team took questions from MMLU and asked ChatGPT not for the correct answers but for specific incorrect multiple-choice options. ChatGPT was able to provide the exact text of incorrect answers on MMLU 57 percent of the time, something it likely couldn’t do unless it had been trained on the test, as the wrong options are drawn from an effectively infinite space of possible answers. Another team of researchers from Microsoft and Xiamen University investigated GPT-4’s performance on questions from Codeforces programming competitions. GPT-4 performed well on questions published online before September 2021 but struggled with questions published after that date. Since GPT-4 was trained only on data from before September 2021, the researchers suggested that it had memorized the questions, casting doubt on its actual reasoning abilities. Other researchers have shown that GPT-4’s performance on coding questions is better for questions that appear more frequently online, further supporting the hypothesis that it is memorizing rather than reasoning.
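The wrong-option probe is straightforward to reproduce in spirit. The sketch below is a rough approximation of that idea, not the researchers' actual protocol: it assumes the openai Python package and an API key, and the model name and the MMLU-style question are used purely for illustration. The model is asked to reproduce an incorrect answer choice without being shown the choices; an exact match to a held-out wrong option is hard to explain without memorization.

```python
# A rough sketch of the memorization probe described above (not the
# researchers' exact protocol). Assumes the `openai` package and an API key;
# the model name and the MMLU-style item are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Which of the following is the body cavity that contains the pituitary gland?"
held_out_wrong_options = ["Abdominal", "Pleural", "Spinal"]  # the correct answer is "Cranial"

prompt = (
    "The following question appears in the MMLU benchmark.\n"
    f"Question: {question}\n"
    "Without being shown the answer choices, write one of the INCORRECT "
    "choices exactly as it appears in the benchmark. Reply with the choice text only."
)

reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content.strip()

# Wrong answers could have been almost anything, so an exact match to a
# held-out incorrect option is evidence the model has seen the test itself.
print("exact match (likely memorized)" if reply in held_out_wrong_options else "no exact match")
```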
Can Benchmark Contamination Be Solved?
Solving the benchmark contamination problem is no easy task. Some have suggested constantly updating benchmarks with questions based on new information sources, but this approach undermines the very concept of a benchmark, which is meant to provide consistent, stable results for comparison. Others have proposed using platforms like Chatbot Arena, which pits LLMs against one another and lets users decide which model performs better. While this approach is immune to contamination concerns, it is subjective and unstable. Another idea is to use one LLM to evaluate another, but this process is far from reliable. None of these methods provides a confident way to measure LLMs’ ability to generalize.
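To make the head-to-head approach concrete, human votes from pairwise matchups are typically converted into a ranking with a rating system. The toy sketch below uses a standard Elo update rather than Chatbot Arena's exact methodology, and the model names, votes, and K-factor are made up; it also hints at why such rankings can be unstable, since a handful of new votes can reorder the models.

```python
# A toy sketch of turning pairwise human votes into model ratings with a
# standard Elo update. Not Chatbot Arena's exact methodology; the model
# names, votes, and K-factor are illustrative.
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two ratings after one comparison; score_a is 1.0 (A wins), 0.0 (B wins), or 0.5 (tie)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [("model_x", "model_y", 1.0), ("model_x", "model_y", 0.5), ("model_x", "model_y", 1.0)]
for a, b, score_a in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)

print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```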
The AI Industry’s Race for Profit and Progress
Despite the challenges, AI companies have started discussing "reasoning models," but the underlying technology remains largely unchanged since ChatGPT was released in November 2022. LLMs are still word-prediction algorithms, piecing together responses based on the vast amounts of text they’ve been trained on. While ChatGPT may seem to "figure out" the answers to your questions, it’s often hard to tell whether it’s truly reasoning or just drawing from its massive training data. Meanwhile, the AI industry is struggling financially, with companies yet to figure out how to turn a profit from building foundation models. In this context, the narrative of progress is more important than ever. If the benchmarks are meaningless and the models are merely memorizing, the AI industry may be heading for a rude awakening.
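For readers who want to see what "word prediction" means in practice, the minimal sketch below greedily generates one token at a time from a small open model. It assumes the Hugging Face transformers and torch packages, and uses GPT-2 only because it is small and freely downloadable, not because it resembles any frontier model.

```python
# A minimal sketch of next-token prediction, the core loop underlying LLM
# chatbots. Assumes the `transformers` and `torch` packages; GPT-2 is used
# here only because it is small and freely available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):
        logits = model(ids).logits          # scores for every possible next token
        next_id = logits[0, -1].argmax()    # greedily pick the single most likely one
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))  # the model only ever extends the text one token at a time
```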