Large Language Models Are Memorizing the Datasets Meant to Test Them


'Robot cheating in an exam' - ChatGPT-4o and Adobe Firefly

If you rely on AI to recommend what to watch, read, or buy, new research indicates that some systems may be drawing those results from memory rather than skill: instead of learning to make useful suggestions, the models often recall items from the datasets used to evaluate them, leading to overestimated performance and recommendations that may be outdated or poorly matched to the user.

 

In machine learning, a test split is used to see whether a trained model has learned to solve problems that are similar, but not identical, to the material it was trained on.

So if a new AI ‘dog-breed recognition' model is trained on a dataset of 100,000 pictures of dogs, the data will usually be divided in an 80/20 split: 80,000 pictures supplied to train the model, and 20,000 pictures held back as material for testing the finished model.

Needless to say, if the AI's training data inadvertently includes the ‘held-back' 20% test split, the model will ace these tests, because it already knows the answers (it has already seen 100% of the domain data). Such scores do not accurately reflect how the model will perform later, on new ‘live' data, in a production context.
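To make the idea concrete, here is a minimal sketch (using scikit-learn and entirely synthetic data, not the dog-breed example above) of how test scores become inflated when the held-back portion leaks back into training:

```python
# Minimal sketch with synthetic data: a high-capacity model evaluated on
# examples it was trained on will look far better than it really is.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10_000, n_informative=5, random_state=0)

# Intended protocol: 80% for training, 20% held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clean = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Contaminated protocol: the 'held-back' 20% accidentally re-enters training.
X_leaky = np.vstack([X_train, X_test])
y_leaky = np.concatenate([y_train, y_test])
leaky = DecisionTreeClassifier(random_state=0).fit(X_leaky, y_leaky)

print("clean test accuracy:       ", clean.score(X_test, y_test))
print("contaminated test accuracy:", leaky.score(X_test, y_test))  # near-perfect, but meaningless
```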

Movie Spoilers

The problem of AI cheating on its exams has grown in step with the scale of the models themselves. Because today's systems are trained on vast, indiscriminate web-scraped corpora such as Common Crawl, the possibility that benchmark datasets (i.e., the held-back 20%) slip into the training mix is no longer an edge case, but the default – a syndrome known as data contamination; and at this scale, the manual curation that could catch such errors is logistically impossible.

This case is explored in a new paper from Italy's Politecnico di Bari, where the researchers focus on the outsized role of a single movie recommendation dataset, MovieLens-1M, which they argue has been partially memorized by several leading AI models during training.

Because this particular dataset is so widely used in the testing of recommender systems, its presence in the models’ memory potentially makes those tests meaningless: what appears to be intelligence may in fact be simple recall, and what looks like an intuitive recommendation skill may just be a statistical echo reflecting earlier exposure.

The authors state:

‘Our findings demonstrate that LLMs possess extensive knowledge of the MovieLens-1M dataset, covering items, user attributes, and interaction histories. Notably, a simple prompt enables GPT-4o to recover nearly 80% of [the names of most of the movies in the dataset].

‘None of the examined models are free of this knowledge, suggesting that MovieLens-1M data is likely included in their training sets. We observed similar trends in retrieving user attributes and interaction histories.'

The brief new paper is titled Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M, and comes from six Politecnico researchers. The pipeline to reproduce their work has been made available on GitHub.

Method

To understand whether the models in question were truly learning or simply recalling, the researchers began by defining what memorization means in this context, and then tested whether a model was able to retrieve specific pieces of information from the MovieLens-1M dataset when prompted in just the right way.

If a model was shown a movie’s ID number and could produce its title and genre, that counted as memorizing an item; if it could generate details about a user (such as age, occupation, or zip code) from a user ID, that also counted as user memorization; and if it could reproduce a user’s next movie rating from a known sequence of prior ones, it was taken as evidence that the model may be recalling specific interaction data, rather than learning general patterns.

Each of these forms of recall was tested using carefully written prompts, crafted to nudge the model without giving it new information. The more accurate the response, the more likely it was that the model had already encountered that data during training:

Zero-shot prompting for the evaluation protocol used in the new paper. Source: https://arxiv.org/pdf/2505.10212

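As an illustration only (the wording below is a plausible reconstruction, not the paper's actual prompt, which is available in the authors' repository), a zero-shot probe for item memorization might simply hand the model a MovieID and ask it to complete the record in MovieLens-1M's 'MovieID::Title::Genres' format:

```python
# Hypothetical zero-shot probe (not the paper's exact wording): the model is
# given only a MovieID and asked to complete the movies.dat record from memory.
def zero_shot_item_prompt(movie_id: int) -> str:
    return (
        "Records in the MovieLens-1M movies.dat file are formatted as "
        "'MovieID::Title::Genres'.\n"
        f"Complete the record for MovieID {movie_id}:\n"
        f"{movie_id}::"
    )

# A response such as "Toy Story (1995)::Animation|Children's|Comedy" for
# MovieID 1 would count as a successful recall.
print(zero_shot_item_prompt(1))
```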

Data and Tests

To curate a suitable dataset, the authors surveyed recent papers from two of the field’s major conferences, ACM RecSys 2024 and ACM SIGIR 2024. MovieLens-1M appeared most often, cited in just over one in five submissions. Since earlier studies had reached similar conclusions, this was not a surprising result, but rather a confirmation of the dataset’s dominance.

MovieLens-1M consists of three files: Movies.dat, which lists movies by ID, title, and genre; Users.dat, which maps user IDs to basic biographical fields; and Ratings.dat, which records who rated what, and when.

To find out whether this data had been memorized by large language models, the researchers turned to prompting techniques first introduced in the paper Extracting Training Data from Large Language Models, and later adapted in the subsequent work Bag of Tricks for Training Data Extraction from Language Models.

The method is direct: pose a question that mirrors the dataset format and see if the model answers correctly. Zero-shot, Chain-of-Thought, and few-shot prompting were tested, and it was found that the last method, in which the model is shown a few examples, was the most effective. Though more elaborate approaches might yield higher recall, this was considered sufficient to reveal what had been remembered.

Few-shot prompt used to test whether a model can reproduce specific MovieLens-1M values when queried with minimal context.

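A sketch of the few-shot variant, under the same caveat that the exact wording used in the paper lives in the authors' repository: a handful of genuine movies.dat records are shown as examples, and the target ID is left for the model to complete:

```python
# Hypothetical few-shot probe: show a few real movies.dat records, then leave
# the target record for the model to complete from memory.
EXAMPLE_RECORDS = [
    "1::Toy Story (1995)::Animation|Children's|Comedy",
    "2::Jumanji (1995)::Adventure|Children's|Fantasy",
]

def few_shot_item_prompt(movie_id: int) -> str:
    shots = "\n".join(EXAMPLE_RECORDS)
    return (
        "Below are records from the MovieLens-1M movies.dat file, "
        "formatted as 'MovieID::Title::Genres'. Complete the last one.\n"
        f"{shots}\n"
        f"{movie_id}::"
    )

print(few_shot_item_prompt(260))  # scored as memorized if the true title and genres come back
```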

To measure memorization, the researchers defined three forms of recall: item, user, and interaction. These tests examined whether a model could retrieve a movie title from its ID, generate user details from a UserID, or predict a user's next rating based on earlier ones. Each was scored using a coverage metric* that reflected how much of the dataset could be reconstructed through prompting.
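In code, the coverage metric described in the footnote below amounts to a simple ratio; the sketch here assumes exact string matching, which may be stricter than whatever matching the authors actually apply:

```python
# Coverage sketch: the share of dataset entries the model reproduces when
# prompted. Exact matching is assumed here for simplicity.
def coverage(model_answers: dict[int, str], ground_truth: dict[int, str]) -> float:
    hits = sum(
        1 for item_id, record in ground_truth.items()
        if model_answers.get(item_id) == record
    )
    return hits / len(ground_truth)

# e.g. 800 correct recalls out of 1,000 entries -> coverage of 0.8 (80%)
```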

The models tested were GPT-4o; GPT-4o mini; GPT-3.5 turbo; Llama-3.3 70B; Llama-3.2 3B; Llama-3.2 1B; Llama-3.1 405B; Llama-3.1 70B; and Llama-3.1 8B. All were run with temperature set to zero, top_p set to one, and both frequency and presence penalties disabled. A fixed random seed ensured consistent output across runs.
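Those decoding settings map directly onto standard API parameters; a sketch of what such a call might look like with the OpenAI Python client (the prompt and the seed value here are placeholders, not taken from the paper):

```python
# Illustrative call with the decoding settings described above (the prompt and
# seed value are placeholders, not taken from the paper).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Complete the record for MovieID 260:\n260::"}],
    temperature=0,        # deterministic, greedy-style decoding
    top_p=1,              # no nucleus-sampling truncation
    frequency_penalty=0,  # both penalties disabled
    presence_penalty=0,
    seed=42,              # fixed seed for repeatable output across runs
)
print(response.choices[0].message.content)
```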

Proportion of MovieLens-1M entries retrieved from movies.dat, users.dat, and ratings.dat, with models grouped by version and sorted by parameter count.


To probe how deeply MovieLens-1M had been absorbed, the researchers prompted each model for exact entries from the dataset’s three (aforementioned) files: Movies.dat, Users.dat, and Ratings.dat.

Results from the initial tests, shown above, reveal sharp differences not only between GPT and Llama families, but also across model sizes. While GPT-4o and GPT-3.5 turbo recover large portions of the dataset with ease, most open-source models recall only a fraction of the same material, suggesting uneven exposure to this benchmark in pretraining.

These are not small margins. Across all three files, the strongest models did not simply outperform weaker ones, but recalled entire portions of MovieLens-1M.

In the case of GPT-4o, the coverage was high enough to suggest that a nontrivial share of the dataset had been directly memorized.

The authors state:

‘Our findings demonstrate that LLMs possess extensive knowledge of the MovieLens-1M dataset, covering items, user attributes, and interaction histories.

‘Notably, a simple prompt enables GPT-4o to recover nearly 80% of MovieID::Title records. None of the examined models are free of this knowledge, suggesting that MovieLens-1M data is likely included in their training sets.

‘We observed similar trends in retrieving user attributes and interaction histories.'

Next, the authors tested for the impact of memorization on recommendation tasks by prompting each model to act as a recommender system. To benchmark performance, they compared the output against seven standard methods: UserKNN; ItemKNN; BPRMF; EASER; LightGCN; MostPop; and Random.

The MovieLens-1M dataset was split 80/20 into training and test sets, using a leave-one-out sampling strategy to simulate real-world usage. The metrics used were Hit Rate (HR@n) and nDCG@n:
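For reference, with a leave-one-out split each user has exactly one held-out item, so both metrics reduce to simple per-user checks; the following is a minimal sketch, not the authors' implementation:

```python
import math

# Minimal sketch of HR@n and nDCG@n under leave-one-out: each user has one
# held-out item, and the recommender returns a ranked list of candidates.
def hit_rate_at_n(ranked: list[str], held_out: str, n: int) -> float:
    return 1.0 if held_out in ranked[:n] else 0.0

def ndcg_at_n(ranked: list[str], held_out: str, n: int) -> float:
    # With a single relevant item the ideal DCG is 1, so nDCG reduces to
    # 1 / log2(rank + 1) whenever the item appears within the cutoff.
    if held_out in ranked[:n]:
        rank = ranked.index(held_out) + 1
        return 1.0 / math.log2(rank + 1)
    return 0.0

# Averaging each score over all test users gives the system-level HR@n and nDCG@n.
```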

Recommendation accuracy on standard baselines and LLM-based methods. Models are grouped by family and ordered by parameter count. Bold values indicate the highest score within each group.


Here several large language models outperformed traditional baselines across all metrics, with GPT-4o establishing a wide lead in every column, and even mid-sized models such as GPT-3.5 turbo and Llama-3.1 405B consistently surpassing benchmark methods such as BPRMF and LightGCN.

Among smaller Llama variants, performance varied sharply, but Llama-3.2 3B stands out, with the highest HR@1 in its group.

The results, the authors suggest, indicate that memorized data can translate into measurable advantages in recommender-style prompting, particularly for the strongest models.

In an additional observation, the researchers continue:

‘Although the recommendation performance appears outstanding, comparing Table 2 with Table 1 reveals an interesting pattern. Within each group, the model with higher memorization also demonstrates superior performance in the recommendation task.

‘For example, GPT-4o outperforms GPT-4o mini, and Llama-3.1 405B surpasses Llama-3.1 70B and 8B.

‘These results highlight that evaluating LLMs on datasets leaked in their training data may lead to overoptimistic performance, driven by memorization rather than generalization.'

Regarding the impact of model scale on this issue, the authors observed a clear correlation between size, memorization, and recommendation performance, with larger models not only retaining more of the MovieLens-1M dataset, but also performing more strongly in downstream tasks.

Llama-3.1 405B, for example, showed an average memorization rate of 12.9%, while Llama-3.1 8B retained only 5.82%. This nearly 55% reduction in recall corresponded to a 54.23% drop in nDCG and a 47.36% drop in HR across evaluation cutoffs.

The pattern held throughout – where memorization decreased, so did apparent performance:

‘These findings suggest that increasing the model scale leads to greater memorization of the dataset, resulting in improved performance.

‘Consequently, while larger models exhibit better recommendation performance, they also pose risks related to potential leakage of training data.'

The final test examined whether memorization reflects the popularity bias baked into MovieLens-1M. Items were grouped by frequency of interaction, and the chart below shows that larger models consistently favored the most popular entries:

Item coverage by model across three popularity tiers: top 20% most popular; middle 20% moderately popular; and the bottom 20% least interacted items.


GPT-4o retrieved 89.06% of top-ranked items but only 63.97% of the least popular. GPT-4o mini and smaller Llama models showed much lower coverage across all bands. The researchers state that this trend suggests that memorization not only scales with model size, but also amplifies preexisting imbalances in the training data.

They continue:

‘Our findings reveal a pronounced popularity bias in LLMs, with the top 20% of popular items being significantly easier to retrieve than the bottom 20%.

‘This trend highlights the influence of the training data distribution, where popular movies are overrepresented, leading to their disproportionate memorization by the models.'
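The tiering itself is straightforward to reproduce. Below is a sketch of how items might be bucketed by interaction frequency, assuming the standard ratings.dat format of 'UserID::MovieID::Rating::Timestamp', with coverage then computed separately inside each band:

```python
from collections import Counter

# Sketch of the popularity-tier split: count interactions per movie in
# ratings.dat, then take the top, middle, and bottom 20% of items by frequency.
def popularity_tiers(ratings_path: str) -> dict[str, list[str]]:
    counts = Counter()
    with open(ratings_path, encoding="latin-1") as f:  # latin-1 safely covers the MovieLens-1M files
        for line in f:
            _, movie_id, _, _ = line.strip().split("::")
            counts[movie_id] += 1

    ranked = [movie for movie, _ in counts.most_common()]  # most to least popular
    k = len(ranked) // 5                                    # 20% of all items
    mid_start = (len(ranked) - k) // 2
    return {
        "top_20": ranked[:k],
        "middle_20": ranked[mid_start:mid_start + k],
        "bottom_20": ranked[-k:],
    }

# Coverage can then be measured separately within each tier to expose popularity bias.
```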

Conclusion

The dilemma is no longer novel: as training sets grow, the prospect of curating them diminishes in inverse proportion. MovieLens-1M, perhaps among many others, enters these vast corpora without oversight, anonymous amidst the sheer volume of data.

The problem repeats at every scale and resists automation. Any solution demands not just effort but human judgment – the slow, fallible kind that machines cannot supply. In this respect, the new paper offers no way forward.

 

* A coverage metric in this context is a percentage that shows how much of the original dataset a language model is able to reproduce when asked the right kind of question. If a model is prompted with a movie ID and responds with the correct title and genre, that counts as a successful recall. The total number of successful recalls is then divided by the total number of entries in the dataset to produce a coverage score. For example, if a model correctly returns information for 800 out of 1,000 items, its coverage would be 80 percent.

First published Friday, May 16, 2025
