Reddit’s Lawsuit Against Anthropic May Set New Ground Rules for AI Training

A new front has opened in the battle for how AI companies train their models. Reddit, one of the most widely used social news and forum platforms, has sued AI startup Anthropic for allegedly using its content without permission to train its Claude models.  On the surface, it’s a dispute over data, but the implications run much deeper. As more platforms push back, the AI industry faces a pressing question: can it keep building on data it doesn’t own? It also raises concerns about how AI is licensed and commercialized. Reddit claims that Anthropic accessed its platform more than 100,000 times since July 2024 to scrape user-generated content for AI training, in violation of Reddit’s terms of service. The platform also claims that Anthropic reportedly assured it had blocked its bots from accessing Reddit, but continued to do so anyway.  Anthropic said in a statement that it disagreed with Reddit’s claims and that it intends to “defend ourselves vigorously.” That’s hardly a surprise. Given the scrutiny around AI training practices, companies are unlikely to concede any missteps — at least not publicly. The lawsuit is filed in the California Superior Court in San Francisco, where both companies are based. It's the latest in a growing wave of legal action aimed at how AI companies collect and use data. Over the past couple of years, The New York Times sued OpenAI and Microsoft for allegedly reproducing its articles without authorization, while Getty Images took Stability AI to court for using millions of images without a license.  Reddit is a rich source of training data, but it is also vulnerable. Its content, which includes millions of real and unscripted conversations, offers exactly the kind of human language that large language models (LLMs) are built to mimic. That makes it incredibly useful for AI developers, but also puts it at the center of a growing debate over who really owns the internet’s conversations. However, Reddit isn’t treating its data as a public resource. In fact, last year it signed a $60 million licensing deal with Google, followed by another agreement with OpenAI to give the companies paid access to its data for training AI models. These deals aren’t just about generating revenue, they are part of Reddit’s broader strategy to position its content as a premium asset. Through the lawsuit, Reddit is seeking monetary compensation for what it describes as unauthorized use of its data. Reddit claims that Anthropic’s “commercial exploitation” of its content could be worth billions of dollars.  Reddit is also seeking injunctive relief by asking the court to block Anthropic from continuing to use Reddit data. This could be a serious problem for Anthropic. If the court grants that request, Anthropic might have to strip Reddit data out of its training sets, delete parts of its model weights, and possibly retrain its models from scratch. This is not the first time Anthropic has been at the center of a legal fight over how it uses data. In August 2024, a group of authors filed a class-action lawsuit in California, accusing the company of building a multibillion-dollar business by copying hundreds of thousands of copyrighted books. Just two months later, Universal Music Group sued Anthropic in Tennessee, claiming its models were generating copyrighted song lyrics on a “systematic and widespread” scale. For an AI developer, access to data is make-or-break. Any legal precedent that restricts this access could raise significant barriers to progress. If this lawsuit results in a positive outcome for Reddit, it could have vast implications for the future of AI models. Not only would AI developers need explicit permission or licenses to train on user-generated content, but the model pipelines will also get more expensive and slower to develop. It could also mean that AI models could potentially be narrower in scope.  If the outcome goes in favor of Anthropic, AI developers would breathe more easily. It would allow them the freedom to continue scraping openly available data under the broader definition of fair use. If that is the case, then floodgates may stay open. However, it may only be temporary. Legislative pressure is building in both the U.S. and the EU to regulate training data use. The Reddit vs. Anthropic lawsuit could help define the legal boundaries of AI training in the years ahead. As questions around data ownership, consent, and platform control grow more urgent, this lawsuit highlights the need for clearer rules that apply to the rapidly evolving AI ecosystem.   

Jun 10, 2025 - 23:00
 0
Reddit’s Lawsuit Against Anthropic May Set New Ground Rules for AI Training

A new front has opened in the battle for how AI companies train their models. Reddit, one of the most widely used social news and forum platforms, has sued AI startup Anthropic for allegedly using its content without permission to train its Claude models. 

On the surface, it’s a dispute over data, but the implications run much deeper. As more platforms push back, the AI industry faces a pressing question: can it keep building on data it doesn’t own? It also raises concerns about how AI is licensed and commercialized.

Reddit claims that Anthropic accessed its platform more than 100,000 times since July 2024 to scrape user-generated content for AI training, in violation of Reddit’s terms of service. The platform also claims that Anthropic reportedly assured it had blocked its bots from accessing Reddit, but continued to do so anyway. 

Shutterstock AI-generated

Anthropic said in a statement that it disagreed with Reddit’s claims and that it intends to “defend ourselves vigorously.” That’s hardly a surprise. Given the scrutiny around AI training practices, companies are unlikely to concede any missteps — at least not publicly.

The lawsuit is filed in the California Superior Court in San Francisco, where both companies are based. It's the latest in a growing wave of legal action aimed at how AI companies collect and use data. Over the past couple of years, The New York Times sued OpenAI and Microsoft for allegedly reproducing its articles without authorization, while Getty Images took Stability AI to court for using millions of images without a license. 

Reddit is a rich source of training data, but it is also vulnerable. Its content, which includes millions of real and unscripted conversations, offers exactly the kind of human language that large language models (LLMs) are built to mimic. That makes it incredibly useful for AI developers, but also puts it at the center of a growing debate over who really owns the internet’s conversations.

However, Reddit isn’t treating its data as a public resource. In fact, last year it signed a $60 million licensing deal with Google, followed by another agreement with OpenAI to give the companies paid access to its data for training AI models. These deals aren’t just about generating revenue, they are part of Reddit’s broader strategy to position its content as a premium asset.

Through the lawsuit, Reddit is seeking monetary compensation for what it describes as unauthorized use of its data. Reddit claims that Anthropic’s “commercial exploitation” of its content could be worth billions of dollars. 

Source: Reddit.com

Reddit is also seeking injunctive relief by asking the court to block Anthropic from continuing to use Reddit data. This could be a serious problem for Anthropic. If the court grants that request, Anthropic might have to strip Reddit data out of its training sets, delete parts of its model weights, and possibly retrain its models from scratch.

This is not the first time Anthropic has been at the center of a legal fight over how it uses data. In August 2024, a group of authors filed a class-action lawsuit in California, accusing the company of building a multibillion-dollar business by copying hundreds of thousands of copyrighted books. Just two months later, Universal Music Group sued Anthropic in Tennessee, claiming its models were generating copyrighted song lyrics on a “systematic and widespread” scale.

For an AI developer, access to data is make-or-break. Any legal precedent that restricts this access could raise significant barriers to progress. If this lawsuit results in a positive outcome for Reddit, it could have vast implications for the future of AI models. Not only would AI developers need explicit permission or licenses to train on user-generated content, but the model pipelines will also get more expensive and slower to develop. It could also mean that AI models could potentially be narrower in scope. 

Shutterstock

If the outcome goes in favor of Anthropic, AI developers would breathe more easily. It would allow them the freedom to continue scraping openly available data under the broader definition of fair use. If that is the case, then floodgates may stay open. However, it may only be temporary. Legislative pressure is building in both the U.S. and the EU to regulate training data use.

The Reddit vs. Anthropic lawsuit could help define the legal boundaries of AI training in the years ahead. As questions around data ownership, consent, and platform control grow more urgent, this lawsuit highlights the need for clearer rules that apply to the rapidly evolving AI ecosystem.