On March 25, 2026, Reddit sent shockwaves through the AI community: the platform is shutting its doors to automated scrapers, requiring biometric verification for suspicious accounts, and removing 100,000 bot accounts every day.

For regular users, this is good news: less spam, fewer fake accounts, a healthier community.
For developers training language models, it’s a different story.
Reddit is not just a social platform: it’s one of the most valuable training data sources in the world for LLMs.
This announcement changes far more than scraping for open source AI: access to an irreplaceable corpus of human knowledge is closing off, and only those who can afford to pay will keep it open.
Key takeaways:
- Reddit is blocking AI scraping via biometric verification (World ID, Face ID, passkeys) and removing 100,000 bots per day since March 25, 2026
- Google ($60M/year) and OpenAI ($70M/year) hold exclusive licensing agreements: proprietary models keep accessing the data, open source does not
- Local fine-tuning on specialized subreddits (r/MachineLearning, r/Python) now violates Reddit's terms of service without a license: a structural divide is widening between proprietary players and community projects
- Pushshift, the community archive of Reddit data, has been progressively taken offline: legal alternatives (Common Crawl, The Pile, Hugging Face) don’t fill the gap
- For French-language datasets, the impact is amplified: Reddit was one of the few sources of authentic large-scale human conversations in French
What Reddit just announced (and why this time is different)
The fight against bots on Reddit is nothing new.
What changes with the March 25, 2026 announcement is the arsenal deployed: the platform now requires human verification for suspicious accounts, relying on third-party tools such as biometric passkeys from Apple, Google and YubiKey, Face ID facial recognition, and Sam Altman’s World ID.
Steve Huffman, Reddit’s CEO, summed up the philosophy: “Our goal is to confirm that a person stands behind the account, not who that person is.”
Reddit wants to know you’re human, not who you are: a thin line between community protection and mass surveillance.
New [App] labels allow legitimate bot developers (moderation, content filtering, analytics tools) to register officially via r/redditdev: a clear distinction between useful bots and unauthorized AI scrapers.
The staggering figure: 100,000 accounts removed every day, identified through precise behavioral signals: abnormal posting speeds, suspicious voting patterns, connections from proxy networks.
This is the quantified confirmation of a reality Cloudflare has been documenting for two years: bots already account for the majority of internet traffic, with an alarming prediction placing 2027 as the tipping point where automated traffic will definitively outpace human traffic.
What Reddit is doing looks strikingly similar to what newspapers did with Google News: shutting off the free access tap to monetize the real value of its content.
Reddit, the LLM goldmine: why this data is irreplaceable
To grasp the scale of the problem, you need to understand why Reddit is so valuable for model training.
The web contains petabytes of text, but the quality of that text is uneven: auto-generated product pages, SEO spam, 404 errors, empty content.
Reddit is different because of its very structure: the voting system naturally filters for quality content.
A useful comment on r/MachineLearning rises to the top, a false or sloppy one sinks: Reddit unknowingly created the best quality control mechanism on the conversational internet.
Subreddits are organized by topic: a model trained on r/MachineLearning learns the vocabulary, reasoning patterns and common mistakes of machine learning practitioners, not fragments diluted across billions of irrelevant pages.
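The vote-based filtering described above can be sketched in a few lines. This is a minimal illustration of the principle, not Reddit's actual pipeline: the field names and threshold are assumptions.

```python
# Minimal sketch of vote-based quality filtering, the mechanism the
# article credits for Reddit's data quality. Fields and threshold
# are illustrative, not Reddit's real schema.

def filter_by_score(comments, min_score=5):
    """Keep only comments whose community score clears a threshold."""
    return [c for c in comments if c["score"] >= min_score]

corpus = [
    {"body": "Use gradient clipping to stabilize training.", "score": 142},
    {"body": "just google it lol", "score": -3},
    {"body": "LoRA freezes the base weights and trains small adapters.", "score": 57},
]

high_quality = filter_by_score(corpus)
print(len(high_quality))  # → 2: the downvoted comment is dropped
```

The point is that the filtering signal comes from thousands of human judgments, which no automated heuristic on generic web text can replicate.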
Reddit captures a form of knowledge that books and academic papers cannot convey: tacit knowledge.

When an experienced developer answers a JavaScript debugging question on r/learnprogramming, they articulate their thinking as an expert speaking to a beginner: this dialogue format is extraordinarily useful for teaching an LLM to generate nuanced responses.
Add to that a corpus stretching back to 2005, billions of archived messages, and over 100 million daily active users: Reddit is to LLMs what the Library of Alexandria was to ancient scholars.
For more on the strategic value of Reddit for LLMs, we analyzed how the platform has become a pillar of modern artificial intelligence.
The concrete impact on open source AI and fine-tuning
Proprietary models win, open source loses
Reddit isn’t closing its data to everyone: it’s monetizing it through deals with tech giants.
Google signed an agreement worth $60 million per year in February 2024, granting real-time access to Reddit content via its Data API to train Gemini.
OpenAI followed with an agreement estimated at $70 million annually, giving ChatGPT structured access to Reddit’s archives and their continuous updates.
The logic is straightforward: these deals are bidirectional.
Google funds Reddit and, in return, Reddit uses Vertex AI to strengthen its internal search function: a symbiosis that benefits both parties simultaneously.
$130 million a year separates proprietary models from open source: not a skills gap, a budget gap.
Open source projects like Llama or Mistral don’t have access to these licenses: they must make do with legal alternatives whose quality is structurally inferior.
The cruel irony: AI bots were actively posting on Reddit to generate technical discussions, creating training data for their own successors.
Reddit has ended this loop, but only for players without a license.
Local fine-tuning on borrowed time
Imagine an independent developer who wants to fine-tune Llama 3 on r/MachineLearning discussions to build a specialized machine learning assistant.
Until early 2024, this was technically feasible via direct scraping or through Pushshift, the community archive that indexed the entirety of Reddit’s data.
Pushshift has been progressively taken offline, and Reddit simultaneously blocked access to the Internet Archive’s Wayback Machine for its historical data.
This developer now has three options, all problematic: go without Reddit and accept lower fine-tuning quality, use historical archives in a legally uncertain framework, or violate the terms of service and risk a lawsuit.
The contrast with the situation of large proprietary models is staggering: while this developer searches for alternatives, GPT-4o trains on Reddit data in real time, continuously updated through the OpenAI partnership.
The legal precedent: the Reddit vs Perplexity lawsuit
In October 2024, Reddit filed a lawsuit against Perplexity AI and SerpApi for unauthorized scraping of its data.
This lawsuit goes beyond a simple scraper dispute: it establishes a legal precedent on the ownership of data generated by a platform’s users.
The core question of the case is fundamental: can a platform claim exclusive rights over content that its users created freely and without compensation?
Analysis by law firm Troutman Pepper suggests Reddit has strong arguments: the terms of service explicitly prohibit commercial scraping, and Reddit data constitutes a valuable commercial asset, as demonstrated by the agreements worth tens of millions of dollars per year signed with Google and OpenAI.
For the AI community, this lawsuit marks a turning point: scraping web data to train models is entering an expanding hostile legal zone, with platforms now both capable and motivated to prosecute violators.
The parallel with Digg, shut down in 2026 because bots had overwhelmed the platform and destroyed its value for human users, is telling: where Digg succumbed to bots, Reddit chooses to fight them by monetizing its defenses.

Legal alternatives for training your models
Legal alternatives do exist.
None replaces Reddit on its own.
Common Crawl (commoncrawl.org) is the largest freely available corpus, with petabytes of data from billions of web pages: without the quality filtering by vote that makes Reddit powerful, the density of useful information is far lower.
The Pile, developed by EleutherAI, combines 22 sources including academic papers, GitHub code, Wikipedia data and historical Reddit archives: its 825 gigabytes represent an excellent starting point for general-purpose training.
Wikipedia dumps and Hugging Face datasets (huggingface.co/datasets) offer quality thematic corpora for specific domains.
For developers focused on open AI and alternatives to proprietary models, projects like Mistral’s open source models show that it’s possible to build high-performing models with alternative data, provided you invest in quality curation.
Specialized alternative forums (Stack Overflow, Hacker News, community Discord servers) represent an underexplored avenue: their content is more technical and less noisy than the generic web, with often more permissive licenses.
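Multi-source corpora like The Pile work by mixing their components according to sampling weights. The sketch below shows the idea with toy data: the source names and weights are illustrative, not The Pile's actual proportions.

```python
import random

# Hypothetical stand-ins for a multi-source training corpus;
# names and weights are illustrative, not The Pile's real mix.
sources = {
    "wikipedia": ["Paris is the capital of France."],
    "github": ["def add(a, b): return a + b"],
    "stackoverflow": ["Use a context manager to close the file."],
}
weights = {"wikipedia": 0.3, "github": 0.5, "stackoverflow": 0.2}

def sample_mixture(sources, weights, n, seed=0):
    """Draw n documents, choosing each one's source by weight."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(n):
        source = rng.choices(names, probs)[0]
        batch.append(rng.choice(sources[source]))
    return batch

batch = sample_mixture(sources, weights, 10)
```

Tuning these weights is exactly the "quality curation" work the article mentions: without a high-signal source like Reddit, the remaining sources have to be weighted and filtered much more carefully.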
For French-language datasets, the situation is even more strained: Reddit was one of the few sources of authentic large-scale human conversations in French.
French-language datasets are structurally underrepresented in public corpora, which amplifies the impact of Reddit's exclusion for anyone building models targeting the French-speaking market.
What this means for the future of open AI
Reddit’s March 25, 2026 announcement is part of a broader movement: the progressive tightening of access to web training data.
The European AI Act is pushing in the same direction: transparency obligations on training data make covert scraping even riskier, since models deployed in Europe will need to prove their data complies with the regulatory framework.
For open source developers, the scenario taking shape is one of lasting stratification: proprietary models funded by data licenses keep advancing with fresh, high-quality data, while community projects plateau on static corpora.
This is not inevitable: the open source response lies in collaborative dataset creation, community curation, and developing training techniques less dependent on raw data volume, such as PEFT/LoRA fine-tuning, which allows adapting an existing model to a specific domain with far less data.
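The LoRA idea mentioned above is simple enough to show in miniature. Instead of updating a full d x d weight matrix, you train two small matrices A (r x d) and B (d x r) and apply W + (alpha / r) * B @ A. The sketch below uses tiny pure-Python matrices for clarity; real fine-tuning would use a library such as Hugging Face's peft.

```python
# Minimal sketch of the LoRA low-rank update. The base matrix W stays
# frozen; only the small adapters A and B would be trained.

def matmul(X, Y):
    """Naive matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r, alpha = 4, 1, 2.0
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
A = [[0.1, 0.2, 0.3, 0.4]]        # r x d, trainable
B = [[0.0], [0.0], [0.0], [0.0]]  # d x r, zero-initialized: adapter starts inert

delta = matmul(B, A)  # d x d low-rank update, rank at most r
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d)]
         for i in range(d)]

# With B at zero, W_eff equals W exactly: training only ever moves
# 2 * d * r adapter parameters instead of d * d full weights.
```

That parameter count is the whole argument: at rank r = 8 on a 4096 x 4096 layer, the adapter is roughly 0.4% the size of the full matrix, which is why LoRA needs far less data to adapt a model to a narrow domain.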
The real answer to data closure isn’t finding better scrapers: it’s building models that learn more from less.
For those looking to explore these alternatives without depending on large proprietary models, local AI solutions offer a concrete path: data access and training control remain in the developer’s hands.
The quiet exodus of developers and users toward alternative tools, which we documented in our analysis of the exodus toward alternatives to proprietary models, is not unrelated to this dynamic: dependence on data and licenses from large platforms is driving the search for autonomy.
The question is no longer whether training data will become a paid, controlled resource: it already is.
The real question is whether open source AI will be able to build its own data infrastructure before the gap with proprietary models becomes impossible to close.
FAQ
What exactly did Reddit announce on March 25, 2026?
Reddit announced enhanced anti-bot measures: mandatory biometric verification for suspicious accounts (World ID, Face ID, passkeys), an [App] label system to identify legitimate bots, and the removal of 100,000 automated accounts per day using behavioral detection tools.
Why is Reddit data so important for LLMs?
Reddit combines massive volume (billions of messages since 2005), thematic organization by subreddits, and a voting mechanism that naturally filters for quality content: it’s a corpus of authentic human conversations that the generic web cannot replicate.
Do Google and OpenAI still have access to Reddit data despite the block?
Yes: Google signed a $60 million per year agreement and OpenAI an estimated $70 million annual deal, giving both parties privileged, real-time access to Reddit data via the official API.
What is Pushshift and why is its inaccessibility a problem?
Pushshift was a community archive that indexed the entirety of Reddit’s historical data and allowed researchers and developers to access it freely for research and training projects: Reddit has progressively taken Pushshift offline, cutting off this major alternative.
What legal alternatives exist for training a model without Reddit?
The main alternatives include Common Crawl (massive web corpus), EleutherAI’s The Pile (825 GB combining 22 quality sources), Wikipedia dumps, Hugging Face datasets, and specialized forums like Stack Overflow or Hacker News.
Has the Reddit vs Perplexity lawsuit created a legal precedent?
This lawsuit filed in October 2024 establishes that platforms can pursue commercial scrapers of their data in court: it places unauthorized scraping for AI training in a hostile legal zone, with real legal risks for violators.
How does the European AI Act affect the situation?
The AI Act imposes transparency obligations on training data for models deployed in Europe: data scraped without authorization becomes even riskier to use, which reinforces the advantage of players holding official licenses.
Is the impact stronger for French-language datasets?
Yes: French-language corpora are structurally underrepresented in public training datasets, and Reddit was one of the few sources of authentic large-scale human conversations in French: the loss is proportionally heavier for models targeting the French-speaking market.
Is local fine-tuning of an open source model still possible without Reddit data?
Fine-tuning remains possible with techniques like LoRA/PEFT that require less data, but quality will be lower for domains where Reddit excels: programming, machine learning, specialized technical discussions.
What is the “dead internet theory” and what does it have to do with Reddit’s announcement?
The dead internet theory posits that the majority of online content is now generated by bots rather than humans: the removal of 100,000 bot accounts per day by Reddit confirms this is no longer speculation, and Cloudflare predicts automated traffic will surpass human traffic by 2027.