One-sentence summary: Meta and OpenAI’s use of the massive, pirated Library Genesis (LibGen) database to train AI models like Llama raises serious legal, ethical, and societal concerns over copyright infringement and the future of knowledge sharing.
Alex Reisner’s investigation for The Atlantic reveals that Meta employees, while developing the AI model Llama 3, knowingly used the pirated book database Library Genesis (LibGen) because licensing material was slow and expensive. LibGen contains over 7.5 million books and 81 million research papers, making it a highly attractive but legally risky resource for AI training. Internal communications show that Meta employees were aware of the legal implications, described the approach as “medium-high legal risk,” and discussed ways to cover their tracks, including avoiding citations of LibGen and removing metadata from pirated content.
These revelations come amid lawsuits from authors like Sarah Silverman and Junot Díaz, who allege copyright infringement. Court documents also reveal that OpenAI used LibGen in the past, although the company says the datasets were last used in 2021 and are not part of its current models. The article links to an interactive tool that lets users explore the LibGen database, shedding light on the scale and scope of the pirated works involved, from mainstream novels to top-tier academic journal articles.
The piece also explores the historical context and motivations behind LibGen and its sibling Sci-Hub, originally created to provide access to information for people in countries or institutions where scholarly resources are unaffordable or unavailable. Despite repeated lawsuits and court-ordered fines against LibGen and its affiliates, enforcement has largely failed, allowing these sites to continue operating.
The ethical dilemma goes beyond piracy: generative AI models built on these pirated works are being commercialized and presented as sources of knowledge, often without proper attribution or transparency. This raises broader questions about fairness, the ownership of intellectual labor, and whether generative-AI outputs truly benefit society or instead erode the foundations of human-driven scholarship and creativity.
Reisner concludes by questioning whether generative AI, built on absorbed and decontextualized human knowledge, can truly advance science and society, or whether it merely monetizes others’ intellectual labor while sidelining real human dialogue and contribution.
Reisner, Alex. “The Unbelievable Scale of AI’s Pirated-Books Problem.” The Atlantic, 21 Mar. 2025, www.theatlantic.com/technology/archive/2025/03/libgen-meta-openai/682093.
Key takeaways:
- Meta used LibGen, a large pirated library of books and research papers, to train its AI model Llama 3.
- Internal company communications show Meta knew the legal risks but prioritized access and speed.
- OpenAI also used LibGen data in the past, though it says the data is no longer part of its current models.
- LibGen includes millions of pirated academic and literary works and continues to grow despite legal challenges.
- Use of pirated data raises legal, ethical, and societal concerns about how AI is trained and who benefits from it.
- AI companies invoke “fair use,” but courts have yet to rule definitively on this defense in the context of generative AI.
Most important quotations:
- “Books are actually more important than web data.”
- “Torrenting from a corporate laptop doesn’t feel right.”
- “If we license one single book, we won’t be able to lean into fair use strategy.”
- “It is easy to see why LibGen appeals to generative-AI companies.”
- “Generative-AI chatbots are presented as oracles… and often don’t cite sources.”
- “Will these be better for society than the human dialogue they are already starting to replace?”
Word count of summary: 691
Word count of original article: 2,412
Model version: GPT-4
Custom GPT name: Summarizer 2
SEO tags for blog post:
AI training datasets, copyright and AI, Library Genesis lawsuit, Meta Llama 3 AI