Meta accused of using pirated data for AI development

News Room

Plaintiffs in the case of Kadrey et al. vs. Meta have filed a motion alleging the firm knowingly used copyrighted works in the development of its AI models.

The plaintiffs, who include author Richard Kadrey, filed their “Reply in Support of Plaintiffs’ Motion for Leave to File Third Amended Consolidated Complaint” in the United States District Court for the Northern District of California.

The filing accuses Meta of systematically torrenting and stripping copyright management information (CMI) from pirated datasets, including works from the notorious shadow library LibGen.

According to documents recently submitted to the court, evidence reveals highly incriminating practices involving Meta’s senior leaders. Plaintiffs allege that Meta CEO Mark Zuckerberg gave explicit approval for the use of the LibGen dataset, despite internal concerns raised by the company’s AI executives.

A December 2024 memo from internal Meta discussions acknowledged LibGen as “a dataset we know to be pirated,” with debates arising about the ethical and legal ramifications of using such materials. Documents also revealed that top engineers hesitated to torrent the datasets, citing concerns about using corporate laptops for potentially unlawful activities.

Additionally, internal communications suggest that after acquiring the LibGen dataset, Meta stripped CMI from the copyrighted works contained within, a practice that plaintiffs highlight as central to their claims of copyright infringement.

According to the deposition of Michael Clark – a corporate representative for Meta – the company implemented scripts designed to remove any information identifying these works as copyrighted, including keywords like “copyright,” “acknowledgements,” or lines commonly used in such texts. Clark attested that this practice was done intentionally to prepare the dataset for training Meta’s Llama AI models.  

“Doesn’t feel right”

The allegations against Meta paint a portrait of a company knowingly partaking in a widespread piracy scheme facilitated through torrenting.

According to a string of emails included as exhibits, Meta engineers expressed concerns about the optics of torrenting pirated datasets from within corporate spaces. One engineer noted that “torrenting from a [Meta-owned] corporate laptop doesn’t feel right,” but despite hesitation, the rapid downloading and distribution – or “seeding” – of pirated data took place.

Legal counsel for the plaintiffs has stated that as late as January 2024, Meta had “already torrented (both downloaded and distributed) data from LibGen.” Moreover, records show that hundreds of related documents were initially obtained by Meta months prior but were withheld during early discovery processes. Plaintiffs argue this delayed disclosure amounts to bad-faith attempts by Meta to obstruct access to vital evidence.

During a deposition on 17 December 2024, Zuckerberg himself reportedly admitted that such activities would raise “lots of red flags” and stated it “seems like a bad thing,” though he provided limited direct responses regarding Meta’s broader AI training practices.

The case began as an intellectual property infringement action on behalf of authors and publishers, who claimed violations relating to AI use of their materials. The plaintiffs are now seeking to add two major claims to their suit: a violation of the Digital Millennium Copyright Act (DMCA) and a breach of the California Comprehensive Computer Data Access and Fraud Act (CDAFA).

Under the DMCA, the plaintiffs assert that Meta knowingly removed copyright protections to conceal unauthorised uses of copyrighted texts in its Llama models.

The complaint alleges that Meta stripped CMI “to reduce the chance that the models will memorise this data” and that this removal of rights management indicators made the infringement harder for copyright holders to discover.

The CDAFA allegations involve Meta’s methods for obtaining the LibGen dataset, including allegedly engaging in torrenting to acquire copyrighted datasets without permission. Internal documentation shows Meta engineers openly discussed concerns that seeding and torrenting might prove to be “legally not ok.” 

Meta case may impact emerging legislation around AI development

At the heart of this expanding legal battle lies growing concern over the intersection of copyright law and AI.

Plaintiffs argue the stripping of copyright protections from textual datasets denies rightful compensation to copyright owners and allows Meta to build AI systems like Llama on the financial ruins of authors’ and publishers’ creative efforts.

These allegations arrive amid heightened global scrutiny of “generative AI” technologies. Companies like OpenAI, Google, and Meta have all come under fire regarding the use of copyrighted data to train their models. Courts across jurisdictions are currently grappling with the long-term impact of AI on rights management, with potentially landmark cases being decided in both the US and the UK.

More broadly, US courts have shown increasing willingness to hear complaints about AI’s potential harm to long-established copyright law precedents. In their motion, the plaintiffs cited The Intercept Media v. OpenAI, a recent decision from New York in which a similar DMCA claim was allowed to proceed.

Meta continues to deny all allegations in the case and has yet to publicly respond to Zuckerberg’s reported deposition statements.

Whether or not plaintiffs succeed in these amendments, authors across the world face growing anxieties about how their creative works are handled within the context of AI. With copyright law struggling to keep pace with technological advances, this case underscores the need for clearer guidance at an international level to protect both creators and innovators.

For Meta, these claims also represent a reputational risk. As AI becomes the central focus of its future strategy, the allegations of reliance on pirated libraries are unlikely to help its ambitions of maintaining leadership in the field.  

The unfolding case of Kadrey et al. vs. Meta could have far-reaching ramifications for the development of AI models moving forward, potentially setting legal precedents in the US and beyond.
