A group of authors has sued software giant Salesforce in federal court in San Francisco, alleging that it built its XGen AI models on a pirated library of books. According to the lawsuit, the company scrubbed references to those sources once questions arose.
The lawsuit was filed on Wednesday by authors E. Molly Tanzer and Jennifer Gilmore under the Copyright Act. It alleges ongoing infringement, saying Salesforce “continues to do so by continuing to store, copy, use, and process the datasets containing copies of Plaintiffs’[…] copyrighted books.”
The complaint cites statements from Salesforce CEO Marc Benioff, who told a Bloomberg interviewer in January 2024 that AI companies had ripped off training data and that all of the training data had been stolen.
The authors seek class certification for all US copyright holders whose works have been used since October 2022. They are seeking statutory damages, the destruction of infringing copies, the return of profits, a declaration of willful infringement, and attorneys’ fees.
According to the complaint, Salesforce pirated hundreds of thousands of copyrighted books to develop its XGen series of large language models. It allegedly did so by using the “notorious RedPajama and The Pile datasets,” which include a book corpus called Books3 containing more than 196,000 books copied from the private tracker Bibliotik.
The filing states that Salesforce first mentioned “RedPajama-Books” as one of its training sources when it launched XGen in June 2023. An engineer for the company then linked GitHub users directly to both datasets.
However, by September, those mentions were taken down from Salesforce’s website and replaced with vague descriptions of “natural language data” from “publicly available sources.” The next month, Hugging Face, the site that hosted Books3, removed the dataset due to copyright concerns.
Additionally, the lawsuit alleges that Salesforce trained its CodeGen models on The Pile in 2022. The company then brought the technology to market through its Agentforce AI platform, with the XGen-Sales model released in October 2024.
However, according to experts, authors must prove real financial harm, not just that their books were used for training. Recently, Judge Vince Chhabria dismissed similar claims against Meta, ruling that “simply claiming ‘our work was used’ isn’t enough.” The judge found Meta’s use of copyrighted books to train AI to be fair use.
Additionally, as reported by Cryptopolitan, recent rulings have favored OpenAI and Anthropic in similar cases, with judges finding that authors failed to prove market harm. However, one judge criticized Anthropic for maintaining a permanent library of pirated books.
In other news, Salesforce has extended its partnership with Google to include deeper integration of Gemini AI models with its Agentforce 360 platform.
Gemini’s multimodal intelligence will be integrated into the Salesforce ecosystem as a result of the partnership. This will help support tasks such as hybrid reasoning and multi-step process automation across enterprise sales and IT services.
The expanded integration enables the Atlas Reasoning Engine, central to Agentforce 360, to leverage Gemini models. This gives enterprise workflows additional model options.
Additionally, the hybrid reasoning capability enables users to set up AI agents within Salesforce that produce consistent and accurate outputs. The collaboration also extends the reach of Salesforce’s Gemini integration, previously limited to Gmail, to other Google Workspace applications, including Sheets, Docs, Drive, Slides, and Meet.
Agentforce 360 now supports native interoperability with Google Workspace, allowing users to initiate sales engagements, qualify leads, and schedule meetings from within applications like Gmail and Google Calendar. It also provides direct access to Salesforce Customer 360 apps within Google tools, streamlining data access and workflow continuity for sales and service teams.
Salesforce chief scientist Silvio Savarese said, “In the enterprise environment, it’s imperative for AI agents to be highly capable and highly consistent, especially for critical use cases […] Together, we are setting a new standard for building the future of what’s possible in the Agentic Enterprise down to the model level.”