Anthropic destroyed millions of print books to build its AI models
(arstechnica.com)
The books were purchased and destroyed to digitize them. There is nothing wrong with digitizing a work. The books were destroyed because duplicating a work without permission is illegal, but destroying the original means that only one copy still exists in the end.
The LLM training is the problem. This is not.
Hit the nail on the head.
Millions and millions of print books are destroyed all the time, and very rarely is anything of value lost. Libraries, thrift stores, and used book stores get inundated with thousands of donated books, most of which nobody wants. Unless you, personally, are going to take on sorting, transporting, and storing dozens of duplicate copies of books in poor condition, and have some purpose for them (presumably?), then get off your high horse about the destruction of bulk-purchased used books.
Individual copies of mass-published books are not precious. Only rare books are important for preservation. And, even then, digital copies are much more practical for long-term storage than physical books. Anna's Archive's preservation project as a shadow library is only possible because data storage is very cheap, infinitely replicable, and practically free to transport.
It is not illegal if you don't distribute, which the judge ruled meant this was fair use. They destroyed the books as part of the digitizing project because it is likely faster and cheaper than non-destructive methods.
That is not how this works at all. As long as you aren't distributing, you are well within your rights to make copies of a book you purchase.
Quoting the analysis in the ruling:
In other words, part of what is being ruled is whether digitizing the books was fair use. Reinforcing that:
Bold text is me. Italics are the ruling.
Further down:
The judge ruled that the digitization is fair use.
Notably, the question about fair use is important because of what the work is being used for. These are being used in a commercial setting to make money, not in a private setting. Additionally, as the works were inputs into the LLM, it is related to the judge's decision on whether using them to train the LLM is fair use.
Naturally the pirated works are another story, but this article is about the destruction of the physical copies, which only happened for works they purchased. Pirating for LLMs is unacceptable, but that isn't the question here.
The ruling does go on to indicate that Anthropic might have been able to get away with not destroying the originals, but destroying them meant that the format change was "more clearly transformative" as a result, and questions around fair use are largely up to the judge's opinion on four factors (purpose of use, nature of the work, amount of work used, and effect of use on the market).
TL;DR: Destroying the original had an effect on the judge's decision and increased the transformativeness of digitizing the books. They might have been fine without doing it, but the judge admitted that it was relevant to the question of fair use.
That is true, and they may have been doing it to cover their asses, but I would bet they used the destructive method because it was faster or cheaper (or both). We will probably never know the minutiae of that decision, though.