[-] dingus@lemmy.ml 21 points 1 year ago* (last edited 1 year ago)

See, I thought this was well-known when Books3 was dumped online in 2020.

https://twitter.com/theshawwn/status/1320282149329784833

This is even referenced in the article.

I guess maybe people were shocked it was really "all of Bibliotik" because they couldn't believe someone could actually manage to keep a decent share ratio on that fucking site without getting kicked off, especially while downloading the whole corpus. /s (I don't know this from personal experience or anything.)

In all seriousness, however, it's been well known for a while now that these models were being trained on copyrighted books, and the companies trying to hide their faces over it are a joke.

It's just like always: copyright is used to punish regular-ass people, but when corporations trash copyright, it's all "whoopsie doodles, can't you just give us a cost-of-doing-business fine and let us continue raping the public consciousness for a quick buck?" Corporations steal copyrighted material all the time, but regular-ass people don't have the money to fight it. Hiding behind fair use while using the material for profit isn't just a joke but a travesty, and the ultimate in twisting language to corporate ends.

They may have bitten off more than they can chew here, though, possibly paving the way for a class-action lawsuit from writers and publishers.

[-] MagicShel@programming.dev 5 points 1 year ago* (last edited 1 year ago)

Seems like a clearly transformative work that would be covered under fair use. As an aside, I've been using AI as a writing assistant/solitary roleplaying GM for several years now, and while the quality of the prose can be quite good, the authorship of stories is terrible, and I can't say they even do a good job of emulating a particular author's style.

[-] knotthatone@lemmy.one 13 points 1 year ago

"Clearly transformative" only applies to the work a human has put into the process. It isn't at all clear that an LLM would pass muster for a fair use defense, but there are court cases in progress that may try to answer that question. Ultimately, I think it's going to come down to whether the training process itself, and the human effort involved in training the model on copyrighted data, is considered transformative enough to be fair use, or doesn't constitute copying at all. As far as I know, none of the big cases are trying the "not a copy" defense, so we'll have to see how this all plays out.

In any event, copyright laws are horrifically behind the times and it's going to take new legislation sooner or later.

[-] jarfil@beehaw.org 6 points 1 year ago

My bet is: it's going to be decided on a case-by-case basis.

A large enough neural network can be used to store, and then recover, a 1:1 copy of a work... but a large enough corpus can contain more data than could ever be stored in a given-size neural network, even if some fragments of the input works can be recovered. So it will depend on how big a recoverable fragment has to be before it counts as copyright infringement... and then again, reproducing even a whole work is considered fair use for some purposes... but not in every country.
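That capacity argument can be made concrete with a back-of-the-envelope bound. This is a toy sketch: the parameter count and corpus size below are illustrative assumptions, and a model's real verbatim-memorization capacity is far lower and much harder to characterize.

```python
def max_memorizable_bytes(num_params: int, bits_per_param: int = 32) -> int:
    """Loose information-theoretic ceiling: the parameters themselves
    cannot encode more raw data than their own bit width holds."""
    return num_params * bits_per_param // 8

# Hypothetical numbers: a 7-billion-parameter model vs. a 100 GB text corpus.
model_ceiling = max_memorizable_bytes(7_000_000_000)  # 28 GB upper bound
corpus_bytes = 100 * 10**9                            # 100 GB of text

# Even this absurdly generous bound is smaller than the corpus, so verbatim
# storage of the whole training set is impossible; at most, fragments of
# individual works could be recoverable.
print(model_ceiling < corpus_bytes)
```

The point of the bound is exactly the fragment question above: a model trained on far more data than its parameters could ever hold cannot be a wholesale copy, which pushes the legal question onto how large a recoverable fragment matters.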

Copyright laws are not necessarily wrong; just remove the "until author's death plus 70 years" coverage, go back to a more reasonable "4 years since publication", and they make much more sense.

[-] knotthatone@lemmy.one 3 points 1 year ago

My bet is: it’s going to be decided on a case-by-case basis.

Almost certainly. Getty Images has several exhibits in its suit against Stability AI showing the Getty watermark popping up in Stable Diffusion's output, as well as several images that are substantially the same as their sources. Other generative models don't produce anything that similar to the source material, so we're probably going to wind up with lots of completely different, and likely contradictory, rulings on the matter before this gets anywhere near being sorted out legally.

Copyright laws are not necessarily wrong; just remove the “until author’s death plus 70 years” coverage, go back to a more reasonable “4 years since publication”, and they make much more sense.

The trouble with that line of thinking is that the laws are under no obligation to make sense. And the people who write and litigate those laws benefit from making them as complicated and irrational as they can get away with.

[-] jarfil@beehaw.org 2 points 1 year ago* (last edited 1 year ago)

In this case the Mickey Mouse Curve makes sense, just bad sense. At least the EU didn't make it 95 years, and also compromised on 70... 🙄

[-] MagicShel@programming.dev 3 points 1 year ago

I agree with that. And you're right that it's currently in the hands of the courts. I'm not a copyright expert, and I'm sure there are nuances I don't grasp: I didn't know fair use specifically requires human transformation, if that is indeed the case. We'll just have to see in the end whose layman's interpretation turns out to be correct. I just enjoy the friendly, respectful collective speculation and knowledge sharing.

this post was submitted on 21 Aug 2023
99 points (100.0% liked)

Technology