Meta admits using pirated books to train AI, but won't pay for it (www.techspot.com)

submitted 2 years ago by Flatworm7591@lemmy.dbzer0.com to c/piracy@lemmy.dbzer0.com

Natanael@slrpnk.net 1 points 2 years ago* (last edited 2 years ago)

Humans learn a lot through repetition, so there's no reason to believe that LLMs wouldn't benefit from reinforcement of higher-quality information. Especially because seeing the same information in different contexts helps map the links between the different contexts and helps dispel incorrect assumptions. But like I said, the only viable method they have for this kind of emphasis at scale is the incidental replication of more popular works in the training samples. And when something is duplicated too much, the model overfits to it instead.

They need to fundamentally change big parts of how training works to resolve this conflict. In particular, the model will need a lot more "introspective" training stages to refine what it has learned, and pretty much nobody does anything even remotely like that on large models because they don't know how, and it would be insanely expensive anyway.

FaceDeer@fedia.io 1 points 2 years ago

> Especially because seeing the same information in different contexts helps map the links between the different contexts and helps dispel incorrect assumptions.

Yes, but this is exactly the point of deduplication: you don't want identical inputs, you want variety. If you want the AI to understand the concept of cats, you don't keep showing it the same picture of a cat over and over; all that tells it is that you want exactly that picture. You show it a whole bunch of different pictures whose only commonality is that there's a cat in them, and then the AI can figure out what "cat" means.
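
For anyone curious what that looks like in practice, here is a rough sketch of exact-match deduplication on text, assuming the corpus fits in memory and that lowercasing plus whitespace collapsing is enough normalization. The names `normalize`, `deduplicate`, and `corpus` are made up for illustration, and real training pipelines generally also do near-duplicate detection, which this does not attempt.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash alike.
    return " ".join(text.lower().split())


def deduplicate(docs):
    # Keep the first occurrence of each document, comparing by a hash of the
    # normalized text. This only catches exact copies after normalization.
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical toy corpus: the first two entries differ only in case and spacing.
corpus = [
    "The cat sat on the mat.",
    "the cat  sat on the mat.",
    "A cat slept on the sofa.",
]
unique_docs = deduplicate(corpus)
```

The two copies of the same sentence collapse into one entry, while the different sentence about a cat stays: the variety is kept and only the repetition goes.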