Interesting that Anthropic is making this argument, considering their story in the AI space. They're certainly no OpenAI.
Most things that I could talk about were already addressed by other users (especially @OttoVonNoob@lemmy.ca), so I'll address a specific point - better models would skip this issue altogether.
The current models are extremely inefficient in their usage of training data. LLMs are a good example; Claude v2.1 was allegedly trained on hundreds of billions of words. Meanwhile, a 4yo child is estimated to have heard somewhere between 13 million and 45 million words in their still-short life. That's four orders of magnitude of difference, so even if someone claims that those bots are as smart as a 4yo*, they're still chewing through the training data without using it efficiently.
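The "four orders of magnitude" claim can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming 400 billion words for the model (a stand-in for "hundreds of billions") and the upper end of the child estimate; both figures are contested estimates, not measured facts:

```python
import math

# Assumed figures (both rough, contested estimates):
llm_training_words = 400e9    # "hundreds of billions" of words
child_words_by_age_4 = 45e6   # upper end of the 13-45 million range

ratio = llm_training_words / child_words_by_age_4
print(f"ratio ≈ {ratio:.0f}, i.e. about 10^{math.log10(ratio):.0f}")
```

Even taking the child-friendly end of both ranges, the gap stays around 10^4, which is all the argument needs.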
Once this is solved, the corpus size will get way, way smaller. Then it would be rather feasible to train those models without offending the precious desire for greed of the American media mafia, in a way that still fulfils the entitlement of the GAFAM mafia.
*I seriously doubt that, but I can't be arsed to argue this here - it's a drop in a bucket.
The thing is, I'm not sure at all that it's even physically possible for an LLM to be trained like a four-year-old; they learn in fundamentally different ways. Even very young children quickly learn by associating words with concepts and objects, not by forming a statistical model of how often one meaningless string of characters comes after every other meaningless string of characters.
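The "statistical model of which string follows which" idea can be made concrete with a toy bigram model: it only counts how often one token follows another, with no notion of meaning. This is a deliberately crude sketch (real LLMs use far richer architectures, and the corpus here is made up), but the underlying training signal is still co-occurrence:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; no semantics, just token order.
corpus = "the cat sat on the mat the cat ate".split()

# Count how often each token follows each other token.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def most_likely_next(token):
    # Predict whichever token most often followed `token` in the corpus.
    return follows[token].most_common(1)[0][0]

print(most_likely_next("the"))  # "cat" followed "the" twice, "mat" once -> "cat"
```

The model "knows" that "cat" tends to follow "the" without any concept of what a cat is, which is the contrast with how a child learns.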
Similarly, when it comes to image classifiers, a child can often associate a word with a concept or object after a single example, without needing to be shown hundreds of thousands of examples before they can build a wide variety of pixel-value mappings based on statistical association.
Moreover, a very large amount of the "progress" we've seen in the last few years has come only from simplifying the transformers and using ever larger datasets. For instance, GPT-4 is a big improvement on GPT-3, but about the only major difference between the two models is that they threw nearly the entire text of the internet at 4, compared to 3's smaller dataset.
My point is that the current approach - statistical association - is so crude that it'll probably get ditched in the near future anyway, with or without licensing matters. And those better models (which won't be LLMs or diffusion-based) will probably skip this issue altogether.
The comparison with 4yos is there mostly to highlight how crude it is. I don't think either that it's viable to "train" models in the same way as we'd train a human being.
Data Leak at Anthropic Due to Contractor Error
TL;DR - Anthropic had a data leak due to a contractor’s mistake, but says no sensitive info was exposed. It wasn’t a system breach, and there’s no sign of malicious intent.
Google and Amazon both have massive corpuses of this data that they would allow only themselves to use.
Anthropic isn’t saying this to help content creators, they’re saying this to kill OpenAI so they don’t have to actually compete