AI models feeding on AI data will lead to 'model collapse', researchers say (web.archive.org)

submitted 2 years ago by 0x815@feddit.de to c/technology@beehaw.org


Using model-generated content in training causes irreversible defects, a team of researchers says. "The tails of the original content distribution disappears," writes co-author Ross Anderson from the University of Cambridge in a blog post. "Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions."

Here is the study: http://web.archive.org/web/20230614184632/https://arxiv.org/abs/2305.17493
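
The Gaussian remark in the quote can be illustrated with a toy simulation. This is a minimal sketch, not the experiment from the study: it assumes a one-dimensional Gaussian as the "model", refits it by maximum likelihood to a small sample drawn from the previous generation, and repeats. The function name `simulate_collapse` and the parameters `generations`, `sample_size`, and `seed` are made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the fitted standard deviation per generation, starting with
    the original distribution's value of 1.0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: stands in for the "real" data
    history = [std]
    for _ in range(generations):
        # Each generation only sees samples drawn from the previous model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # Maximum-likelihood refit: sample mean and population std.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 10, 50, 100, 200):
        print(f"generation {generation:3d}: std = {history[generation]:.3e}")
```

With a small sample the refit loses a little of the spread on average each round, so the fitted standard deviation drifts toward zero: the tails go first, and the distribution narrows toward a spike. Larger samples slow the drift but do not remove it in this toy setup.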


[-] FaceDeer@kbin.social 10 points 2 years ago

So the only Reddit data that's really valuable to AI companies is the stuff that's already been archived, and the stuff that will be gated behind their paid API is less useful now that bots are becoming prevalent?

Good move, Reddit.

[-] dan96kid@beehaw.org 8 points 2 years ago

This reminds me of a saying from my programming classes: Garbage in, garbage out. It refers to how inputting bad data WILL make the program produce even more bad data.

[-] fragmentcity@lemm.ee 6 points 2 years ago

So basically we are sitting in a room.

[-] Metacortechs@lemmy.stellarvortex.com 2 points 2 years ago

That was really cool, thank you!

[-] LoreleiSankTheShip@lemmy.ml 2 points 2 years ago

That was trippy af

[-] Roberto@toast.ooo 1 points 2 years ago

That was interesting as duck

[-] coolin@beehaw.org 0 points 2 years ago

This isn't an actual problem. Can you train on post-ChatGPT internet text? No, but you can train on the pre-ChatGPT common crawls, the millions of conversations people have with the models and on audio, video and images. As we improve training techniques and model architectures, we will need even less of this data to train even more performant models.

[-] interolivary@beehaw.org 7 points 2 years ago

But then you're training on more and more outdated data

[-] Lowbird@beehaw.org 7 points 2 years ago

Both in terms of factual information, news, etc, and just in terms of language change. An LLM needs to be able to keep up with slang and other new words, both for understanding prompts and for producing passable results.

[-] Kerb@discuss.tchncs.de 1 points 2 years ago

Afaik, there are already solutions to that.

You first train the model on the outdated but correct data, to establish the correct "thought" patterns.

And then you can train the AI on the fresh but flawed data, without it tripping over the mistakes.

[-] artificial_unintelligence@programming.dev 6 points 2 years ago

I think it’s not a hard stop but it is an issue. I think it will force models to be trained in more novel ways, rather than just pumping more data in. I think ideally we’d be able to reach GPT level intelligence on fractions of the data and compute. These new techniques have yet to be made but this will put pressure on their creation.

[-] Lowbird@beehaw.org 7 points 2 years ago

I think that's a tremendously tall order. The current LLMs are straightforwardly Large Language Models that have zero ability to understand the language and only sort it based on statistical models that can only be gleaned via a vast heap of data. Reducing the size of any data set increases the likelihood of bias and blindspots no matter what you do.

At the least, an LLM cannot talk about anything (like news events, new inventions, new political ideas) until humans have talked about it first AND their talking about it has been put into the dataset. If something's not in the dataset, an LLM simply can't invent it. At absolute best, it'll spit out plausible-sounding bullshit.

Inventing actual, truly intelligent AI is a project very far removed from what we have now. It'd take the invention of entirely different systems, not just an iterative improvement of an LLM.

[-] Steeve@lemmy.ca 1 points 2 years ago

It's an interesting problem for machine learning engineers to solve, but yeah it's not apocalyptic for LLMs or anything. Probably just another preprocessing stage if anything.
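
One crude form such a preprocessing stage could take is the date cut-off suggested earlier in the thread: keep only documents crawled before ChatGPT's public release on 30 November 2022. The sketch below is an assumption-laden illustration, not anyone's actual pipeline. The `crawled` field, the `filter_pre_cutoff` helper, and the sample documents are all hypothetical, and a crawl date says nothing certain about when or how a text was written.

```python
from datetime import date

# Assumed cut-off: ChatGPT's public release date.
CUTOFF = date(2022, 11, 30)


def filter_pre_cutoff(documents, cutoff=CUTOFF):
    """Keep only documents whose crawl date is strictly before the cutoff."""
    return [doc for doc in documents if doc["crawled"] < cutoff]


# Hypothetical documents, each with a text and a crawl date.
docs = [
    {"text": "archived forum post", "crawled": date(2021, 5, 1)},
    {"text": "possibly bot-written post", "crawled": date(2023, 6, 20)},
]
kept = filter_pre_cutoff(docs)
print([doc["text"] for doc in kept])
```

A filter like this sidesteps the detection problem instead of solving it, and it runs straight into the objection raised above: everything it keeps gets more outdated over time.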

this post was submitted on 20 Jun 2023

50 points (100.0% liked)
