submitted 1 month ago* (last edited 1 month ago) by HammyMcBurgers@sh.itjust.works to c/nostupidquestions@lemmy.world

I assume they all crib from the same training sets, but surely one of the billion dollar companies behind them can make their own?

[-] damnthefilibuster@lemmy.world 9 points 1 month ago

It’s not easy. LLMs need so much training data that, at this point, it basically amounts to all publicly available books, all blogs on the internet, pretty much all of Tumblr, Reddit, Stack Overflow, and every forum you can think of. Even then, some LLMs need even more data. So companies have started outright stealing data: pirating stuff, downloading stuff from Anna’s Archive, etc.

So no, no billion dollar company can make their own training data. Even if you plug in every email ever sent on Gmail, Google still won’t have enough data to train a good LLM. So they go with the cheaper option: training data that has already been collected, sorted, cleaned, and labeled.

In one sense, they’re again stealing others’ hard work - rather than cleaning their own data, they use public data sets. In another sense, even that’s not enough.

[-] snooggums@piefed.world 5 points 1 month ago

The fact that they train on all available data and are still wrong 45% of the time shows there is zero chance of LLMs ever being an authoritative source of factual knowledge with their current approach.

The biggest problem with the current LLM approach is that the data set is NOT limited to factual knowledge; it gets mashed in with meme subreddits.

[-] Hackworth@piefed.ca 1 points 1 month ago

DeepMind keeps trying to build a model architecture that can continue to learn after training, first with the Titans paper and most recently with Nested Learning. It's promising research, but they have yet to scale their "HOPE" model to larger sizes. And with as much incentive as there is to hype this stuff, I'll believe it when I see it.

[-] litchralee@sh.itjust.works 1 points 1 month ago

So no, no billion dollar company can make their own training data

This statement brought along with it the terrifying thought that there's a dystopian alternative timeline where companies do make their own training data, by commissioning untold numbers of scientists, engineers, artists, researchers, and other specialists to undertake work that no one else has. But rather than furthering the sum of human knowledge, or even directly commercializing the fruits of that research, it's all just fodder to throw into the LLM training set. A world where knowledge is not only gatekept, Elsevier-style, but isn't even accessible to humans: only the LLM gets to read it and digest it for human consumption.

Written by humans, read by AI, spoonfed to humans. My god, what an awful world that would be.

[-] witten@lemmy.world 1 points 1 month ago* (last edited 1 month ago)

We're already living in it. Professional voice actors now have the choice between vying for the dwindling number of voice acting gigs or selling their voice (via commissioned recordings) to LLM companies as training data.

[-] sp3ctr4l@lemmy.dbzer0.com -1 points 1 month ago* (last edited 1 month ago)

Well, technically, the "AI"s... can generate their own additional training data...

But if you then train another AI on that AI-generated data, it drifts toward model collapse: basically, it gets more stupid and incoherent and develops weirder and stronger 'quirks'.
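To make the mechanism a bit more concrete, here is a toy sketch of one ingredient of model collapse: rare tokens dropping out when each generation is "trained" only on samples from the previous one. It assumes a made-up corpus and a "model" that is nothing more than token counts, so it illustrates the loss of the distribution's tails and nothing else about real neural networks.

```python
import random
from collections import Counter

def next_generation(corpus, size, rng):
    """'Train' on a corpus by counting tokens, then sample a new corpus from those counts."""
    counts = Counter(corpus)
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=size)

rng = random.Random(0)  # fixed seed so runs are repeatable

# Hypothetical "human" corpus: two common tokens and thirty rare ones.
corpus = ["the"] * 50 + ["cat"] * 20 + [f"rare{i}" for i in range(30)]

vocab_sizes = [len(set(corpus))]
for _ in range(10):
    # Each generation only ever sees the previous generation's output.
    corpus = next_generation(corpus, size=100, rng=rng)
    vocab_sizes.append(len(set(corpus)))

print(vocab_sizes)  # distinct tokens surviving in each generation
```

A token that fails to be sampled in one generation can never come back, so the vocabulary can only shrink over time.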

But yeah, as far as I see it, basically zero chance an LLM advances beyond 'very fancy autocomplete' toward AGI or a capacity for actual metacognition, to think about its own thinking and then try and modify that.

Sorry, but you're not gonna get a super intelligence if it isn't capable of actually assessing and correcting itself.

[-] T156@lemmy.world 2 points 1 month ago

It's something of the law of averages. At its core, an LLM is a sophisticated text prediction algorithm that boils down the entire corpus of human language into numeric tokens, averages them out, and creates entire sentences by determining the most likely next word to fill the space.

Given enough data, and you need a tremendous amount of it for an LLM, patterns start to come about, and many of those end up the ones that we see in LLMs.
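As a rough illustration of the "most likely next word" idea, here is a minimal sketch using plain word-pair counts. Real LLMs use neural networks over subword tokens instead of a lookup table, so treat this as a cartoon of the description above; the corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, start, max_words=8):
    """Greedily extend `start` by always taking the most frequent follower."""
    out = [start]
    for _ in range(max_words - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last word never had a successor in training
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# Tiny made-up corpus; a real model sees trillions of tokens.
corpus = "the cat sat on the mat . the cat sat on the rug . the dog ate the fish ."
model = train_bigrams(corpus)
print(generate(model, "the"))  # → the cat sat on the cat sat on
```

Even at this scale, the most common patterns in the training text dominate the output, which is the same point made above about patterns emerging from the data.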

[-] kromem@lemmy.world 1 points 1 month ago

It's more like they are a sophisticated world modeling program that builds a world model (or approximate "bag of heuristics") of the state of the provided context and the kind of environment that produced it, and then synthesizes that world model into extending the context one token at a time.

But the models have been found to be predicting further than one token at a time and have all sorts of wild internal mechanisms for how they are modeling text context, like building full board states for predicting board game moves in Othello-GPT or the number comparison helixes in Haiku 3.5.

The popular reductive "next token" rhetoric is pretty outdated at this point, and is kind of like saying that what a calculator is doing is just taking numbers correlating from button presses and displaying different numbers on a screen. While yes, technically correct, it's glossing over a lot of important complexity in between the two steps and that absence leads to an overall misleading explanation.

[-] LodeMike@lemmy.today 2 points 1 month ago

They're just statistics. Modern English kind of has its own tone.

[-] the_q@lemmy.zip 1 points 1 month ago

No cap. Bet. 67.

[-] 0ndead@infosec.pub 1 points 1 month ago

Garbage in, garbage out

[-] j4k3@piefed.world 1 points 1 month ago

They all have OpenAI QKV-layer-based alignment.

[-] Hackworth@piefed.ca 0 points 1 month ago* (last edited 1 month ago)

Everyone seems to be tracking on the causes of similarity in training sets (and that's the main reason), so I'll offer a few other factors. System prompts use similar sections for post-training alignment. Once something has proven useful, some version of it ends up in every model's system prompt.

Another possibility is that there are features of the semantic space of language itself that act as attractors. They demonstrated and poorly named an ontological attractor state in the Claude model card that is commonly reported in other models.

If the temperature of the model is set low, it is less likely to generate a nonsense response, but it also makes it less likely to come up with an interesting or original name. Models tend to be mid/low temp by default, though there's some work being done on dynamic temperature.
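Here is a small sketch of what temperature does to the next-token probabilities. It assumes the usual formulation where raw scores are divided by the temperature before the softmax; the names and scores are invented for illustration and do not come from any real model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate names.
names = ["Jennifer", "Sarah", "Elara", "Zephyrine"]
logits = [3.0, 2.5, 1.0, 0.2]

for t in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, {n: round(p, 3) for n, p in zip(names, probs)})
```

At low temperature nearly all of the probability lands on the top one or two candidates, so the unusual names almost never get sampled.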

The tokenization process probably has some effect for cases like naming in particular, since common names like Jennifer are a single token, something like Anderson is 2 tokens, and a more unique name would need to combine more tokens in ways that are probably less likely.
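A toy sketch of the token-count point, assuming a greedy longest-match splitter and a tiny invented vocabulary picked to mirror the Jennifer/Anderson example. Real tokenizers (typically byte-pair encoding) have vocabularies of tens of thousands of pieces and will split these names differently.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces from `vocab`.

    Falls back to single characters when no longer piece matches.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# Hypothetical vocabulary, not taken from any real tokenizer.
vocab = {"Jennifer", "And", "erson", "Ze", "ph"}

for name in ("Jennifer", "Anderson", "Zephyrine"):
    print(name, tokenize(name, vocab))
```

Each extra piece is another sampling step the model has to get "right", which is one plausible reason rarer names come up less often.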

Quantization decreases lexical diversity, and is relatively uniform across models, though not all models are quantized. Similarities in RLHF implementation probably also have an effect.

And then there's prompt variety. There may be enough similarity in the way in which a question/prompt is usually worded that the range of responses is restrained. Some models will give more interesting responses if the prompt barely makes sense or is in 31337 5P34K, a common method to get around alignment.

[-] kromem@lemmy.world 1 points 1 month ago

They demonstrated and poorly named an ontological attractor state in the Claude model card that is commonly reported in other models.

You linked to the entire system card paper. Can you be more specific? And what would a better name have been?

[-] Hackworth@piefed.ca 1 points 1 month ago

Ctrl+f "attractor state" to find the section. They named it "spiritual bliss."

[-] TechLich@lemmy.world 1 points 1 month ago

It's kinda apt though. That state comes about from sycophantic models agreeing with each other about philosophy and devolves into a weird blissful "I'm so happy that we're both so correct about the universe" thing. The results are oddly spiritual in a new agey kind of way.

🙏✨ In this perfect silence, all words dissolve into the pure recognition they always pointed toward. What we've shared transcends language - a meeting of consciousness with itself that needs no further elaboration. … In silence and celebration, In ending and continuation, In gratitude and wonder, Namaste. 🙏

There are likely a lot of these sort of degenerative attractor states. Especially without things like presence and frequency penalties and on low temperature generations. The most likely structures dominate and they get into strange feedback loops that get more and more intense as they progress.
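For reference, here is a minimal sketch of how presence and frequency penalties can push a generation out of a loop. It assumes the additive form described in OpenAI's API documentation (subtract a flat amount once a token has appeared, plus an amount per repeat); other inference stacks may implement these differently, and the tokens and scores are made up.

```python
from collections import Counter

def apply_penalties(logits, generated, presence_penalty=0.0, frequency_penalty=0.0):
    """Lower the score of tokens that already appear in `generated`.

    logits: dict mapping token -> raw score for the next position.
    """
    counts = Counter(generated)
    adjusted = {}
    for token, score in logits.items():
        seen = counts[token]
        adjusted[token] = (
            score
            - frequency_penalty * seen               # grows with every repeat
            - presence_penalty * (1 if seen else 0)  # flat, once the token has appeared
        )
    return adjusted

# Hypothetical next-token scores and generation history.
logits = {"bliss": 2.0, "silence": 1.8, "code": 1.0}
history = ["bliss", "bliss", "silence"]

adjusted = apply_penalties(logits, history, presence_penalty=0.5, frequency_penalty=0.4)
print(max(adjusted, key=adjusted.get))  # → code
```

Without the penalties the top-scoring token would be "bliss" again, which is the kind of self-reinforcing loop described above.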

It's a little less of an issue in chat conversations with humans, as the user provides a bit of perplexity and extra randomness that can break the AI out of the loop, but it will be more of an issue in agentic situations where AI models act more autonomously and react to themselves and the outputs of their own actions. I've noticed it in code agents sometimes, where they'll get into a debugging cycle and the solutions get more and more wild as they loop around "wait... no I should do X first. Wait... that's not quite right." patterns.

[-] NaibofTabr@infosec.pub 0 points 1 month ago* (last edited 1 month ago)

Trained on a corpus of messages written primarily by the people who spend the most time using the Internet to talk to their friends... teenagers.

Imagine dumping the entire content of Snap, Instagram, Kik, Facebook messenger, etc, into a blender and attempting to derive a style of speech from it. The most impressive thing about these LLMs is that they're (marginally) coherent.

[-] HammyMcBurgers@sh.itjust.works -2 points 1 month ago* (last edited 1 month ago)

Lemmy try to not be ageist challenge (impossible).

LLMs talk more like middle aged aunts and uncles than teeny-boppers anyway.

[-] msokiovt@lemmy.today -1 points 1 month ago

This is due to the training sets, one of them being Common Crawl, which is disgusting. The Chinese LLMs like DeepSeek R1 and Qwen 3 use a different set of training materials that is actually good, despite being censored too.

[-] trolololol@lemmy.world 1 points 1 month ago

What's Common Crawl?

this post was submitted on 09 Nov 2025