
A project called Poison Fountain is asking website operators to feed LLM crawlers poisoned data.

The project page links to URLs that provide a practically endless stream of poisoned training data. The project's creators have determined that this approach is very effective at sabotaging the quality and accuracy of AI trained on it.

Small quantities of poisoned training data can significantly damage a language model.

The page also gives suggestions on how to put the provided resources to use.

top 50 comments
[-] termaxima@slrpnk.net 2 points 1 week ago

Been thinking about making one of these too, especially since I have a catchy name : asbestos

[-] ZILtoid1991@lemmy.world 1 points 1 week ago

Me too, but with procedural image generation. Use some templates that are put together with a CPU blitter (extremely fast and effective), add some random descriptive text, then done. Don't know how well my theory would work IRL.

[-] eru@mouse.chitanda.moe 2 points 1 week ago

I would imagine companies would just filter it out.

Need some cleverer way of hiding it, or allow it to be self-hosted so that it has various URLs.

[-] GamingChairModel@lemmy.world 2 points 1 week ago

If I am reading this correctly, anyone who wants to use this service can configure their HTTP server to act as a man in the middle for the request, so that the crawler sees your URL but is actually retrieving Poison Fountain content from the Poison Fountain service.

If so, the crawlers wouldn't be able to filter by URL, because the request they send goes to your URL; the canonical URL of the Poison Fountain never appears in anything they fetch.

In other words, the handler is "self hosted" at your own URL, while the stream itself comes from an upstream URL that the crawler never sees.
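A minimal sketch of what that could look like as an nginx reverse-proxy rule; the location path and the upstream hostname here are placeholders I've made up, not the project's actual endpoints:

```nginx
# Inside your existing server { } block.
# Hypothetical config: requests under /notes/ on your own domain are answered
# with content fetched from the poison service, so crawlers only ever see your URLs.
location /notes/ {
    proxy_pass https://poison-fountain.example/stream/;
    proxy_set_header Host poison-fountain.example;
    proxy_ssl_server_name on;  # send SNI so the upstream TLS handshake works
}
```

Apache's ProxyPass, or a tiny script that fetches the feed and re-serves it, would work the same way.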

[-] CileTheSane@lemmy.ca 1 points 1 week ago

So it would be effective at preventing your site from being used as training data.

[-] vacuumflower@lemmy.sdf.org 1 points 1 week ago

If, say, I were optimistic about this technology but pessimistic about its current stage of development, I'd expect this to act as a cure. It's a problem they'll have to solve. A test they'll have to pass.

If somewhere inside those things someone builds a mechanism that constructs a graph of syllogisms, no kind of poisoned input data will be able to hurt them.

So this is a good thing, but when people call it a rebellion, it's not.

[-] kadu@scribe.disroot.org 2 points 1 week ago

Samsung and Anthropic have independently published data showing how little bad data it takes to effectively poison even very large models. LLMs pretend to be complex, but they aren't, and they won't keep improving at the initial rate we got used to seeing. Just ask OpenAI.

[-] vacuumflower@lemmy.sdf.org 1 points 1 week ago

I'm not talking about LLMs. I'm talking about future developments learning on LLMs, eventually there will be some resolutions of conflicting knowledge and logical connections, otherwise they won't become remotely as useful as advertised.

Gotcha. So it's something you have imagined.

[-] vacuumflower@lemmy.sdf.org 0 points 1 week ago

It is called imagination if it has not yet happened.

[-] vacuumflower@lemmy.sdf.org 1 points 1 week ago

What if it has happened in my imagination?

[-] FlashMobOfOne@lemmy.world 1 points 1 week ago

> A test they’ll have to pass.

This makes me chuckle, as they invented euphemisms like 'hallucinations' because their LLMs can't do what they promise. Fabulous marketing, but clearly they didn't do enough testing.

[-] Bazoogle@lemmy.world 0 points 1 week ago

> as they invented euphemisms like ‘hallucinations’

Seems like a pretty accurate word to use, no? Could also use fabrication, concoction, phantom, or something else? I think "lie" and its synonyms are not accurate, since that requires intent. Since the LLM does not have intent, it cannot "lie".

[-] GamingChairModel@lemmy.world 2 points 1 week ago

That's why "bullshit," as defined by Harry Frankfurt, is so useful for describing LLMs.

A lie is a false statement that the speaker knows to be false. But bullshit is a statement made by a speaker who doesn't care if it's true or false.

[-] Disillusionist@piefed.world 1 points 1 week ago

Not all problems may be cured immediately. Battles are rarely won with a single attack. A good thing is not the same as nothing.

[-] Blackmist@feddit.uk 1 points 1 week ago

With the amount of AI generated horseshit out there already, they've already pissed in the well.

[-] vane@lemmy.world 1 points 1 week ago

I run a roughly 10-20 GB GitHub/GitLab mirror, and I am constantly under attack from crawlers run by top US technology corporations and LLM startups. Whenever I ban one IP range they switch to another - I don't know if those fuckers have tickets in their systems to do it manually or if they just deploy this shit all over the planet. From what I observe while mitigating these attacks, the best way to poison them is to set up a Gitea instance with a poisoned code repository and a couple hundred revisions, because what they are most interested in is the HTML representation of the diff between two git revisions.

[-] Lembot_0006@programming.dev 0 points 1 week ago

Idiots: This new technology is still quite ineffective. Let's sabotage its improvement!

Imbeciles: Yeah!

[-] Disillusionist@piefed.world 0 points 1 week ago

AI companies could start, I don't know, maybe asking for permission to scrape a website's data for training? Or maybe try behaving more ethically in general? Perhaps then they might not risk people poisoning data that was clearly never agreed to be used for training?

[-] Lembot_0006@programming.dev 0 points 1 week ago

Why should they ask permission to read freely provided data? Nobody's asking for any permission, but LLM trainers somehow should? And what do you want from them from an ethical standpoint?

[-] GunnarGrop@lemmy.ml 1 points 1 week ago

Much of it might be freely available data, but there's a huge difference between you accessing a website and an LLM company's crawler doing the same thing at scale. We've had bots scraping websites since the '90s; it's not a new thing. And since scraping bots have existed, we've developed a standard on the web to deal with them, called "robots.txt": a text file telling bots what they are allowed to do on a website and how they should behave.
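For reference, a minimal robots.txt is just a plain-text file at the site root, along these lines (the specific user agent is only an example; GPTBot is OpenAI's crawler):

```
# Served at https://example.com/robots.txt
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
```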

LLM crawlers are notorious for disrespecting this, leading to situations where small companies and organisations have their websites scraped so thoroughly and frequently that they can't even stay online anymore, on top of skyrocketing operational costs. In the last few years we've had to develop tools just to protect ourselves against this - see the "Anubis" project.

Hence, it's much more important that LLM crawlers follow the rules than that you and I do so on an individual level.

It's the difference between you killing a couple of bees in your home versus an industry specialising in exterminating bees at scale. The efficiency is a big factor.

[-] BaroqueInMind@piefed.social 0 points 1 week ago

As someone who self-hosts a LLM and trains it on web data regularly to improve my model, I get where your frustration is coming from.

But engaging in discourse here, where people already have a heavy bias against machine-learning language models, is a fruitless effort. No one here is going to provide you catharsis with a genuine conversation that isn't rhetoric.

Just put the keyboard down and walk away.

[-] FaceDeer@fedia.io 1 points 1 week ago

I think it's worthwhile to show people that views outside of their like-minded bubble exist. One of the nice things about the Fediverse over Reddit is that the upvote and downvote tallies are both shown, so we can see that opinions are not a monolith.

Also, engaging in Internet debate is never about convincing the person you're actually talking to; that almost never happens. The point of debate is to present convincing arguments to the less-committed casual readers who are lurking rather than participating directly.

[-] FaceDeer@fedia.io -1 points 1 week ago

It doesn't work, but if it makes people feel better, I suppose they can waste their resources doing this.

Modern LLMs aren't trained on just whatever raw data can be scraped off the web any more. They're trained with synthetic data that's prepared by other LLMs and carefully crafted and curated. Folks are still thinking ChatGPT 3 is state of the art here.

[-] Taldan@lemmy.world 1 points 1 week ago

Let's say I believe you. If that's the case, why are AI companies still scraping everything?

[-] Disillusionist@piefed.world 1 points 1 week ago

From what I've heard, the influx of AI data is one of the reasons actual human data is becoming increasingly sought after. AI training AI has the potential to become a sort of digital inbreeding that suffers in areas like originality and other ineffable human qualities that AI still hasn't quite mastered.

I've also heard that this particular approach to poisoning AI is newer and thought to be quite effective, though I can't personally speak to its efficacy.

[-] XLE@piefed.social 0 points 1 week ago

Do you have any basis for this assumption, FaceDeer?

Based on your pro-AI-leaning comments in this thread, I don't think people should accept defeatist rhetoric at face value.

[-] FaceDeer@fedia.io 0 points 1 week ago

A basic Google search for "synthetic data llm training" will give you lots of hits describing how the process goes these days.

Take this as "defeatist" if you wish, as I said it doesn't really matter. In the early days of LLMs when ChatGPT first came out the strategy for training these things was to just dump as much raw data onto them as possible and hope quantity allowed the LLM to figure something out from it, but since then it's been learned that quality is better than quantity and so training data is far more carefully curated these days. Not because there's "poison" in it, just because it results in better LLMs. Filtering out poison will happen as a side effect.

It's like trying to contaminate a city's water supply by peeing in the river upstream of the water treatment plant drawing from it. The water treatment plant is already dealing with all sorts of contaminants anyway.

[-] FauxLiving@lemmy.world 0 points 1 week ago

That may be an argument if only large companies existed and they only trained foundation models.

Scraped data is most often used for fine-tuning models for specific tasks. For example, mimicking people on social media to push an ad/political agenda. Using a foundational model that speaks like it was trained on a textbook doesn't work for synthesizing social media comments.

In order to sound like a Lemmy user, you need to train on data that contains the idioms, memes and conversational styles used in the Lemmy community. That can't be created from the output of other models, it has to come from scraping.

Poisoning the data going to the scrapers will either kill the model during training or force everyone to pre-process their data, which increases the costs and expertise required to attempt such things.

[-] FaceDeer@fedia.io 0 points 1 week ago

Are you proposing flooding the Fediverse with fake bot comments in order to prevent the Fediverse from being flooded with fake bot comments? Or are you thinking more along the lines of that guy who keeps using "Þ" in place of "th"? Making the Fediverse too annoying to use for bot and human alike would be a fairly Pyrrhic victory, I would think.

[-] FauxLiving@lemmy.world 1 points 1 week ago

I am proposing neither of those things.

The way to use this effectively is to detect scraping through established means and, instead of banning the scraper, alter the output to feed it poisoned data instead of (or in addition to) the real content.

Banning a target tells them when they were detected and lets them alter their profile to avoid it. If they're never banned, they lose that information, and they also have to deploy additional resources to detect and remove the poisoned data.

Either way, it causes the adversary to spend a lot of resources at very little cost to you.
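A minimal sketch of that "poison instead of ban" idea, assuming a Flask app; the detection heuristic and the junk generator here are stand-ins (real detection would involve rate limiting, fingerprinting, etc.), and the bot user-agent substrings are only illustrative:

```python
import random
from flask import Flask, request

app = Flask(__name__)

# Illustrative user-agent substrings; a real deployment would use better signals.
KNOWN_SCRAPER_UAS = ("GPTBot", "CCBot", "Bytespider")

def looks_like_scraper(req) -> bool:
    ua = req.headers.get("User-Agent", "")
    return any(bot in ua for bot in KNOWN_SCRAPER_UAS)

def poisoned_text(n_words: int = 300) -> str:
    # Plausible-looking nonsense; a real setup might proxy the Poison Fountain
    # feed or a Markov-chain generator instead.
    words = ["the", "quantum", "marmalade", "refactors", "beneath", "seven",
             "ontologies", "while", "gerunds", "crystallize"]
    return " ".join(random.choices(words, k=n_words))

@app.route("/articles/<slug>")
def article(slug):
    if looks_like_scraper(request):
        # Same URL, same 200 status: the scraper gets no signal it was detected.
        return f"<html><body><p>{poisoned_text()}</p></body></html>"
    return f"<html><body><p>Real article content for {slug}.</p></body></html>"
```

Swapping the junk generator for the Poison Fountain feed would be the obvious upgrade.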

[-] FaceDeer@fedia.io 0 points 1 week ago

I have no idea what "established means" would be. In the particular case of the Fediverse it seems impossible, you can just set up your own instance specifically intended for harvesting comments and use that. The Fediverse is designed specifically to publish its data for others to use in an open manner.

[-] GamingChairModel@lemmy.world 0 points 1 week ago

> The Fediverse is designed specifically to publish its data for others to use in an open manner.

Sure, and if the AI companies want to configure their crawlers to actually use APIs and ActivityPub to efficiently scrape that data, great. The problem is that there have been crawlers that do things very inefficiently (whether through malice, ignorance, or misconfiguration) and scrape the HTML of sites repeatedly, driving up hosting costs and effectively DoSing some of the sites.

If you put Honeypot URLs in the mix and keep out polite bots with robots.txt and keep out humans by hiding those links, you can serve poisoned responses only to the URLs that nobody should be visiting and not worry too much about collateral damage to legitimate visitors.
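For instance (a hypothetical setup, not anything the project prescribes), the trap path is disallowed in robots.txt so polite bots skip it, and the link to it is invisible to humans, so only impolite crawlers ever request it:

```html
<!-- robots.txt already contains "Disallow: /trap/", and this link is hidden from
     human visitors, so only crawlers that ignore robots.txt will follow it.
     Everything served under /trap/ can then be poisoned without collateral damage. -->
<a href="/trap/archive-2019" style="display:none" rel="nofollow">archive</a>
```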

[-] FaceDeer@fedia.io 0 points 1 week ago

I have a sneaking suspicion that the vast majority of the people raging about AIs scraping their data are not raging about it being done inefficiently.

[-] badgermurphy@lemmy.world 1 points 1 week ago

Maybe not, but that's at least in part because they don't understand what the previous poster said. If the scrapers harvested data more efficiently, using API calls instead of scraping the whole domain, it would be much less burdensome on the target's server resources, and one would think people would be less annoyed by that than by the same harvesting done with that burden.

Their grievances with LLMs and their owners may not be limited to that, but they are certainly likely to include it.

[-] zr0@lemmy.dbzer0.com -1 points 1 week ago

This is just stupid^20

[-] nutsack@lemmy.dbzer0.com -2 points 1 week ago* (last edited 1 week ago)

I don't think this is a good idea. The pollution spreads. This would corrupt the collective knowledge of humanity a little faster than AI is already doing.

[-] phoenixz@lemmy.ca 1 points 1 week ago

Nah, AI will already do that automatically, because any system loses something to inefficiency. It's like putting a theoretical 100 miles' worth of gas in your tank and getting 20 in practice, because the combustion engine is only about 30% efficient, plus air and tire resistance, etc.

AI has the same problem with information: what comes out is always only a fraction of the 100% that went in.

Since poisoning the pool makes AI unreliable up to the point where it becomes useless, it has the potential to stop the AI madness. I'd be all for that.
