submitted 2 months ago by flango@lemmy.eco.br to c/asklemmy@lemmy.world

I've been following the bearblog developer's struggle to manage the ongoing war between bot scrapers and the people trying to keep a safe, human-oriented internet. What is Lemmy doing about bot scrapers?

Some context from bearblog dev

The great scrape

https://herman.bearblog.dev/the-great-scrape/

LLMs feed on data. Vast quantities of text are needed to train these models, which are in turn receiving valuations in the billions. This data is scraped from the broader internet, from blogs, websites, and forums, without the authors' permission, with all content treated as opted in by default.

Needless to say, this is unethical. But as Meta has proven, it's much easier to ask for forgiveness than permission. It is unlikely they will be ordered to "un-train" their next generation models due to some copyright complaints.

Aggressive bots ruined my weekend

https://herman.bearblog.dev/agressive-bots/

It's more dangerous than ever to self-host, since simple mistakes in configurations will likely be found and exploited. In the last 24 hours I've blocked close to 2 million malicious requests across several hundred blogs.

What's wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I'm still speculating here, but I think app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.

[-] mesamunefire@piefed.social 1 points 2 months ago

I have a Python script that uses fail2ban to block anyone who clicks a certain link 3 or more times. The page literally says "don't click this unless you are a bot", then times them out for a day. It's worked well on simple sites.
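
A minimal sketch of that kind of honeypot setup (not the commenter's actual script) might look like this, assuming an nginx access log and a pre-configured fail2ban jail named "honeypot"; the trap path, log location, and jail name are placeholders:

```python
import re
import subprocess
from collections import Counter

ACCESS_LOG = "/var/log/nginx/access.log"            # assumption: default nginx log path
TRAP_PATH = "/dont-click-this-unless-you-are-a-bot" # hypothetical trap URL
THRESHOLD = 3                                       # hits before a ban
JAIL = "honeypot"                                   # assumption: jail already defined in fail2ban

hits = Counter()
line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)')

with open(ACCESS_LOG) as log:
    for line in log:
        m = line_re.match(line)
        if not m:
            continue
        ip, path = m.groups()
        if path.startswith(TRAP_PATH):
            hits[ip] += 1

for ip, count in hits.items():
    if count >= THRESHOLD:
        # Hand the offender to fail2ban; the jail's bantime (e.g. one day)
        # handles the timeout.
        subprocess.run(["fail2ban-client", "set", JAIL, "banip", ip], check=False)
```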

[-] joyjoy@lemmy.zip 1 points 2 months ago* (last edited 2 months ago)

I'm reminded of the joke where someone explains their plan to rob a bank, and are then told that's called having a job.

Anyway, the best way to scrape Lemmy is to launch your own instance, and the other instances will just send you all the posts.

[-] irelephant@lemmy.dbzer0.com 1 points 2 months ago

With ActivityPub, all the posts are easy to scrape (just add an extra header: Accept: application/activity+json), but most scrapers won't bother to do that, and scrape the frontend of instances instead.
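
For reference, fetching a post as ActivityPub JSON with that extra header might look like the following sketch; the post URL is a hypothetical example:

```python
import requests

url = "https://lemmy.world/post/12345"  # placeholder: any Lemmy post URL
resp = requests.get(url, headers={"Accept": "application/activity+json"})
resp.raise_for_status()

post = resp.json()  # an ActivityPub object describing the post
print(post.get("type"), post.get("id"))
```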

A lot of instances have deployed Anubis or Cloudflare to block scrapers. My instance has iocaine set up, iirc.

[-] Zak@lemmy.world 0 points 2 months ago

If you're concerned about bots ingesting the content, that's impossible to prevent in an open federated system.

[-] Krudler@lemmy.world 1 points 2 months ago

I'm not entirely sure that's what the concern is; I think it's that the writer is describing such an obscene influx of bot traffic that it must be a nightmare to maintain and pay for.

[-] Rhaedas@fedia.io 0 points 2 months ago

It's a version of the age-old question of how you keep someone from stealing your images while still being able to show them. No one can see an image without having already downloaded it. The best you can do is layer in things like watermarks to make cleaning it into a "pure" version not worth the trouble. Same with text: poison it so it's less valuable without a lot of extra work.

[-] _cryptagion@anarchist.nexus 1 points 2 months ago

You can’t poison text in a way that’s meaningful to LLMs without making it indecipherable to humans.

[-] sad_detective_man@sopuli.xyz 0 points 2 months ago

I mean, Reddit text is poisoned by virtue of being highly unhinged. It's probably one of the best reasons not to use AI right now, since its dataset is being formed from literal redditors.

Maybe we just gotta toxify it up here a bit

[-] FaceDeer@fedia.io 1 points 2 months ago

Doesn't seem to have negatively impacted AI much.

[-] BCOVertigo@lemmy.world -1 points 2 months ago

Accepting that your premise is true for individual texts, there seems to be a fairly flat number of poisoned docs needed regardless of total training corpus size. So the question is how to sneak that many docs into the corpus.

https://arxiv.org/abs/2510.07192

[-] Zak@lemmy.world 1 points 2 months ago

That's DRM, and it only works if everyone is accessing the information on devices they don't fully control.

[-] radix@lemmy.world 0 points 2 months ago

It's weird that this has become such a controversial opinion. The internet is supposed to be open and available. "Information wants to be free." It's the big gatekeepers who want to keep all their precious data locked away in their own hoard behind paywalls and logins.

If some clanker is going to read my words, it's a very small price to pay for people being able to do the same.

[-] FaceDeer@fedia.io 0 points 2 months ago

It's a classic case of people being all for freedom until all of a sudden they think it negatively impacts them personally in some vague abstract way.

An AI training off of my words costs me nothing. It doesn't harm me at all. Frankly, I like the notion that future AIs are in some small part aligned based off of my views as expressed through my writing.

[-] Nemo@slrpnk.net 0 points 2 months ago

My primary instance, slrpnk.net, has Anubis set up. I'm not quite sure how it works, but it seems to force some kind of delay that is hardly noticeable to human users but times out automatic requests.

[-] henfredemars@infosec.pub 0 points 2 months ago

It works by asking your system for a small computation before handling the request. It’s not too intrusive for normal users, but it drives up the costs for bot farms.
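
A rough sketch of the general proof-of-work idea (not Anubis's actual protocol), assuming a SHA-256 challenge with a fixed difficulty target:

```python
# The server issues a random challenge; the client must find a nonce whose
# hash has a number of leading zero bits before the request is served.
import hashlib
import os

DIFFICULTY_BITS = 20  # assumption: cheap for one visitor, costly at scraper scale
TARGET = 1 << (256 - DIFFICULTY_BITS)

def solve(challenge: bytes) -> int:
    """Find a nonce such that sha256(challenge + nonce) is below the target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < TARGET:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int) -> bool:
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < TARGET

challenge = os.urandom(16)       # server-issued random challenge
nonce = solve(challenge)         # client burns CPU here
assert verify(challenge, nonce)  # server-side check is a single hash
```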

[-] ewigkaiwelo@lemmy.world -1 points 2 months ago

So it doesn't stop LLMs from data farming, but makes them spend more energy doing so? If that's the case, it sounds like it's making things even worse.

[-] DeathByBigSad@sh.itjust.works 0 points 2 months ago

You can do a Sxan Maneuver and add thorns into your "th"s.

Like þis.

(Okay maybe don't actually do it, Lemmy is gonna downvote you lol)

[-] turdas@suppo.fi 2 points 2 months ago* (last edited 2 months ago)

The second-worst part about this guy is that he replaces all th's with the thorn, but phonetically the thorn should only be used for the voiceless dental fricative (the sound at the beginning of thorn) while the voiced dental fricative (the sound at the beginning of though, or indeed this) should use the eth (ð).

The worst part, of course, is the fact that he posts in the first place.

[-] _cryptagion@anarchist.nexus 1 points 2 months ago

That doesn’t actually do anything. LLMs have no issue figuring out tricks like that. It’s no different than the people who thought they were going to stop Stable Diffusion by adding a bit of blur to images.

[-] FaceDeer@fedia.io 1 points 2 months ago

If anything it's helpful to AI training. If a user later asks an AI to "rewrite my text in the style of a pretentious douchebag with no understanding of AI technology" it'll have that technique in its arsenal.

[-] IsoKiero@sopuli.xyz 1 points 2 months ago

English is not my native language, and for whatever reason that makes the text almost unreadable. But no worries, I can feed it to Copilot to clean up:

Can you replace those strange characters to normal from this text: Beautiful! I had þis vinyl, once. Lost wiþ so many þings over þe course of a life.

Absolutely! Here's your cleaned-up version with the unusual characters replaced by their standard English equivalents:

"Beautiful! I had this vinyl, once. Lost with so many things over the course of a life."

Let me know if you'd like it stylized or rewritten in a different tone—poetic, nostalgic, modern, anything you like.

[-] irelephant@lemmy.dbzer0.com 1 points 2 months ago

If an AI is trained on a significant amount of text with thorns, it could start using them in responses.

[-] turdas@suppo.fi 1 points 2 months ago

Lemmy could grow a thousandfold and everyone here could write their posts using thorns instead of the th digraph, and it would still be less than a completely imperceptible blip in the training data. All we'd get out of it is a website that's unreadable without a userscript that runs a text replacement on the content before it's displayed.
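
For what it's worth, that replacement is trivial; a sketch in Python (rather than an actual browser userscript) mapping thorn and eth back to "th":

```python
# Map thorn/eth back to "th" before displaying the text.
THORN_MAP = str.maketrans({"þ": "th", "Þ": "Th", "ð": "th", "Ð": "Th"})

def dethorn(text: str) -> str:
    return text.translate(THORN_MAP)

print(dethorn("Beautiful! I had þis vinyl, once. Lost wiþ so many þings."))
# -> Beautiful! I had this vinyl, once. Lost with so many things.
```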
