Based on this graph, and this graph alone, guess at what time I completely blocked OpenAI crawlers (jlai.lu)

submitted 1 week ago by hylobates@jlai.lu to c/selfhosted@lemmy.world

39 comments fedilink hide all child comments

I really hope they die soon, this is unbearable…

all 40 comments

sorted by: hot top controversial new old

[-] ptz@dubvee.org 22 points 1 week ago

I was blocking them but decided to shunt their traffic to Nepenthes instead. There's usually 3-4 different bots thrashing around in there at any given time.

If you have the resources, I highly recommend it.

[-] Petter1@discuss.tchncs.de 19 points 1 week ago

Reference for lazy ones: https://zadzmo.org/code/nepenthes/

[-] michael@piefed.chrisco.me 11 points 1 week ago

Oh interesting! Ive done something similar but not didnt put as much effort.

For me, I just made an unending webpage that would create a link to another page...that would say bullshit. Then it would have another link with more bullshit....etc...etc...And it gets slower as time goes on.

Also made a fail2ban banning IPs that reached a certain number of links down. It worked really well, traffic is down 95% and it does not affect any real human users. Its great :)

I have a robots.txt that should tell them not to look at the sites. But if they dont want to read it, I dont want to be nice.

[-] timestatic@feddit.org 5 points 1 week ago

This... is fucking amazing

[-] TropicalDingdong@lemmy.world 5 points 1 week ago

Bruh if you had a live stream of this I would subscribe to your only fans.

[-] KairuByte@lemmy.dbzer0.com 5 points 1 week ago

I… I don’t know how you’d even stream that? A log of pages loaded?

[-] TropicalDingdong@lemmy.world 7 points 1 week ago

A log of pages loaded?

Keep going I'm almost there...

[-] queerlilhayseed@piefed.blahaj.zone 6 points 1 week ago

Requests per second getting higher, and higher, then they level out -- but the server is just barely hanging in there, frantically serving as many requests as it possibly can, and then all at once they come crashing down into warm, gentle waves of relaxing human pings.

[-] scrubbles@poptalk.scrubbles.tech 2 points 1 week ago

How do you do that, I'm very interested! Also good to see you Admiral!

[-] ptz@dubvee.org 3 points 1 week ago* (last edited 1 week ago)

Thanks!

Mostly there's three steps involved:

Setup Nepenthes to receive the traffic
Perform bot detection on inbound requests (I use a regex list and one is provided below)
Configure traffic rules in your load balancer / reverse proxy to send the detected bot traffic to Nepenthes instead of the actual backend for the service(s) you run.

Here's a rough guide I commented a while back: https://dubvee.org/comment/5198738

Here's the post link at lemmy.world which should have that comment visible: https://lemmy.world/post/40374746

You'll have to resolve my comment link on your instance since my instance is set to private now, but in case that doesn't work, here's the text of it:

So, I set this up recently and agree with all of your points about the actual integration being glossed over.

I already had bot detection setup in my Nginx config, so adding Nepenthes was just changing the behavior of that. Previously, I had just returned either 404 or 444 to those requests but now it redirects them to Nepenthes.

Rather than trying to do rewrites and pretend the Nepenthes content is under my app's URL namespace, I just do a redirect which the bot crawlers tend to follow just fine.

There's several parts to this to keep my config sane. Each of those are in include files.

An include file that looks at the user agent, compares it to a list of bot UA regexes, and sets a variable to either 0 or 1. By itself, that include file doesn't do anything more than set that variable. This allows me to have it as a global config without having it apply to every virtual host.
An include file that performs the action if a variable is set to true. This has to be included in the server portion of each virtual host where I want the bot traffic to go to Nepenthes. If this isn't included in a virtual host's server block, then bot traffic is allowed.
A virtual host where the Nepenthes content is presented. I run a subdomain (content.mydomain.xyz). You could also do this as a path off of your protected domain, but this works for me and keeps my already complex config from getting any worse. Plus, it was easier to integrate into my existing bot config. Had I not already had that, I would have run it off of a path (and may go back and do that when I have time to mess with it again).

The map-bot-user-agents.conf is included in the http section of Nginx and applies to all virtual hosts. You can either include this in the main nginx.conf or at the top (above the server section) in your individual virtual host config file(s).

The deny-disallowed.conf is included individually in each virtual hosts's server section. Even though the bot detection is global, if the virtual host's server section does not include the action file, then nothing is done.

Files

map-bot-user-agents.conf

Note that I'm treating Google's crawler the same as an AI bot because....well, it is. They're abusing their search position by double-dipping on the crawler so you can't opt out of being crawled for AI training without also preventing it from crawling you for search engine indexing. Depending on your needs, you may need to comment that out. I've also commented out the Python requests user agent. And forgive the mess at the bottom of the file. I inherited the seed list of user agents and haven't cleaned up that massive regex one-liner.

# Map bot user agents
## Sets the $ua_disallowed variable to 0 or 1 depending on the user agent. Non-bot UAs are 0, bots are 1

map $http_user_agent $ua_disallowed {
    default 		0;
    "~PerplexityBot"	1;
    "~PetalBot"		1;
    "~applebot"		1;
    "~compatible; zot"	1;
    "~Meta"		1;
    "~SurdotlyBot"	1;
    "~zgrab"		1;
    "~OAI-SearchBot"	1;
    "~Protopage"	1;
    "~Google-Test"	1;
    "~BacklinksExtendedBot" 1;
    "~microsoft-for-startups" 1;
    "~CCBot"		1;
    "~ClaudeBot"	1;
    "~VelenPublicWebCrawler"	1;
    "~WellKnownBot"	1;
    #"~python-requests"	1;
    "~bitdiscovery"	1;
    "~bingbot"		1;
    "~SemrushBot" 	1;
    "~Bytespider" 	1;
    "~AhrefsBot" 	1;
    "~AwarioBot"	1;
#    "~Poduptime" 	1;
    "~GPTBot" 		1;
    "~DotBot"	 	1;
    "~ImagesiftBot"	1;
    "~Amazonbot"	1;
    "~GuzzleHttp" 	1;
    "~DataForSeoBot" 	1;
    "~StractBot"	1;
    "~Googlebot"	1;
    "~Barkrowler"	1;
    "~SeznamBot"	1;
    "~FriendlyCrawler"	1;
    "~facebookexternalhit" 1;
    "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|
^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfl
y|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTT
rack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|
^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErs
PRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^Qu
eryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanne
r|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^Void
EYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^
WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;

}

deny-disallowed.conf

# Deny disallowed user agents
if ($ua_disallowed) { 
    # This redirects them to the Nepenthes domain. So far, pretty much all the bot crawlers have been happy to accept the redirect and crawl the tarpit continuously 
	return 301 https://content.mydomain.xyz/;
}

[-] scrubbles@poptalk.scrubbles.tech 1 points 1 week ago

Thank you! I'm going to start playing with this and see what I can figure out! I'll be referencing this frequently!

[-] ptz@dubvee.org 2 points 1 week ago* (last edited 1 week ago)

Maybe I should flesh it out into an actual guide. The Nepenthes docs are "meh" at best and completely gloss over integrating it into your stack.

You'll also need to give it corpus text to generate slop from. I used transcripts from 4 or 5 weird episodes of Voyager (let's be honest: shit got weird on Voyager lol), mixed with some Jack Handy quotes and a few transcripts of Married...with Children episodes.

https://content.dubvee.org/ is where that bot traffic lands up if you want to see what I'm feeding them.

[-] wonderingwanderer@sopuli.xyz 1 points 1 week ago

That sounds like iocaine and the book of infinity

[-] early_riser@lemmy.world 15 points 1 week ago

It's already hard enough for self-hosters and small online communities to deal with spam from fleshbags, now we're being swarmed by clankers. I have a little Mediawiki to document my ~~deranged maladaptive daydreams~~ worldbuilding and conlanging projects, and the only traffic besides me is likely AI crawlers.

I hate this so much. It's not enough that huge centralized platforms have the network effect on their side, they have to drown our quiet little corners of the web under a whelming flood of soulless automata.

[-] wonderingwanderer@sopuli.xyz 7 points 1 week ago

Anubis is supposed to filter out and block all those bots from accessing your webpage.

Iocaine, nepenthes, and/or madore's book of infinity are intended to redirect them into a maze of randomly generated bullshit, which still consumes resources but is intended to poison the bots' training data.

So pick your poison

[-] MonkeMischief@lemmy.today 4 points 1 week ago

Iocaine, nepenthes, and/or madore's book of infinity are intended to redirect them into a maze of randomly generated bullshit

We've officially reached a place where cyberspace is beginning to look like communing with the arcane. Lol

[-] NewNewAugustEast@lemmy.zip 6 points 1 week ago* (last edited 1 week ago)

I was up 10 to 20 percent month over month, and suddenly up 1000% it has spiked hard and they all are data harvesters.

I know I am going to start blocking them, which is too bad, I put valuable technical information up, with no advertising, because I want to share it. And I don't even really mind indexers or even AI learning about it. But I cannot sustain this kind of bullshit traffic, so I will end up taking a heavy hand and blocking everything, and then no one will find it.

[-] Thorry@feddit.org 9 points 1 week ago

Yeah I had the same thing. All of a sudden the load on my server was super high and I thought there was a huge issue. So I looked at the logs and saw an AI crawler absolutely slamming my server. I blocked it, so it only got 403 responses but it kept on slamming. So I blocked the IPs it was coming from in iptables, that helped a lot. My little server got about 10000 times the normal traffic.

I sorta get they want to index stuff, but why absolutely slam my server to death? Fucking assholes.

[-] Ephera@lemmy.ml 7 points 1 week ago

My best guess is that they don't just index things, but rather download straight from the internet when they need fresh training data. They can't really cache the whole internet after all...

[-] Techlos@lemmy.dbzer0.com 3 points 1 week ago

Bingo, modern datasets are a list of URL's with metadata rather than the files themselves. Every new team/individual wanting to work with the dataset becomes another DDoS participant.

[-] spicehoarder@lemmy.zip 1 points 1 week ago

The sad thing is that they could cache the whole internet if there was a checksum protocol.

Now that I'm thinking about it, I actually hate the idea that there are several companies out there with graph databases of the entire internet.

[-] punrca@piefed.world 7 points 1 week ago

It's best to use either Cloudflare (best IMO) or Anubis.

If you don't want any AI bots, then you can setup Anubis (open source; requires JavaScript to be enabled by the end user): https://github.com/TecharoHQ/anubis
Cloudflare automatically setups robots.txt file to block "AI crawlers" (but you can setup to allow "AI search" for better SEO). Eg: https://blog.cloudflare.com/control-content-use-for-ai-training/#putting-up-a-guardrail-with-cloudflares-managed-robots-txt

Cloudflare also has an option of "AI labyrinth" to serve maze of fake data to AI bots who don't respect robots.txt file.

[-] eli@lemmy.world 4 points 1 week ago

I ended up just pushing everything behind my tailnet and only leave my game server ports open(which are non-standard ports).

[-] e8CArkcAuLE@piefed.social 3 points 1 week ago* (last edited 1 week ago)

that’s the kind of shit we pollute our air and water for…and properly seal and drive home the fuckedness of our future and planet.

i totally get you sending them to nepenthes though.

[-] JustinTheGM@ttrpg.network 1 points 1 week ago

I'm gonna guess 17:25:20

[-] quick_snail@feddit.nl 1 points 1 week ago

Y'all need to learn to cache things, shiit

[-] poVoq@slrpnk.net 1 points 1 week ago

This is not how things work on the modern web. Did you just wake up from a 20 year coma?

[-] FukOui@lemmy.zip 1 points 1 week ago

What visualisation app is this?

[-] CoreLabJoe@piefed.ca 1 points 1 week ago

Blocking them locally is one way, but if you're already using cloudflare there's a nice way to do it UPSTREAM so it's not eating any of your resources.

You can do geofencing/blocking and bot-blocking via Cloudflare:
https://corelab.tech/cloudflarept2/

[-] x00z@lemmy.world 0 points 1 week ago

50% of my traffic is scrapers now. I really want to block them but I also want my content to be indexed and used for LLMs. At the moment there isn't really an in-between way of doing that. :(

(This is with me knowing they fuck up the electricity nets and memory chips, I'm just hoping that gets better soon.)

[-] Anarki_@lemmy.blahaj.zone 1 points 1 week ago

Why do you want your stuff in the lie machines? 🤔

[-] lost_screwdriver@thelemmy.club 0 points 1 week ago

That they do not become lie machines. Propaganda, lies and fake news from various different sources gets spammed all across the internet. If AI picks it up, it can just spread misinformation, especially if all trustworthy or useful sources block them

[-] poVoq@slrpnk.net 1 points 1 week ago

This will just make them sound more believable when they hallucinate. LLMs can conceptually not be made to not lie, even if all the info they are trained on is 100% accurate.

[-] Decronym@lemmy.decronym.xyz -1 points 1 week ago* (last edited 1 week ago)

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I've seen in this thread:

Fewer Letters	More Letters
HTTP	Hypertext Transfer Protocol, the Web
IP	Internet Protocol
nginx	Popular HTTP server

2 acronyms in this thread; the most compressed thread commented on today has 4 acronyms.

[Thread #90 for this comm, first seen 13th Feb 2026, 17:41] [FAQ] [Full list] [Contact] [Source code]

[-] A_A@lemmy.world -1 points 1 week ago

vendredi à 16h30 ... curieusement, personne n'essaie de répondre à ta question 😋

[-] zr0@lemmy.dbzer0.com -2 points 1 week ago

And what was the reason for blocking them? What is unbearable?

[-] Ephera@lemmy.ml 2 points 1 week ago

They cause a huge amount of load, deteriorating the service for everyone else. I'm also guessing the time ranges in the graph, where there's no data, is when OP's server crashed from the load and had to restart.

That kind of shit can easily trigger alerting and will look like a DDoS attack. I would be pissed, too, if I dropped everything to see why my server is going down and it's not even proper criminals, but rather just some silicon valley cunts.

[-] zr0@lemmy.dbzer0.com 2 points 1 week ago

Thanks for your time explaining. I have multiple public facing services and I never had any issues with load just because of some crawlers. That’s why I always wonder why people get so mad at them

[-] GreenKnight23@lemmy.world -2 points 1 week ago

it's pretty rare for dumbasses to point themselves out these days.

you're doing gods work son. keep it up!

this post was submitted on 13 Feb 2026

70 points (100.0% liked)

Selfhosted

56763 readers

266 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.
No spam posting.
Posts have to be centered around self-hosting. There are other communities for discussing hardware or home computing. If it's not obvious why your post topic revolves around selfhosting, please include details to make it clear.
Don't duplicate the full text of your blog or github here. Just post the link for folks to click.
Submission headline should match the article title (don’t cherry-pick information from the title to fit your agenda).
No trolling.
No low-effort posts. This is subjective and will largely be determined by the community member reports.

Resources:

selfh.st Newsletter and index of selfhosted software and apps
awesome-selfhosted software
awesome-sysadmin resources
Self-Hosted Podcast from Jupiter Broadcasting

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

founded 2 years ago

MODERATORS

HybridSarcasm@lemmy.world

HybridSarcasm@lemmy.hybridsarcasm.xyz