After yet another bot scraping wave forced me to do sysadmin work at 3am, and after I ranted about it on lemmy, @self@awful.systems linked me to a post that referenced iocaine, which sounded like a perfect way to get back at bots that don't respect our resources and time.
At the same time, we recently on-boarded @tenchiken@lemmy.dbzer0.com as an extra sysadmin to reduce the "bus factor" of our instance (say hello), and they graciously offered some spare compute they had lying around. So I thought, since serving iocaine to bots doesn't really require any serious uptime, why not put those resources to good use?
So after a couple of hours of messing around with things, I've now deployed iocaine to protect our instance as well as fediseer. This should hopefully start messing with these bastards in return by serving them some surrealistic nonsense I had squirreled away.
If you want to see this in action, set your user agent to GPTBot and visit our instance. If you find yourself trapped in iocaine somehow, just let us know.
Hi, I really hope iocaine works for you, but I think it still might be wise to temper expectations. Some background: I work in bot detection and mitigation. I quickly read through their code and documentation, but I don't see the main detection mechanism that determines human vs bot, other than what you mentioned as an example. If it's user agent based, it is trivially easy to spoof, as you already know. I am finding in my work that these companies do not keep the user agent they report in their documentation when challenged.
My second concern was the page the reverse proxy served when spoofing my user agent. The DOM was nowhere close to that of Lemmy and I think it's important to point out that a simple check for specific elements on the page will keep the bot from poisoning itself.
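To show how little such a check takes, here is a minimal sketch in Python of a scraper-side test for expected page elements. Everything in it is an assumption for illustration: the function name, the idea of keying on element ids, and the id `app` are hypothetical and not taken from Lemmy's actual markup or from any real bot.

```python
from html.parser import HTMLParser


class _IdCollector(HTMLParser):
    """Collects the value of every id attribute seen in the document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def looks_like_expected_page(html: str, required_ids: set[str]) -> bool:
    """Return True only if every required element id is present.

    required_ids is a hypothetical fingerprint of the real site; a
    scraper could skip any page that fails this test.
    """
    collector = _IdCollector()
    collector.feed(html)
    return required_ids <= collector.ids
```

A decoy page that does not reproduce those elements would fail the test and could be discarded before it ever reaches a training set.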
I admit I could be too close to this problem to see other solutions, and I really hope it works. It sucks that this is a problem. I wish there were more open source options too.
If for some reason this solution doesn't work, and if anyone is interested in help, I am more than happy to freely offer my knowledge.
Thanks. Iocaine doesn't do detection, it only does the poisoning. The detection is currently manual. We do it based on user agents and IP ranges. These bots are extraordinarily stupid atm, which is the biggest issue. The ones causing our downtime were constantly hitting obsolete domains and stupid links. They are very, very crude. They are not yet sophisticated enough to check the DOM, but they can tell when they've been blocked and switch to proxies. Sending them to iocaine is meant to keep them from realizing they're blocked.
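As a rough illustration of matching on user agents and IP ranges, here is a minimal Python sketch of the routing decision. It is not the instance's actual configuration: the function name is hypothetical, the substring list holds only the GPTBot example from the post, and the network is the reserved documentation range 203.0.113.0/24 used as a placeholder.

```python
import ipaddress

# Hypothetical, hand-maintained lists; placeholders, not real data.
BOT_AGENT_SUBSTRINGS = ("gptbot",)
BOT_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # documentation range


def should_poison(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request should be sent to the decoy backend."""
    ua = user_agent.lower()
    if any(fragment in ua for fragment in BOT_AGENT_SUBSTRINGS):
        return True
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in network for network in BOT_NETWORKS)
```

In a real deployment this decision would more likely live in the reverse proxy's own configuration than in application code. The point is only that the matched request is silently routed to the decoy instead of being refused.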
Obviously someone smart can easily defeat it, even by just respecting our resources. But these fuckers are very greedy atm. We'll have to evolve along with them.
For what it's worth, this is just damage control and a first step. Deployment was trivial compared to most other ideas, so it seemed worth at least giving it a go.
Our expectations are very much tempered, but we're trying to be optimistic about even a small reprieve.
Thanks for the DOM detail!
I feel you and feel for you. I really do hope you get a reprieve because dealing with this is nonsense.