281 points · submitted 5 days ago* (last edited 5 days ago) by db0@lemmy.dbzer0.com to c/div0@lemmy.dbzer0.com

Ahoy m@tes, the scraping bot situation has been escalating recently, as you may have already noticed from the recent site instability and 5xx error responses. @tenchiken@anarchist.nexus has been scrambling to block new scraping subnets as they appear, but these assholes keep jumping providers, so it's been an endless loop of constant firefighting.

I finally had enough and decided to onboard a Proof-of-Work countermeasure, very much like Anubis, which has become very popular on the fediverse lately. However, I went with Haphash, which is designed specifically around haproxy (our reverse proxy of choice) and should hopefully be much more lightweight.

The new PoW shield has already been activated on both the Divisions by Zero and on Fediseer. It's not active on all URLs, but it should be protecting the ones with the most impact on our database, which is what was causing the actual issue. You should occasionally notice a quick loading screen while it verifies you.

We've already seen a significant reduction in 5xx HTTP errors, as well as a slight reduction in traffic, so we're hoping this will have a positive impact on our situation.

Please do let us know if you run into any issues, and also let us know if you feel any difference in responsiveness. The first m@tes already feel it's all snappier, but that might just be placebo.

And let's hope the next scraping wave is not pwned residential botnets, or we're all screwed >_<

[-] henfredemars@infosec.pub 71 points 5 days ago

Fighting the good fight. The Internet should be for people, not robots.

[-] zr0@lemmy.dbzer0.com 41 points 5 days ago

Scraping is neither new, nor always malicious. Without scraping, no search engine would work and there would be no archive.org wayback machine.

However, AI scrapers all copy the same shit over and over again and do not intend to lead traffic to your site. They just cause cost and don’t give anything in return. This is the problem.

[-] rumba@lemmy.zip 9 points 4 days ago

Honestly, my head says Lemmy should be search-indexed to drive traffic here, but my heart says I don't need Lemmy to be indexed by Google to enjoy it, and I'd rather not have the rest of Reddit over here stinking up the place :)

[-] aeshna_cyanea@lemmy.dbzer0.com 1 points 2 days ago

Imo, in an ideal world Google would be publicly owned and would cooperate with other sites to voluntarily share their own self-indexes

[-] rumba@lemmy.zip 1 points 2 days ago

Interesting thought... do you think we would?

[-] dreadbeef@lemmy.dbzer0.com 7 points 4 days ago

The web is for user agents

[-] ramble81@lemmy.zip 40 points 5 days ago

I do not understand the point of scraping Lemmy. Just set up your own instance, or hell, just mimic the open ActivityPub protocol and get all the results delivered to you as nicely packaged JSON to parse however you want.

[-] db0@lemmy.dbzer0.com 49 points 5 days ago

That's entirely too much work when they can just make one vibe-coded bot and offload the scraping costs onto us as externalities

[-] MathiasTCK@lemmy.world 8 points 4 days ago

Externalities all the way down as usual.

[-] RickyRigatoni@retrolemmy.com 5 points 4 days ago

When Lemmy gets popular enough, they'll be able to vibe-code a working Lemmy Docker container

[-] x1gma@lemmy.world 15 points 5 days ago

Yes, but that's work and costs the scrapers resources, while siccing some off-the-shelf scraper on the site is simpler and cheaper.

[-] irelephant@lemmy.dbzer0.com 6 points 5 days ago

They're interested in more than lemmy.

[-] p03locke@lemmy.dbzer0.com 28 points 5 days ago

What's "Proof-of-Work" in this context? I've see it used quite a bit with crypto-mining.

[-] CameronDev@programming.dev 57 points 5 days ago

Proof of work means that your client has to do some "work" in order to gain access. It typically means a challenge that can't be trivially solved, but can be trivially verified.

For example, the challenge may be something to the effect of:

"Give me a string, that when hashed by md5, results in a hash that ends in 1234".

Your browser can then start brute-forcing until it finds such a string (it should take a few seconds at most), and then it passes the string back to the server. The server can verify it with a single hash, and you're in.

It's not wildly different from crypto mining, but the difficulty is much lower for anti-bot use, as it needs to be solvable within seconds even on low-end devices.
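
A minimal sketch of that idea in Python (a hypothetical scheme for illustration, not Haphash's actual algorithm — SHA-256 here instead of the MD5 example above, and the `solve`/`verify` names and 4-hex-digit difficulty are made up):

```python
import hashlib
import secrets

def solve(challenge: str, suffix: str = "1234") -> int:
    """Client side: grind through nonces until the hash ends in the target suffix."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.endswith(suffix):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, suffix: str = "1234") -> bool:
    """Server side: a single hash is enough to confirm the work was done."""
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest().endswith(suffix)

challenge = secrets.token_hex(16)  # random challenge issued by the server
nonce = solve(challenge)           # costly for the client (~65k hashes on average)
assert verify(challenge, nonce)    # cheap for the server
```

The asymmetry is the whole point: the client burns tens of thousands of hashes, while the server spends one.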

[-] Infernal_pizza@lemmy.dbzer0.com 9 points 5 days ago

What stops the bots just solving it?

Two things: First, bots typically don't run JavaScript. No JS, no entry. A user stuck on an endless loading screen can temporarily enable JS, but a scraper won't.

Second, the fact that they'd need to solve it for every single bot and every single site they scrape. It's a low barrier for regular users, but it's astronomical for scrapers running hundreds of thousands of bots.

[-] kernelle@lemmy.dbzer0.com 9 points 5 days ago

Cost of electricity, for the most part. It isn't unheard of for a scraper to hit hundreds of URLs per second; adding this should cut the same scraper's speed by 30-70%, depending on the request.

[-] p03locke@lemmy.dbzer0.com 4 points 5 days ago

Funny, HTTPS is computationally expensive for similar reasons, but I guess this system works across sessions, with a front-loaded cost.

[-] CameronDev@programming.dev 4 points 5 days ago

I think they're on different scales; there's no brute-forcing involved in HTTPS/SSL.

[-] socsa@piefed.social 2 points 5 days ago

I guess the bots don't know how to rainbow table.

[-] CameronDev@programming.dev 5 points 5 days ago

It's usually designed so that you can't rainbow-table it.

Give me a string that starts with "xyz" and hashes to "000..."

That can't be rainbow-tabled, as the server can force a different salt every time.

(Note: I don't know the exact algorithms involved, just the general theory.)
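
A toy Python sketch of that general theory (an assumed scheme, per the note above — not the real algorithm): because the server picks a fresh random salt per request, a nonce precomputed for one salt is worthless for the next, which is exactly what kills rainbow tables.

```python
import hashlib
import secrets

def verify(salt: str, nonce: int) -> bool:
    """Accept a nonce whose salted hash starts with four zero hex digits."""
    return hashlib.sha256(f"{salt}{nonce}".encode()).hexdigest().startswith("0000")

def solve(salt: str) -> int:
    """Brute-force a nonce that is valid for this specific salt."""
    nonce = 0
    while not verify(salt, nonce):
        nonce += 1
    return nonce

salt_a = secrets.token_hex(16)  # fresh random salt for one request
salt_b = secrets.token_hex(16)  # fresh random salt for the next
nonce = solve(salt_a)
print(verify(salt_a, nonce))  # True
print(verify(salt_b, nonce))  # almost certainly False: the precomputed work is useless
```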

[-] blindsight@beehaw.org 3 points 5 days ago

You can uniquely salt every request trivially, so rainbow tables are effectively useless.

[-] cassandrafatigue@lemmy.dbzer0.com 3 points 4 days ago* (last edited 4 days ago)

You gotta do some math problems before you're allowed to play outside.

But for computer.

[-] zaknenou@lemmy.dbzer0.com 7 points 4 days ago

Always interesting software choices on our ship. As the motto says:

Be Weird, Download a Car, Generate Art, Screw Copyrights, Do Maths

[-] Grainne@lemmy.dbzer0.com 7 points 4 days ago

Thanks for all the hard work!

[-] atkdef@lemmy.dbzer0.com 14 points 5 days ago

I wonder if this change affects Lemmy clients? I use Voyager and it currently works fine.

[-] db0@lemmy.dbzer0.com 15 points 5 days ago

It shouldn't. So long as you can run JavaScript, it should work everywhere.

[-] tenchiken@lemmy.dbzer0.com 5 points 5 days ago

Most clients use an API connection that's specific to Lemmy. It would be extra work to make scrapers speak that language, so no scraper does.

For the moment, a scraper trying to hit any API endpoint would just get a simple malformed-request error. With any luck it stays this way, so we don't have to protect the API directly.

[-] olafurp@lemmy.world 11 points 5 days ago

This is amazing, this type of anti-bot access should be rolled out everywhere. I wouldn't mind my battery life being cut by 10% just to access bot free content.

[-] veniasilente@lemmy.dbzer0.com 1 points 2 days ago

I would, however. I don't know if electricity at the correct voltage and amperage just grows on trees up there in the US, but in the rest of the world we have to pay for electricity, and consuming more of it also means greater damage to our local environment, which is already preyed upon by northern-hemisphere corporations.

Not to mention, it effectively raises our power bills for no new gain, which comes at a very bad time given recent scandals (up to a Constitutional Accusation Summons) over how energy-transmission costs are billed to users in my country. Besides all the local cost increases, it functions, mechanically, not unlike rent-seeking.

[-] null_dot@lemmy.dbzer0.com 5 points 5 days ago

I may be misunderstanding this measure, but I don't think that's going to be mitigated.

If I understand correctly, this requires browsers requesting a page to do a small amount of "work" for no reason other than demonstrating they're willing to do it. As a one-off for devices used by humans, it's barely noticeable. For bots reading millions of pages it's untenable; they'll just move on to easier targets.

However, that only works for bots whose purpose is to ingest large quantities of text.

A bot whose purpose is to make posts, upvote things, or reply to comments is much less sensitive to this measure, because it doesn't need to harvest millions of pages.

[-] olafurp@lemmy.world 1 points 4 days ago

The work threshold would have to be increased, though, and 10s of work to post a comment would be a hard cost to justify.

[-] null_dot@lemmy.dbzer0.com 3 points 4 days ago

That would be an unethical waste of power for legit users.

Great work, as always!

[-] Resonosity@lemmy.dbzer0.com 11 points 5 days ago

Well done, and as expected! Y'all are knocking it out

[-] whotookkarl@lemmy.dbzer0.com 8 points 5 days ago

It seems much better to me today than it did over the weekend, no error or timeout responses today. Thanks!
