Websites have no interest in banning VPNs and excluding visitors for its own sake. The fact is that VPNs are a conduit for spam, bots and, more rarely, hacking, so hosts will protect themselves. Self-defence.
How does denying read access to static content defend a website?
Topical answer: Bots going around scraping content to feed into some LLM dataset without consent. If the website is anything like Reddit they'll be trying to monetise bot access to their content without affecting regular users.
It should be easy to distinguish a bot from a real user though, shouldn't it?
Nope. It gets more difficult every single day. It used to be easy: just check the user agent string. Real users will have a long one that says what browser they're using. Bots won't have one at all, or will have one that mentions the underlying scraping library they're using.
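To make that concrete, here's a minimal sketch of that old-school check in Python. The keyword list and the "no user agent means bot" rule are illustrative assumptions, not any real site's actual filter.

    # Naive server-side check: flag requests whose User-Agent header is
    # missing or mentions a known scraping library. Keyword list is illustrative.
    SCRAPER_KEYWORDS = ("python-requests", "curl", "scrapy", "wget", "httpclient")

    def looks_like_bot(headers: dict) -> bool:
        ua = headers.get("User-Agent", "").lower()
        if not ua:
            return True  # no user agent at all: almost certainly a script
        return any(keyword in ua for keyword in SCRAPER_KEYWORDS)

    # Default python-requests traffic announces itself in the UA string.
    print(looks_like_bot({"User-Agent": "python-requests/2.31.0"}))  # True
    print(looks_like_bot({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                                        "Chrome/120.0.0.0 Safari/537.36"}))  # False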
But then bot makers wised up. Now they just copy the latest browser user agent string.
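For illustration, spoofing is a one-liner with the Python requests library; the UA string and URL below are placeholders.

    # Any client can send whatever User-Agent it likes, which defeats the
    # check above. URL and UA string are placeholders.
    import requests

    SPOOFED_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36")

    resp = requests.get("https://example.com/", headers={"User-Agent": SPOOFED_UA})
    print(resp.status_code)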
Used to be that you could use mouse cursor movements to build heat maps and figure out whether it's a real user. Then some smart aleck went and wrote a basic script to replay his own cursor movements and broke that.
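As a rough idea of what the cursor-based detection was looking for, here's a toy Python heuristic that flags paths that are suspiciously straight. Real systems use far richer signals; the function names and threshold here are made up for illustration.

    # Toy heuristic: a scripted cursor tends to move in a near-perfect
    # straight line, while a human path wobbles. Threshold is illustrative.
    import math

    def path_straightness(points):
        """Ratio of straight-line distance to total travelled distance (1.0 = perfectly straight)."""
        if len(points) < 2:
            return 1.0
        total = sum(math.dist(points[i], points[i + 1]) for i in range(len(points) - 1))
        direct = math.dist(points[0], points[-1])
        return direct / total if total else 1.0

    def looks_scripted(points, threshold=0.999):
        return path_straightness(points) >= threshold

    scripted = [(i, 2 * i) for i in range(50)]                      # perfect line
    human = [(i, 2 * i + (3 if i % 7 else -4)) for i in range(50)]  # wobbly line
    print(looks_scripted(scripted))  # True
    print(looks_scripted(human))     # False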
Oh, and then someone created a machine learning model to learn that behavior too and broke that even more.
Good point, thank you. Uh... beep!
Unfortunately not. The major difference between an honest bot and a regular user is a single text string (the user agent). There's no reason bots have to be honest, though, and anyone can modify their user agent. You can go further and use something like Selenium to make your bot appear even more like a regular user, including random human-like mouse movements. There's also a plethora of tools for fooling captchas now. It's getting harder by the day to differentiate the two.
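To show what "something like Selenium" means in practice, here's a hedged sketch: start a real browser with an overridden user agent and move the cursor in small, randomly timed steps instead of one straight jump. The URL, offsets and timings are placeholder assumptions, not a recipe for any real site.

    # Sketch of a bot dressed up to look human: real browser, spoofed UA,
    # jittery cursor movement. Everything site-specific here is a placeholder.
    import random
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.action_chains import ActionChains

    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/120.0.0.0 Safari/537.36")

    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com/")

    # Many small, irregularly paced moves rather than one straight jump.
    actions = ActionChains(driver)
    for _ in range(20):
        actions.move_by_offset(random.randint(1, 10), random.randint(1, 10))
        actions.pause(random.uniform(0.05, 0.2))
    actions.perform()

    print(driver.page_source[:200])
    driver.quit()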