[-] interdimensionalmeme@lemmy.ml 2 points 3 days ago

This would not be a problem if one bot scraped once and the result were then mirrored to everyone on Big Tech's dime (Cloudflare, Tailscale). But since they are all competing now, each thinks its edge will be its own, better scraper setup, and they won't share.

Maybe there should just be a web-to-torrent bridge, so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.
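Mechanically that could be quite simple: package the public data periodically and publish a torrent of it, so the origin only seeds once. A minimal sketch of the publishing side, assuming the `torf` library and a hypothetical `dump/` directory the site has already exported:

```python
# Sketch: publish a site dump as a torrent so the swarm, not the origin
# server, serves repeat downloads. Assumes the `torf` library
# (pip install torf) and that ./dump/ already holds the exported data.
from torf import Torrent

torrent = Torrent(
    path="dump",  # directory containing the exported public data
    trackers=["udp://tracker.opentrackr.org:1337/announce"],
    comment="Nightly public-data dump",
)
torrent.generate()             # hash the pieces
torrent.write("dump.torrent")  # serve this tiny file over HTTP;
                               # peers fetch the actual data via BitTorrent
```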

[-] deadcade@lemmy.deadca.de 2 points 3 days ago

No, it'd still be a problem; rendering every diff between commits to the web is expensive, even if "only one company" is scraping it, "only one time". Many of these applications are designed for humans, not scrapers.

[-] interdimensionalmeme@lemmy.ml 0 points 3 days ago

If rendering data for scrapers were really the problem, then the solution is simple: just offer downloadable dumps of the publicly available information. That would be extremely efficient, costing fractions of a penny in monthly bandwidth, and the data would be far more usable for whatever they are using it for.
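For a git forge, such a dump can be as cheap as a nightly `git bundle` dropped into a static directory. A rough sketch, with hypothetical paths (`git bundle` itself is standard git):

```python
# Sketch: dump a repository's full history into a single file with
# `git bundle`, then serve it as a plain static file so scrapers never
# touch the expensive diff-rendering web UI. Paths are hypothetical.
import subprocess
from pathlib import Path

REPO = Path("/srv/git/project.git")          # hypothetical bare repo
OUT = Path("/srv/www/dumps/project.bundle")  # served as a static file

OUT.parent.mkdir(parents=True, exist_ok=True)
subprocess.run(
    ["git", "-C", str(REPO), "bundle", "create", str(OUT), "--all"],
    check=True,
)
# A scraper can then reconstruct everything locally:
#   git clone project.bundle project
```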

The real problem is trying to have freely available data while the host still retains the ability to leverage that data later.

I don't think we can have both of these.
