AI crawlers cause Wikimedia (the umbrella organization of Wikipedia and a dozen or so other crowdsourced knowledge projects) Commons bandwidth demands to surge 50%. (diff.wikimedia.org)

submitted 8 months ago* (last edited 8 months ago) by Tea@programming.dev to c/technology@lemmy.world


[-] thisbenzingring@lemmy.sdf.org 176 points 8 months ago

what assholes .. just fucking download the full package and quit hitting the URL

[-] cm0002@lemmy.world 103 points 8 months ago

Right‽ This is ridiculously stupid when you can download the entirety of Wikipedia in a single package and parse it to your heart's desire
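
For anyone wondering what "download the package and parse it" could look like in practice, here is a minimal sketch. It assumes the dump is a MediaWiki XML export in which each `<page>` element contains a `<title>` and a revision `<text>`; the function name `iter_pages` and the file name in the comment are placeholders, not anything Wikimedia ships. It streams the file so the whole dump never has to fit in memory.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the code does not depend on the export schema version."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export, one page at a time."""
    title = text = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title = text = None
            elem.clear()  # free the finished page's subtree


# Hypothetical usage against a locally downloaded, bz2-compressed dump:
#
#   import bz2
#   with bz2.open("dump.xml.bz2", "rb") as f:   # placeholder file name
#       for title, text in iter_pages(f):
#           print(title, len(text))
```

One download plus a local loop like this replaces millions of individual page requests against the live site.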

[-] TheTechnician27@lemmy.world 71 points 8 months ago* (last edited 8 months ago)

Not only that, but we make it goddamn trivial for not just Wikipedia but for other Wikimedia projects. Doing this is just stealing without attribution and share-alike like the CC BY-SA 4.0 license demands and then on top of that kicking down the ladder for people who actually want to use Wikimedia and not the hallucinatory slop they're trying to supplant it with. LLM companies have caused incalculable damage to critical thinking, the open web, the copyleft movement, and the climate.

[-] ChaoticCookie@sh.itjust.works 15 points 8 months ago

Yay interrobang :D

[-] Glitchvid@lemmy.world 23 points 8 months ago

The amount of stupid AI scraping behavior I see even on my small websites is ridiculous, they'll endlessly pound identical pages as fast as possible over an entire week, apparently not even checking if the contents changed. Probably some vibe coded shit that barely functions.
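
A well-behaved crawler can avoid re-downloading unchanged pages by using HTTP conditional requests. The sketch below shows only the bookkeeping, under the assumption that the server sends `ETag` and/or `Last-Modified` headers; the helper names and the shape of the cache dict are made up for illustration, and the actual network call is left to whatever HTTP client you use.

```python
def conditional_headers(cached):
    """Build request headers from validators saved on a previous fetch."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def update_cache(cached, status, response_headers, body):
    """Return the cache entry to keep after a response.

    A 304 Not Modified response has no body, so the stored copy is reused.
    Any other response replaces the entry with the new body and validators.
    """
    if status == 304:
        return cached
    return {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
        "body": body,
    }
```

With this in place, a repeat visit to an unchanged page costs the server a few hundred bytes of headers instead of the full page.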

[-] gravitas_deficiency@sh.itjust.works 11 points 8 months ago

If I was running infra for them, I’d just start blacklisting abusive IPs without warning

[-] krigo666@lemmy.world 113 points 8 months ago

Laws should be passed in all countries requiring AI crawlers to request permission before crawling a target site. I have no pity for AI "thieves" that get their models poisoned. F...ing plague, as if the adware and spyware weren't enough...

[-] catloaf@lemm.ee 19 points 8 months ago

An HTTP request is a request. Servers are free to rate limit or deny access

[-] FaceDeer@fedia.io 18 points 8 months ago

And Wikimedia, in particular, is all about publishing data under open licenses. They want the data to be downloaded and used by others. That's what it's for.

[-] taladar@sh.itjust.works 12 points 8 months ago

Rate limiting in itself requires resources that are not always available. For one thing you can only rate limit individuals you can identify so you need to keep data about past requests in memory and attach counters to them and even then that won't help if the requests come from IPs that are easily changed.
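
To make the memory cost concrete, here is a minimal sketch of a fixed-window limiter keyed by client IP. It is an illustration only, not how Wikimedia or any particular server does it; the class name, the `limit`/`window` parameters, and the injectable `clock` are all assumptions chosen for the example.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per key in each `window`-second window."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        # key -> [window_start, count]. One entry per client seen:
        # this dict is the per-client state the comment above refers to.
        self.counters = {}

    def allow(self, key):
        now = self.clock()
        entry = self.counters.get(key)
        if entry is None or now - entry[0] >= self.window:
            # First request from this key, or the old window expired.
            self.counters[key] = [now, 1]
            return True
        if entry[1] < self.limit:
            entry[1] += 1
            return True
        return False
```

Every new IP gets a fresh entry and a fresh allowance, so a crawler rotating through many addresses both slips past the limit and grows the table, which is exactly the weakness described above.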

[-] chrash0@lemmy.world 18 points 8 months ago

i doubt the recent uptick in traffic is from “stealing data” for training but rather from agents scraping them for context, eg Edge Copilot, Google’s AI search, SearchGPT, etc.

poisoning the data will likely not help in this situation since there’s a human on the other side that will just do the same search again given unsatisfactory results. like how retries and timeouts can cause huge outages for web scale companies, poisoning search results will likely cause this type of traffic to increase and further increase the chances of DoS and higher bandwidth usage.

[-] TheBlackLounge@lemm.ee 7 points 8 months ago

So? Break context scrapers till they give up, on your site or completely.

[-] chrash0@lemmy.world 2 points 8 months ago

easily said

[-] cupcakezealot@lemmy.blahaj.zone 59 points 8 months ago

wikipedia should install ai mazes on their servers

[-] inbeesee@lemmy.world 1 points 8 months ago

Are there alternatives besides Cloudflare's solution?

[-] skulblaka@sh.itjust.works 1 points 8 months ago

Nepenthes does about the same thing but isn't managed by a corp.

[-] cdkg@lemm.ee 39 points 8 months ago

We should stop using ai

[-] andybytes@programming.dev 15 points 8 months ago

I still struggle with a use case for artificial intelligence in my own life. I play around with it all and I'm just like, it doesn't do a good job. Also, I think humanity is missing the plot, you know? Like, we don't need government. If government isn't going to do government. Government serves the people, not corporations. Or at least it should. I don't know, I think we're entering in times. At some point, I think people will pray for nuclear war, because life will be so miserable. That it would be better than just to end it all.

[-] Spaniard@lemmy.world 4 points 8 months ago* (last edited 8 months ago)

Like, we don't need government.

Welcome to anarchism. Now you have to pick your flavor! Anarcho-syndicalism, anarcho-communism, anarcho-capitalism, Christian anarchism, and the list goes on!

I found LLMs helpful for developing some scripts and answering some simple trivial questions (like how house property works in China). I could have looked for that in a regular search engine though. But that's it, I am still happy looking for things myself and investigating since you can't really trust their answers.

[-] kuberoot@discuss.tchncs.de 1 points 8 months ago

At some point, I think people will pray for nuclear war, because life will be so miserable.

Reminds me of Roll out the Fallout by The Chalkeaters

[-] andybytes@programming.dev 5 points 8 months ago

Even better, stop joining major platforms like social media and then they won't be able to create data sets. Be a leech, especially when they give it away for free, but don't contribute to the project. Understand how it works, sure. But it seems like most of humanity says they don't want something, yet they do the contrary. It's like we choose to comply before we're even asked to comply, for fear of missing out. But if you look at what is today, what are you really missing out on?

[-] pulsewidth@lemmy.world 9 points 8 months ago

So, uh. What about Lemmy?

They can also crawl this publicly-accessible social media source for their data sets.

I'm on board with abandoning mainstream social media, but my point is that your suggestion would not solve the problem, just relocate it. A better solution to the AI conglomerates stealing everyone's data from the open Internet is legislation and regulation, i.e. tackling the whole 'stealing data' component, along with stronger privacy regulations for everyone to make it harder for them to do the same in the future. It's nice seeing the EU taking some positive steps, but we will not see the US take any steps in that direction anytime soon, due to corporate capture of their politicians and the AI companies all being in the top 10 most wealthy companies in the US.

[-] Spaniard@lemmy.world 8 points 8 months ago* (last edited 8 months ago)

It's nice seeing the EU taking some positive steps

Yet they helped introduce the super cookies and are trying to end encryption on communications.

[-] Saik0Shinigami@lemmy.saik0.com 1 points 8 months ago

They can also crawl this publicly-accessible social media source for their data sets.

Crawling would be silly. They can simply set up a Lemmy node and subscribe to every other server. An ActivityPub crawler would be much more efficient, as it wouldn't accidentally crawl things that haven't changed but could instead read the ActivityPub updates.

[-] collapse_already@lemmy.ml 35 points 8 months ago

And the quality of the AI output sucks. I was recently looking for information about positive convention for yaw, pitch, and roll in aircraft. I was looking at az and yaw and got reasonable results from the AI, but when I looked at pitch and el all of the results were about elevator pitches. Even when I spelled out elevation it insisted on elevator pitches. I scroll past the AI results as a matter of principle, but I usually look at them so I have something specific to complain about when people ask why I am so virulently anti-AI.

[-] bassomitron@lemmy.world 12 points 8 months ago

AI is useful for basic, mundane tasks and that's about it. Trying to force it to be some sort of Uber search engine is such a bad idea.

[-] GnuLinuxDude@lemmy.ml 10 points 8 months ago

The other day I tried to have it help me with a programming task on a personal project. I am an experienced programmer, but I only "get by" in Python (typically just by looking up the documentation for the standard library). I thought, "OK. This is it. I will ask Llama 3.3 and GPT4 for help."

That shit literally set me back a weekend. It gave me such bad approaches and answers, that I could tell were bad (aforementioned experience in programming, degree in comp sci, etc) that I got confused about writing Python. Had I just done what I usually do, which is to look up the documentation and use my brain, I would have gotten my weekend task done a whole weekend sooner.

It scares me to think what people are doing to themselves by relying on this, especially if they're novices.

[-] BakerBagel@midwest.social 6 points 8 months ago

I recently started as a graphic designer despite knowing absolutely nothing about it, so I am constantly searching how to do stuff in Adobe suite at work. Half the time Google's AI can't even keep "Cmnd" and "ctrl" straight, telling me to use "cmnd+shift+H" on Windows or "ctrl+shift+H" on Mac. I don't even know how it botches that, but it does it about 25% of the time.

[-] reiterationstation@lemm.ee 1 points 8 months ago

Yea that’s a bad example of what to use ai for at least right now. You’re going to get bad results with that question.

It’s good for things, if you pay.

[-] collapse_already@lemmy.ml 7 points 8 months ago

I don't want to ask ai. Google automatically gives me ai search results that are piss poor. Those useless results still use energy to generate.

[-] Telorand@reddthat.com 23 points 8 months ago

AI: The "pen that can write in zero gravity" when pencils exist.

[-] Alk@sh.itjust.works 41 points 8 months ago

Well I get the analogy, but also I think they didn't use pencils because of the graphite and complications with filtering air or something.

[-] catloaf@lemm.ee 25 points 8 months ago

Graphite is conductive. A short circuit and fire are Very Bad.

[-] AlternatePersonMan@lemmy.world 3 points 8 months ago

Couldn't you just use a charcoal pencil or crayon instead?

[-] catloaf@lemm.ee 20 points 8 months ago* (last edited 8 months ago)

Yes, but neither of those write as cleanly. And both are still prone to fragmenting, even if the fragments aren't conductive.

[-] Telorand@reddthat.com 8 points 8 months ago

You may be right. It's just easier to get the sentiment across that way than expound about how it's ridiculously complex and overbuilt to achieve menial results.

[-] Buelldozer@lemmy.today 13 points 8 months ago

This pen / pencil thing has been corrected so many times for so many decades that it's ludicrous people are still bringing it up.

https://www.scientificamerican.com/article/fact-or-fiction-nasa-spen/

Random bits of pencil lead floating around in a high tech environment is such a poor idea that even the Soviets quit using pencils once Fisher's Space Pen was available. A pen which Fisher itself paid to develop and then sold to both NASA and the Soviet Space Program.

[-] Telorand@reddthat.com 1 points 8 months ago

Yeah, I know it's not precisely correct, but it's a fable that's commonly understood as an example of over-engineering. I'm open to better and more factual examples, if you have any!

[-] hakunawazo@lemmy.world 6 points 8 months ago

Pen, pencil. Both are dangerous in the wrong hands.

[-] andybytes@programming.dev 19 points 8 months ago

This is an example of corporate terrorism sponsored by our own government. Elon Musk loves to see himself as the villain in Ready Player One. And this is not a joke you can look it up. Big tech is waging war against American citizens, and no longer do we have any control of our government, and the Democrats will not save us. The electoral processes will not save us. This is just hard for some people to accept, that's why things have to fall apart before they get a clue. Unfortunately, those that are wiser are going to feel the flames first.

[-] pelespirit@sh.itjust.works 9 points 8 months ago

Support Wikipedia! They're awesome and are our backbone.

[-] andybytes@programming.dev 3 points 8 months ago

Artificial intelligence is really just a Swiss Army knife version of a search engine. It serves the interest of the ruling elite and is a way the uber wealthy can make America stupid again or even dumber than dumb. Like we move forward fast but we always miss the point and we're always putting out fires that once were easily manageable. The code base is growing, and we're all saying a lot of nothing. It's really just drowning out voices and good ideas just to centralize everything. It's about power. It's about control. It's about the man behind the curtain in The Wizard of Oz. I like technology a lot and I would like it to be in the hands of working people and not the uber wealthy. So I am no techno optimist at all, not until we fix that issue.

this post was submitted on 02 Apr 2025

589 points (99.3% liked)
