Technology
This is the official technology community of Lemmy.ml for all news related to the creation and use of technology, and to facilitate civil, meaningful discussion around it.
Ask in a DM before posting product reviews or ads; such posts are otherwise subject to removal.
Rules:
1: All Lemmy rules apply
2: No low-effort posts
3: NEVER post naziped*gore stuff
4: Always post article URLs or their archived version URLs as sources, NOT screenshots. This helps blind users.
5: Personal rants about Big Tech CEOs like Elon Musk are unwelcome (this does not include posts about their companies affecting a wide range of people)
6: No advertisement posts unless verified as legitimate and non-exploitative/non-consumerist
7: Crypto-related posts, unless essential, are disallowed
Why would anyone crawl Wikipedia when you can freely download the complete database dumps in one go, likely served from a CDN...
But sure, crawlers, go ahead and spend a week doing the same thing in a much more expensive, disruptive, and error-prone way...
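For anyone curious, pulling a dump is about this much code. A rough sketch only: the dumps.wikimedia.org path and filename below are the usual ones, but check the dump index for what's actually current.

```python
# Rough sketch: stream-download the latest English Wikipedia articles dump.
# Assumes the usual dumps.wikimedia.org layout; the exact filename may differ.
import requests

DUMP_URL = (
    "https://dumps.wikimedia.org/enwiki/latest/"
    "enwiki-latest-pages-articles.xml.bz2"
)

def download_dump(url: str, dest: str) -> None:
    # Stream to disk so the multi-GB file is never held in memory.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                f.write(chunk)

if __name__ == "__main__":
    download_dump(DUMP_URL, "enwiki-latest-pages-articles.xml.bz2")
```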
There are valid reasons for not wanting the whole database, e.g. storage constraints, compatibility with ETL pipelines, and incorporating article updates.
What bothers me is that they apparently crawl instead of just... using the API, like:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Lemmy_%28social_network%29&formatversion=2
I'm guessing they just crawl the whole web and don't bother to add a special case to turn Wikipedia URLs into their API versions.
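For comparison, hitting that parse endpoint from a script is trivial. A sketch, with the page title just taken from the example URL above:

```python
# Sketch: fetch one article's parsed HTML via the MediaWiki action API
# instead of scraping the rendered page. The page title is only an example.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_article_html(title: str) -> str:
    params = {
        "action": "parse",
        "format": "json",
        "page": title,
        "formatversion": 2,
    }
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["parse"]["text"]  # rendered HTML of the article body

if __name__ == "__main__":
    html = fetch_article_html("Lemmy (social network)")
    print(html[:200])
```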
If you're training AI models, surely you have a couple TB to spare. It's not like Wikipedia takes up petabytes or anything.
Vibe coding
My comment was based on a podcast I listened to (Tech Won't Save Us, I think?). My guess is they also wanna crawl all the edits, discussions, etc., which are usually not included in the complete dumps.
Dumps with complete page edit history can be downloaded too, as far as I can see, so no need to crawl that.
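And once you have a history dump, a streaming XML parser can walk every revision without loading the whole file. A sketch assuming the standard MediaWiki XML export layout (<page> containing <title> and one <revision> per edit); the filename is just a placeholder:

```python
# Sketch: count revisions per page in a pages-meta-history dump using a
# streaming parser, so the (huge) file never has to fit in memory.
import bz2
import xml.etree.ElementTree as ET

def revision_counts(dump_path: str):
    with bz2.open(dump_path, "rb") as f:
        title, revisions = None, 0
        for event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace
            if tag == "title":
                title = elem.text
            elif tag == "revision":
                revisions += 1
            elif tag == "page":
                yield title, revisions
                title, revisions = None, 0
                elem.clear()  # free memory for the finished page

if __name__ == "__main__":
    # Placeholder filename; history dumps are split across many files.
    for page_title, count in revision_counts("enwiki-latest-pages-meta-history1.xml.bz2"):
        print(f"{page_title}: {count} revisions")
```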
Good podcast