I was wrong about robots.txt (evgeniipendragon.com)

submitted 2 weeks ago by KarlHeinzSchwuke@feddit.org to c/technology@lemmy.world


[-] INeedMana@piefed.zip 0 points 2 weeks ago

Huh. So in this case, the file actually is respected. Refreshing

[-] ell1e@leminal.space 1 points 2 weeks ago* (last edited 2 weeks ago)

Often it is respected, but the resulting problem is that platforms bundle their regular crawlers with questionable AI scraping to blackmail websites into feeding AI.

For example, Googlebot if enabled won't just list you for search, but will also scrape your contents for Google's AI. Edit: see https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ as a source. I imagine LinkedInBot, given it's Microsoft, will feed some other AI of theirs as well, on top of the previews.

Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn't going to improve.

[-] General_Effort@lemmy.world 0 points 2 weeks ago

> Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI.

False.

[-] ell1e@leminal.space 0 points 2 weeks ago* (last edited 2 weeks ago)

See here: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ If you have a source that says it's false, I'd be curious.

[-] General_Effort@lemmy.world 0 points 2 weeks ago* (last edited 2 weeks ago)

Ok. That quotes a tweet by Cloudflare's CEO. IDK what his qualifications are, but his conflict of interest is obvious enough. Real quality journalism there.

ETA: I looked at what the Cloudflare CEO said again. To be fair to him, he is not actually claiming that Googlebot collects AI training data. He's talking about the AI overview, which is a search feature. The data for search features is collected by Googlebot. I'm not sure why someone would want their link listed in search but not appear much more prominently in the AI overview.

Here's Google technical documentation on its crawlers: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers

[-] ell1e@leminal.space 1 points 2 weeks ago* (last edited 2 weeks ago)

So what's the quote from the documentation that backs up your claim? The line "perform other product specific crawls" seems extremely vague by design.

[-] General_Effort@lemmy.world 0 points 2 weeks ago

I'm not really sure what you are asking here. Did you notice that you can scroll down and see a list of their crawlers?

[-] ell1e@leminal.space 1 points 2 weeks ago* (last edited 2 weeks ago)

Nothing on this page seems to contradict the article. But if I simply missed the part that does, I'd be happy to learn.

[-] General_Effort@lemmy.world 0 points 2 weeks ago

You look up what Googlebot does. No AI.

You want to know what crawlers do AI? Just search for "AI", or "training", or some such, or skim through. It's not long. Google-Extended collects training data. Note that Google-Extended is explicitly not used to rank pages.
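To make the Googlebot vs. Google-Extended split concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content and the `example.com` URL are made up for illustration, and the result only shows what a crawler that honors robots.txt would conclude; it proves nothing about what any crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: stay in search, opt out of the AI training token.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
url = "https://example.com/post"  # assumed URL, not from the thread

# Each token is evaluated against its own group of rules.
googlebot_allowed = parser.can_fetch("Googlebot", url)
extended_allowed = parser.can_fetch("Google-Extended", url)

print(googlebot_allowed, extended_allowed)  # True False
```

With rules like these, a compliant crawler would treat the page as open to Googlebot and closed to Google-Extended.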

Did that help?

[-] ell1e@leminal.space 2 points 2 weeks ago* (last edited 2 weeks ago)

> You look up what Googlebot does. No AI.

The page seems written to suggest that, but it doesn't explicitly say the other bots can't feed into some other sort of AI training. It would be in Google's interest to mislead users here.

Edit: I found a quote where it says Googlebot does both in one: "Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent [...]" and I guess Cloudflare doesn't trust Google to abide by the access controls. That seems sensible to me.
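A rough Python sketch of what that quote implies for site operators, assuming a commonly seen Googlebot user agent string: a server-side filter that looks for "Google-Extended" in the request header never fires, because no request carries that token. The `naive_ai_block` helper is hypothetical and only for illustration.

```python
# Commonly seen Googlebot request header, used here as an assumed example.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def naive_ai_block(user_agent: str) -> bool:
    """Hypothetical filter: block only if the header names the AI token."""
    return "google-extended" in user_agent.lower()


# The token never appears in the header, so this filter cannot tell the
# two uses apart; only the robots.txt rule expresses the opt-out.
blocked = naive_ai_block(GOOGLEBOT_UA)
print(blocked)  # False
```

So the opt-out rests entirely on Google honoring the robots.txt token, which is the trust question raised here.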


this post was submitted on 17 Jul 2025

9 points (90.9% liked)
