
Sites scramble to block ChatGPT web crawler after instructions emerge (arstechnica.com)

submitted 2 years ago by UngodlyAudrey@beehaw.org to c/technology@beehaw.org


[-] Tibert@compuverse.uk 41 points 2 years ago

Like it is useful... OpenAI already got all the useful info out of the websites.

Tho maybe for the sites generating new content it may have a use. But all the content before that is already lost to chatgpt.

[-] FaceDeer@kbin.social 29 points 2 years ago

"Lost to ChatGPT" is a weird way of putting it. The content is still there, nothing's happened to it.

[-] acastcandream@beehaw.org 35 points 2 years ago

You know what he means. My data is “lost” to google, I’m sure you’d agree with that.

[-] mp3@lemmy.ca 26 points 2 years ago

Lemmy.ca added a block at the nginx level for it

https://lemmy.ca/comment/1999439

# curl -H 'User-agent: GPTBot' https://lemmy.ca/ -i
HTTP/2 403
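The same idea can be sketched at the application level. This is not Lemmy.ca's actual nginx configuration (the thread does not show it); it is a minimal Python WSGI middleware, using the hypothetical names `block_user_agents` and `BLOCKED_AGENTS`, that answers 403 whenever the User-Agent header contains `GPTBot`.

```python
# Hypothetical sketch: refuse requests from selected crawlers by User-Agent.
# BLOCKED_AGENTS and block_user_agents are illustrative names, not a library API.
BLOCKED_AGENTS = ("gptbot",)


def block_user_agents(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app so that blocked crawlers get a 403 response."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked):
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
            )
            return [body]
        # Everyone else is passed through to the real application.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Stand-in for the real site."""
    body = b"Hello\n"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]


def status_for(user_agent):
    """Call the wrapped app directly and return the status line it sends."""
    seen = []
    app = block_user_agents(hello_app)
    app({"HTTP_USER_AGENT": user_agent}, lambda status, headers: seen.append(status))
    return seen[0]
```

Calling `status_for("GPTBot")` plays the role of the curl check above. A User-Agent match only stops crawlers that identify themselves honestly.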

[-] thebardingreen@lemmy.starlightkel.xyz 3 points 2 years ago

Hilariously, unless ALL lemmy instances do this, anyone that federates with you will have to block it too or any communities they sync with you will be available on their instances...

[-] mp3@lemmy.ca 5 points 2 years ago

I know but at this point you do what you can.

[-] On@kbin.social 24 points 2 years ago

Is it possible that they offloaded the scraping to a different company to avoid direct litigation now that they're out in the open? To say "we didn't scrape your website, and you can't prove it."

Like how DDG, Ecosia, and Qwant use Bing for their data, or how the feds buy data from data brokers. Outsource the dirty job like every tech company does and shift the blame if caught doing something unlawful.

It seems they are trying to garner some positive PR after they scraped through everything without anyone noticing.

[-] sxan@midwest.social 4 points 2 years ago

I absolutely believe a lot of companies outsource simply because they don't want to build an internal organ to do it. Even in government, despite what Conservatives believe, most organization heads are pretty focused on core competency and press to use outsourced resources. The latter is also promoted by heavy lobbying from the companies selling the services.

This is a situation of "never attribute to malice that which can be easily explained by stupidity." Sure, some are motivated by malice or subterfuge, but most are probably just buying services because they have other things they'd rather focus on.

[-] TehPers@beehaw.org 2 points 2 years ago* (last edited 2 years ago)

Why would they be concerned about litigation? As far as I know, scraping is completely legal in most/all countries (including the US, which I'm more familiar with and they're headquartered out of), as long as you're respecting copyright and correctly handling PII (which they claim to be making an effort on).

[-] ashtrix@lemmy.ca 21 points 2 years ago

Yeah, it's already too late. Why didn't they provide this before they already scraped websites?

[-] p03locke@lemmy.dbzer0.com 15 points 2 years ago

You think Google thought about robots.txt before they developed their search engine? Nah, it's all public Internet, and they scraped away.

A non-zero percentage of web sites will bother to follow these instructions, but it might as well be zero.

[-] scrubbles@poptalk.scrubbles.tech 8 points 2 years ago

Yeah, I always assumed robots.txt only told them to hide it from search results, but Google still scrapes everything they can from you. It's just the illusion that they skipped over you.

[-] drwho@beehaw.org 2 points 2 years ago

If you look in the server logs, you can see what their spiders are grabbing.
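A minimal sketch of that check, assuming access logs in the common "combined" format used by default in many nginx and Apache setups. The function name `crawler_hits` and the token list are illustrative, not from the thread.

```python
import re
from collections import Counter

# Pulls the request path and the trailing quoted User-Agent out of a
# "combined" format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Illustrative list of crawler name fragments to look for.
CRAWLER_TOKENS = ("Googlebot", "bingbot", "GPTBot")


def crawler_hits(lines, tokens=CRAWLER_TOKENS):
    """Count (crawler, path) pairs for lines whose User-Agent names a listed crawler."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for token in tokens:
            if token.lower() in agent:
                hits[(token, match.group("path"))] += 1
    return hits
```

Feeding it the lines of an access log (for example `crawler_hits(open(path))`, with `path` pointing at your own log file) gives a per-crawler tally of which URLs were requested. User-Agent strings can be spoofed, so this shows what claims to be a given spider.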

[-] abhibeckert@beehaw.org 1 points 2 years ago* (last edited 2 years ago)

No you've got it backwards.

Robots.txt absolutely stops Google from scraping your site.

But they can still learn enough by scraping other sites that link to yours to build a concrete picture of the contents of your website and they will use that info to populate search results that link to you.

If you don't want to appear in search results, then you need to tell Google which pages to hide, and to tell them that you have to allow them to scrape your site.
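One way to "tell Google which pages to hide" is the `noindex` rule, which Google documents as either a robots meta tag or an `X-Robots-Tag` response header. The sketch below is a hypothetical Python WSGI wrapper (`add_noindex` is an illustrative name) that adds the header to every response; it only has an effect if the crawler is allowed to fetch the page and see it.

```python
def add_noindex(app):
    """Wrap a WSGI app so every response carries 'X-Robots-Tag: noindex'."""

    def middleware(environ, start_response):
        def start_with_noindex(status, headers, exc_info=None):
            # Append the header without mutating the caller's list.
            headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, start_with_noindex)

    return middleware
```

This is the trade-off described above: a robots.txt `Disallow` keeps the crawler out, while `noindex` needs the crawler to come in and read it.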

[-] drwho@beehaw.org 3 points 2 years ago

Very early on, at least, their spiders respected robots.txt.

I know there are folks that have all of the Big G in their robots.txt files on principle, might want to ask them if it works or not.

[-] chameleon@kbin.social 3 points 2 years ago

I do and I can confirm there are no requests (except for robots.txt and the odd /favicon.ico). Google sorta respects robots.txt. They do have a weird gotcha though: they still put the URLs in search, they just appear with a useless description. Their suggestion to avoid that can be summarized as: don't block us, let us crawl and just tell us not to use the result, just trust us! when they could very easily change that behavior to make more sense. Not a single damn person with Google blocked in robots.txt wants to be indexed, and their logic on password protecting kind of makes sense but my concern isn't security, it's that I don't like them (or Bing or Yandex).

Another gotcha I've seen linked is that their ad targeting bot for Google AdSense (different crawler) doesn't respect a * exclusion, but that kind of makes sense since it will only ever visit your site if you place AdSense ads on it.

And I suppose they'll train Bard on all data they scraped because of course. Probably no way to opt out of that without opting out of Google Search as well.

[-] drwho@beehaw.org 1 points 2 years ago

Now that's a dirty trick.

[-] acastcandream@beehaw.org 10 points 2 years ago

I’m guessing this question is rhetorical lol

[-] snowbell@beehaw.org 4 points 2 years ago

There would have been no reason for people to care before they scraped all the websites.

[-] breaks@lemmy.studio 18 points 2 years ago

But for large website operators, the choice to block large language model (LLM) crawlers isn't as easy as it may seem. Making some LLMs blind to certain website data will leave gaps of knowledge that could serve some sites very well (such as sites that don't want to lose visitors if ChatGPT supplies their information for them), but it may also hurt others. For example, blocking content from future AI models could decrease a site's or a brand's cultural footprint if AI chatbots become a primary user interface in the future. As a thought experiment, imagine an online business declaring that it didn't want its website indexed by Google in the year 2002—a self-defeating move when that was the most popular on-ramp for finding information online.

Really curious how this will end up

[-] axibzllmbo@beehaw.org 4 points 2 years ago

That's an interesting point that I hadn't considered. The comparison to Google indexing in the early 2000s may prove to be very apt, given the number of people I've seen using ChatGPT as a search engine.

[-] abhibeckert@beehaw.org 17 points 2 years ago

I'd bet sites blocking ChatGPT will regret it when (not if) Bing starts using it for search engine relevance.

[-] acastcandream@beehaw.org 14 points 2 years ago

Just because you block the GPT crawler doesn’t mean you are no longer indexed.

[-] renard_roux@beehaw.org 3 points 2 years ago

Serious question — you think any amount of AI will make people use Bing? 🤔

[-] Sproux@lemmy.dbzer0.com 5 points 2 years ago* (last edited 2 years ago)

I started using it this year because it's actually been giving me decent results, unlike Google. We're...in a dark timeline

[-] renard_roux@beehaw.org 2 points 2 years ago* (last edited 2 years ago)

Hmm. I have an incredibly strong aversion to everything Microsoft, so even giving Bing a chance is difficult. However, I must admit that I can recognize the part about Google not delivering. I even went so far as to tamper with the CSS recently just to make Google's results slightly easier to parse.

Maybe it's time to try something new 🤔 I just wish the only viable alternative wasn't made by Microsoft 😓🤢

Dark timeline, indeed! 😔

[-] Sproux@lemmy.dbzer0.com 2 points 2 years ago* (last edited 2 years ago)

No kidding, the only other options I've found are

DuckDuckGo, which is serviceable but kind of bad, and

Kagi, which is looking promising and has lots of features I enjoy, but paying for a search engine is a hard sell.

[-] fckgwrhqq2yxrkt@beehaw.org 1 points 2 years ago

Check out Kagi, paid search is extremely worth it. Stop being a product to sell and start being a customer.

[-] abhibeckert@beehaw.org 2 points 2 years ago* (last edited 2 years ago)

It was enough to make me try Bing... which lasted all of about ten seconds (one search) before I ran screaming for the hills back to Duck Duck Go.

So no, I don't think this can make people use Bing - that product has so many problems I'm not sure it will ever be good enough.

Having said that - ChatGPT is really good at interpreting a user search term and equally good at understanding the contents of an arbitrary webpage. It's a perfect tool to build a search engine around, and I can't wait for someone more competent than Bing to do just that.

[-] renard_roux@beehaw.org 1 points 2 years ago* (last edited 2 years ago)

It's incredible just how bad Microsoft is at product design 😳

Every time I visit with parents, or other older-gen acquaintances, I'm the go-to "help me fix this tiny issue with my [software]", and invariably it will be Outlook or Word or some other flavor of Office software, and I just want to scream. Even just looking at that nightmare top bar in Word triggers nausea.

I know I'm just not the target audience, and that there is some degree of obfuscation-by-design going on to enable their meaty underbelly of third party support companies (an entire fucking industry just to deal with their software, which in turn need certification, and so forth). It's still difficult to understand how people can interact with the overwhelming horribleness of it all, day in and day out, and not go on a rage-fueled rampage.

Slightly unrelated, if not tangential — I mentioned in a comment somewhere yesterday that I love MacOS, but I wouldn't touch an iPhone with a ten foot pole.

Look at my shiny iPhone! someone will say, and then shove it in your face. It's the new 2023 edition!

That's nice, you'll reply, but why are the icons so big?

Well you see...

Couldn't you make them smaller, and increase the grid size a bit? 4x6 seems a bit wasteful.

Yeah well the thing is...

Why did you place all the apps at the top of the screen? Isn't that where they'll be the most difficult to reach on a screen that big? Move that app to the bottom of the screen, would you, just to try it out?

Yes, uhm, that's not compatible with Apple's design principles. They've been the bleeding edge of mobile phones since the first iPhone, and I'm sure they know best!

You know, now that you mention it, it does look almost identical to the original iPhone screen layout! Imagine that, being right on your first try! Very impressive. Oh well, no problem, I'm happy you enjoy your phone ❤️

🤢

[-] dbilitated@aussie.zone 15 points 2 years ago

I'd rather like it if they train it on stuff I say. I want the AI of tomorrow to reflect my thoughts.

seriously I would much prefer gold tier journalism and news sites let it crawl so when people use it to make choices in the future they're guided to better choices.

it is honestly so hard to know what will happen though, it's so complicated it's virtually guaranteed we're not correctly anticipating the consequences of any of this. I'm not really even talking about the AI, I'm talking about the effects on society which are a lot more complex.

[-] Carion@lemmy.antemeridiem.xyz 13 points 2 years ago

It's just about the money really, they want their cut of the AI money cake.

[-] dbilitated@aussie.zone 1 points 2 years ago

I'm actually cool with that. have a big dataset? charge a fee to use it for training

[-] GoOnASteamTrain@lemmy.ml 13 points 2 years ago

Thank you :) just added this to my robots.txt :)
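For reference, the rule OpenAI published for opting out is a `User-agent: GPTBot` group with `Disallow: /`. The commenter's actual file isn't shown, so the sketch below simply assumes that two-line rule and uses Python's standard `urllib.robotparser` to check what it permits; `allowed` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: block GPTBot from the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, `allowed("GPTBot", "https://example.com/post/1")` is false while other crawlers are unaffected. robots.txt is advisory, so it only works for crawlers that choose to honor it.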

this post was submitted on 13 Aug 2023

179 points (100.0% liked)
