Journal tags: robots.txt

Switch

A bit has been flipped on Google Search.

Previously, the Googlebot would index any web page it came across, unless a robots.txt file said otherwise.

Now, a robots.txt file is required in order for the Googlebot to index a website.

This puzzles me. Until now, Google was all about “organising the world’s information and making it accessible.” This switch-up will limit “the world’s information” to “the information on websites that have a robots.txt file.”

They’re free to do this. Despite what some people think, Google isn’t a utility. It’s a business. Other search engines are available, with different business models. Kagi. DuckDuckGo. Google != the World Wide Web.

I am curious about this latest move with Google Search though. I’d love to know if it only applies to Google’s search bot. Google has other bots out crawling the web: AdsBot-Google, Google-Extended, Googlebot-Image, GoogleOther, Mediapartners-Google. I’m probably missing a few.

If the new default only applies to the search bot and doesn’t include, say, the crawler that’s fracking the web in order to train Google’s large language model, then this is how things work now:

  • Your website won’t appear in search results unless you explicitly opt in.
  • Your website will be used as training data unless you explicitly opt out.
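One way to picture those two defaults, assuming that a permissive robots.txt file is what counts as opting in, is a file that allows everything and then singles out Google-Extended. Here’s a rough sketch that checks such a file with Python’s standard-library urllib.robotparser. The file contents and the example.com URL are placeholders of mine, and that parser only approximates how any real crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow everything by default, then opt out
# for the Google-Extended token only.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

PAGE = "https://example.com/journal/"  # placeholder URL

# Googlebot has no group of its own, so it falls back to the "*" group.
search_allowed = parser.can_fetch("Googlebot", PAGE)
# Google-Extended matches its own group and is disallowed everywhere.
training_allowed = parser.can_fetch("Google-Extended", PAGE)

print("Googlebot allowed:", search_allowed)
print("Google-Extended allowed:", training_allowed)
```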

It would be good to get some clarity on this. Alas, the Google Search team are notoriously tight-lipped so I’m not holding my breath.

Crawlers

A few months back, I wrote about how Google is breaking its social contract with the web, harvesting our content not in order to send search traffic to relevant results, but to feed a large language model that will spew auto-completed sentences instead.

I still think Chris put it best:

I just think it’s fuckin’ rude.

When it comes to the crawlers that are ingesting our words to feed large language models, Neil Clarke describes the situation:

It should be strictly opt-in. No one should be required to provide their work for free to any person or organization. The online community is under no responsibility to help them create their products. Some will declare that I am “Anti-AI” for saying such things, but that would be a misrepresentation. I am not declaring that these systems should be torn down, simply that their developers aren’t entitled to our work. They can still build those systems with purchased or donated data.

Alas, the current situation is opt-out. The onus is on us to update our robots.txt file.

Neil handily provides the current list to add to your file. Pass it on:

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: FacebookBot
Disallow: /

In theory you should be able to group those user agents together, but citation needed on whether that’s honoured everywhere:

User-agent: CCBot
User-agent: ChatGPT-User
User-agent: GPTBot
User-agent: Google-Extended
User-agent: Omgilibot
User-agent: FacebookBot
Disallow: /
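As one data point (not the citation I’m after), Python’s standard-library urllib.robotparser does treat consecutive User-agent lines as a single group. A quick sketch, with example.com standing in as a placeholder URL. It says nothing about how the actual crawlers behave.

```python
from urllib.robotparser import RobotFileParser

AGENTS = [
    "CCBot",
    "ChatGPT-User",
    "GPTBot",
    "Google-Extended",
    "Omgilibot",
    "FacebookBot",
]

# Build the grouped form: one User-agent line per bot, then a single rule.
grouped = [f"User-agent: {name}" for name in AGENTS] + ["Disallow: /"]

parser = RobotFileParser()
parser.parse(grouped)

SITE = "https://example.com/"  # placeholder URL

# Every listed agent should be refused by the shared "Disallow: /" rule.
blocked = {name: not parser.can_fetch(name, SITE) for name in AGENTS}
# An agent that isn't listed (and with no "*" group) is allowed by default.
others_ok = parser.can_fetch("Googlebot", SITE)

for name, is_blocked in blocked.items():
    print(name, "blocked" if is_blocked else "allowed")
print("Googlebot", "allowed" if others_ok else "blocked")
```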

There’s a bigger issue with robots.txt though. It too is a social contract. And as we’ve seen, when it comes to large language models, social contracts are being ripped up by the companies looking to feed their beasts.

As Jim says:

I realized why I hadn’t yet added any rules to my robots.txt: I have zero faith in it.

That realisation was prompted in part by Manuel Moreale’s experiment with blocking crawlers:

So, what’s the takeaway here? I guess that the vast majority of crawlers don’t give a shit about your robots.txt.

Time to up the ante. Neil’s post offers an option if you’re running Apache. Either in .htaccess or in a .conf file, you can block user agents using mod_rewrite:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot|Omgilibot|FacebookBot) [NC]
RewriteRule ^ - [F]
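To sanity-check which requests a pattern like that would catch, here’s a small Python approximation. It assumes the alternation has no stray whitespace, treats [NC] as a case-insensitive flag, and uses made-up User-Agent strings purely for illustration. Apache’s regex engine isn’t Python’s, but for a plain alternation like this the two should agree.

```python
import re

# Approximation of the RewriteCond pattern; [NC] maps to re.IGNORECASE.
BLOCKED = re.compile(
    r"(CCBot|ChatGPT|GPTBot|Omgilibot|FacebookBot)", re.IGNORECASE
)


def would_be_forbidden(user_agent: str) -> bool:
    """Return True if the pattern matches anywhere in the header value."""
    return BLOCKED.search(user_agent) is not None


# Hypothetical header values, not copied from any real crawler.
samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "CCBot/2.0",
    "facebookbot",
    "ChatGPT-User",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",
]

results = {ua: would_be_forbidden(ua) for ua in samples}
for ua, forbidden in results.items():
    print("403" if forbidden else "ok ", ua)
```

Bear in mind that the User-Agent header is self-reported, so this only stops crawlers that announce themselves honestly.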

You’ll see that Google-Extended isn’t in that list. It isn’t a crawler. Rather, it’s the permissions model that Google have implemented for using your site’s content to train large language models: unless you opt out via robots.txt, it’s assumed that you’re totally fine with your content being used to feed their stochastic parrots.