How to block AI crawlers with robots.txt [2 Updates]

If you wanted your web page excluded from being crawled or indexed by search engines and other robots, robots.txt was your tool of choice, with additions such as <meta name="robots" content="noindex" /> or <a href="…" rel="nofollow"> sprinkled in.

It is getting more complicated with AI crawlers. Let’s have a look.

Traditional functions

  • One of the first goals of robots.txt was to prevent web crawlers from hogging the bandwidth and compute power of a web site, especially if the site contained dynamically generated, possibly infinite content (see the sketch after this list).
  • Another important goal was to prevent pages or their content from being found using search engines. There, the above-mentioned <meta> and <a rel> tags came in handy as well.
  • A non-goal was to use it for access control, even though it was frequently misunderstood to be useful for that purpose.
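
As an illustration of the first point, here is a minimal robots.txt sketch that asks well-behaved crawlers to stay out of a dynamically generated, potentially infinite part of a site (the paths are made up for illustration):

  # Keep all crawlers out of dynamically generated,
  # potentially infinite areas (example paths only)
  User-agent: *
  Disallow: /calendar/
  Disallow: /search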

New challenges

Changes over time

The original controls were aimed at services which would re-inspect the robots.txt file and the web pages on a regular basis and update their index accordingly.

Therefore, they did not work well for archive sites: Should those delete old content on policy changes, or keep it archived? This is even more true for AI training material, as removing it from existing models is very costly.

Commercialization

Even in the pre-AI age, some sites wanted a means to prevent commercial services from monetizing their content. However, both the number of such services and the number of objecting sites were small.

With the advent of AI scraping, the problem became more prominent, resulting, for example, in changes to EU copyright law that allow web site owners to specify whether they want their site crawled for text and data mining.

As a result, the Text and Data Mining Reservation Protocol Community Group of the World Wide Web Consortium proposed a protocol for fine-grained indication of which content on a web site is free to crawl and which parts require (financial or other) agreements. The proposal includes options for a global policy document or for returning headers (or <meta> tags) with each response.
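
To make this concrete, here is a hedged sketch of both variants, based on my reading of the TDMRep draft; the policy URL and paths are invented for illustration. The per-response variant adds headers (or equivalent <meta> tags):

  TDM-Reservation: 1
  TDM-Policy: https://example.com/tdm-policy.json

The global variant is a policy document, reportedly served at /.well-known/tdmrep.json:

  [
    { "location": "/articles/", "tdm-reservation": 1,
      "tdm-policy": "https://example.com/tdm-policy.json" },
    { "location": "/public/", "tdm-reservation": 0 }
  ]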

Also, Google started a competing initiative to augment robots.txt a few days ago.

None of these new features are implemented yet, neither in web servers and CMSs nor by crawlers. So we need workarounds.

[Added 2023-08-31] Another upcoming “standard” is ai.txt, modeled after robots.txt. It distinguishes among media types and tries to fulfill the EU TDM directive. In a web search today, I did not find crawler support for it, either.

AI crawlers

Unfortunately, the robots.txt database stopped receiving updates around 2011. So, here is an attempt at keeping a list of robots related to AI crawling.

Organization    Bot             Notes
Common Crawl    CCBot           Used for many purposes.
OpenAI GPT      OpenAI          Commonly listed as their crawler. However, I could not find any documentation on their site and no instances in my server logs.
                GPTBot          The crawler used for further refinement. [Added 2023-08-09]
                ChatGPT-User    Used by their plugins.
Google Bard     —               Apparently no separate crawl for Bard.
Meta AI         —               No information for LLaMA.
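
For the crawlers above that honor robots.txt, a minimal blocking sketch looks like this (the user-agent tokens are the ones listed in the table; whether a given bot actually respects them is up to its operator):

  User-agent: CCBot
  Disallow: /

  User-agent: GPTBot
  Disallow: /

  User-agent: ChatGPT-User
  Disallow: /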

Please let me know when additional information becomes available.

Control comparison

Search engines and social networks support quite a bit of control over what is indexed and how it is used/presented.
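
For comparison, here is a quick sketch of the kind of per-page controls search engines already understand (directive support varies by engine; the values are just examples). As a <meta> tag:

  <!-- Allow indexing, but no cached copy and only a short snippet -->
  <meta name="robots" content="noarchive, max-snippet:50" />

or as the equivalent HTTP response header:

  X-Robots-Tag: noarchive, max-snippet:50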

For use of content by AI, most of this is lacking, and “move fast and break things” is still the motto. Giving users fine-grained control over how their content is used would help with the discussions.

Even though users might, in the end, decide that they actually do want (most or all of) their content indexed for AI and other text processing…
