If you wanted your web page excluded from being crawled or indexed by search engines and other robots, robots.txt was your tool of choice, with some additions like <meta name="robots" content="noindex" /> or <a href="…" rel="nofollow"> sprinkled in.
It is getting more complicated with AI crawlers. Let’s have a look.
Traditional functions
- One of the first goals of robots.txt was to prevent web crawlers from hogging the bandwidth and compute power of a web site, especially if the site contained dynamically generated content, possibly infinite in extent (a classic example follows this list).
- Another important goal was to prevent pages or their content from being found using search engines. There, the above-mentioned <meta> and <a rel> tags came in handy as well.
- A non-goal was to use it for access control, even though it was frequently misunderstood to be useful for that purpose.
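As an illustration of the first two goals, a classic robots.txt might look like the sketch below. The paths and the bot name are made up for the example; the file is purely advisory.

```
# robots.txt at the site root – advisory only, not access control
User-agent: *
Disallow: /search/     # dynamically generated, effectively infinite
Disallow: /drafts/     # should not show up in search results

User-agent: SomeBot    # hypothetical crawler asked to stay away entirely
Disallow: /
```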
New challenges
Changes over time
The original controls were aimed at services which re-inspect the robots.txt file and the web page on a regular basis and update their index accordingly.
Therefore, this model did not work well for archive sites: Should they delete old content on policy changes or keep it archived? This is even more true for AI training material, as deleting training material from existing models is very costly.
Commercialization
Even in the pre-AI age, some sites desired a means to prevent commercial services from monetizing their content. However, both the number of such services and their opponents were small.
With the advent of AI scraping, the problem became more prominent, resulting, among other things, in changes to EU copyright law that allow web site owners to specify whether they want their site crawled for text and data mining.
As a result, the Text and Data Mining Reservation Protocol Community Group of the World Wide Web Consortium proposed a protocol that allows fine-grained indication of which content on a web site is free to crawl and which parts require (financial or other) agreements. The proposal includes options for a global policy document or for returning headers (or <meta> tags) with each response.
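As a sketch of what the proposed mechanisms look like: a site could publish a policy file at a well-known location (/.well-known/tdmrep.json in the draft), roughly as below. The paths and the policy URL are invented for illustration, and the field names follow my reading of the draft, so check the current specification before relying on them.

```
[
  { "location": "/articles/", "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-licence.json" },
  { "location": "/press/",    "tdm-reservation": 0 }
]
```

The per-response alternative would be headers such as tdm-reservation: 1 and tdm-policy: https://example.com/tdm-licence.json, or the corresponding <meta name="tdm-reservation" content="1"> tag.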
A few days ago, Google also started a competing initiative to augment robots.txt.
None of these new features are implemented yet, neither in web servers or CMSes, nor by crawlers. So we need workarounds.
[Added 2023-08-31] Another upcoming “standard” is ai.txt, modeled after robots.txt. It distinguishes among media types and tries to fulfill the EU TDM directive. In a web search today, I did not find crawler support for it, either.
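I have not verified the exact ai.txt syntax, so the following is only a rough sketch by analogy with robots.txt, with per-media-type rules; the directive names and patterns here are assumptions, not the official format.

```
# ai.txt – unverified sketch only, not the official syntax
User-Agent: *
Disallow: *.jpg      # e.g. opt images out of text and data mining ...
Allow: *.html        # ... while leaving plain text pages crawlable
```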
AI crawlers
Unfortunately, the robots.txt database stopped receiving updates around 2011. So, here is an attempt at keeping a list of robots related to AI crawling.
| Organization | Bot | Notes |
|---|---|---|
| Common Crawl | CCBot | Used for many purposes. |
| OpenAI | GPT | Commonly listed as their crawler. However, I could not find any documentation on their site and no instances in my server logs. |
| | GPTBot | The crawler used for further refinement. [Added 2023-08-09] |
| | ChatGPT-User | Used by their plugins. |
| Google Bard | — | Apparently no separate crawl for Bard. |
| Meta AI | — | No information for LLaMA. |
Please let me know when additional information becomes available.
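Until the newer mechanisms are supported, the practical workaround is to address these user agents in robots.txt directly, for example as below. Whether a bot honors this is entirely up to its operator, although CCBot and GPTBot document that they respect robots.txt.

```
# Ask the AI-related crawlers from the table above to stay away
User-agent: CCBot
User-agent: GPTBot
User-agent: ChatGPT-User
Disallow: /
```

Several User-agent lines may share one group of rules; listing each bot in its own group works just as well.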
Control comparison
Search engines and social networks support quite a bit of control over what is indexed and how it is used/presented.
- Prevent crawling by robots.txt, HTML tags, and paywalls.
- Define preview information by HTML tags, Open Graph, Twitter Cards, … (a sketch follows this list).
- Define preview presentation using oEmbed.
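For comparison, those preview controls amount to a handful of tags in the page head; all values below are placeholders.

```
<!-- indexing and link-following opt-out -->
<meta name="robots" content="noindex, nofollow">
<!-- Open Graph / Twitter Card preview information -->
<meta property="og:title" content="Example article">
<meta property="og:description" content="Short summary shown in link previews">
<meta property="og:image" content="https://example.com/preview.png">
<meta name="twitter:card" content="summary_large_image">
<!-- oEmbed discovery for preview presentation -->
<link rel="alternate" type="application/json+oembed"
      href="https://example.com/oembed?format=json&amp;url=https%3A%2F%2Fexample.com%2Fpost">
```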
For AI use of content, most of this is lacking, and “move fast and break things” still seems to be the motto. Giving users fine-grained control over how their content is used would help with the discussions.
Even though the users in the end might decide they actually do want to have (most or all of) their content indexed for AI and other text processing…