With the rise of LLMs and their need for vast amounts of training data, there is also a need for a way to indicate whether data may be used to train such models. This article proposes two ways for websites to opt out of having their content used for training purposes.
This is done while being fully aware of the limitations of relying on the developers of AI models to respect this convention. Server-side blocks could achieve this goal more effectively, but they also rely on the convention of bots identifying themselves by a user-agent, and it is increasingly difficult to maintain a comprehensive list of user-agents used to crawl training data. Establishing a voluntary self-commitment can at least provide a convenient way of dealing with good-faith actors, while at the same time providing a way of identifying bad-faith actors.
UPDATE (01.01.2025): There is an almost identical Internet-Draft by Fabrice Canel and Krishna Madhavan that I wasn’t aware of at the time of writing. See Prior Work.
Specification
This proposal builds on the well-established Robots Exclusion Protocol and provides an option for robots.txt, as defined in RFC 9309, and the robots <META> tag.
robots.txt
The first proposal is to add a directive to the robots.txt standard that allows websites to indicate whether the data they provide may be used for training purposes, without having to opt out of being crawled altogether.
user-agent: *
model-training: disallow
Possible values for model-training are allow (default) and disallow.
All crawlers collecting data for AI training MUST evaluate this directive after evaluating whether they are allowed to access the URL in question. The most specific match found MUST be used.
The value allow indicates that a resource MAY be used for model training, whereas a value of disallow indicates that the resource MUST NOT be used for model training.
To illustrate, consider the following example:
user-agent: *
disallow: /private
model-training: disallow

user-agent: *
allow: /ai
model-training: allow
The first group indicates that bots are allowed to crawl the whole site, except for URLs matching /private, but are not allowed to use the collected data for training purposes. The second group adds a more specific match, indicating that URLs matching /ai may be used for training purposes.
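The evaluation described above could be sketched in Python roughly as follows. This is only one possible reading of "most specific match", not a reference implementation: it assumes that the longest matching path prefix across all applicable groups decides which group's model-training value applies, and that a group whose rules do not match the path still applies to the whole site with a match length of zero. The names Group, parse_groups and may_train are hypothetical, wildcard patterns are not handled, user-agent matching is simplified to `*` or an exact token, and the ordinary access check is assumed to have happened beforehand.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Group:
    """One robots.txt group (hypothetical helper type for this sketch)."""
    agents: list = field(default_factory=list)   # lower-cased user-agent tokens
    rules: list = field(default_factory=list)    # (directive, path prefix) pairs
    training: str | None = None                  # value of model-training, if set


def parse_groups(text: str) -> list:
    """Split robots.txt text into groups, keeping the proposed directive."""
    groups = []
    current = None
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            # Consecutive user-agent lines share one group; otherwise a new group starts.
            if current is None or not last_was_agent:
                current = Group()
                groups.append(current)
            current.agents.append(value.lower())
            last_was_agent = True
            continue
        last_was_agent = False
        if current is None:
            continue
        if key in ("allow", "disallow") and value:
            current.rules.append((key, value))
        elif key == "model-training":
            current.training = value.lower()
    return groups


def may_train(groups: list, path: str, agent: str = "*") -> bool:
    """Return True if `path` may be used for training (default: allow)."""
    best_length = -1
    verdict = "allow"
    for group in groups:
        if group.training is None:
            continue
        if not any(a == "*" or a == agent.lower() for a in group.agents):
            continue
        # Assumption: a group without a matching rule still covers the site (length 0).
        lengths = [len(prefix) for _, prefix in group.rules if path.startswith(prefix)]
        length = max(lengths, default=0)
        if length > best_length:                 # earlier group wins on ties
            best_length, verdict = length, group.training
    return verdict == "allow"


EXAMPLE = """\
user-agent: *
disallow: /private
model-training: disallow

user-agent: *
allow: /ai
model-training: allow
"""

groups = parse_groups(EXAMPLE)
print(may_train(groups, "/ai/dataset"), may_train(groups, "/blog/post"))
```

Under these assumptions, a path below /ai is matched most specifically by the second group, while every other path falls back to the first group's site-wide disallow.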
Robots <META> tag
The second proposal is to add TRAINING and NOTRAINING as additional possible values to the robots <META> tag.
The robots <META> tag is a de facto standard described by robotstxt.org and can take the values INDEX, NOINDEX, FOLLOW and NOFOLLOW. The newly introduced values would indicate, in the case of TRAINING, that the resource MAY be used for model training and, in the case of NOTRAINING, that it MUST NOT be used for model training.
The following <META> tag indicates that an HTML resource may be indexed and used for model training:
<META NAME="ROBOTS" CONTENT="INDEX, TRAINING">
The following <META> tag still indicates that an HTML resource may be indexed, but excludes the resource from being used for model training:
<META NAME="ROBOTS" CONTENT="INDEX, NOTRAINING">
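A crawler could read these values with a small parser along the following lines. This is a minimal sketch using Python's standard html.parser module; the names RobotsMetaParser and may_train_on are hypothetical. It also makes two assumptions the proposal does not spell out: a page without either value is treated as allowing training (mirroring the allow default of the robots.txt directive), and NOTRAINING wins if both values appear.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the values of every robots <META> tag in a document."""

    def __init__(self):
        super().__init__()
        self.values = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for value in attributes.get("content", "").split(","):
            value = value.strip().upper()
            if value:
                self.values.add(value)


def may_train_on(html: str) -> bool:
    """Return False only if the page carries NOTRAINING (assumed default: allow)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "NOTRAINING" not in parser.values


print(may_train_on('<META NAME="ROBOTS" CONTENT="INDEX, NOTRAINING">'))
```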
Prior Work
ai.robots.txt maintains a list of known user agents of robots used for model training. While such a list is necessary to block crawlers on the server side, it is not sustainable considering the growing number of organizations trying to train AI models.
Update (01.01.2025)
Fabrice Canel and Krishna Madhavan submitted an almost identical proposal: https://www.ietf.org/archive/id/draft-canel-robots-ai-control-00.html
Since their I-D and my proposal differ only in minor naming conventions, this article can be seen as an endorsement of their work, and I hope it will lead to an RFC and widespread adoption.
J. Jimenez (Ericsson) suggests in yet another I-D to encode the purpose of a crawler in its user-agent and to extend the robots.txt syntax to allow matching against these purposes.
Although I have considered a similar approach with meta-user-agents, I have abandoned the idea in favour of an additional directive, as this can be achieved without introducing any backward compatibility issues (model-training would be considered an “Other Record” by RFC 9309). Using regular expressions to match user-agents is also likely to introduce further problems, given the diversity of existing user agents and the need to change already established AI crawler user agents.
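The backward-compatibility argument can be checked informally against an existing parser. The sketch below feeds a robots.txt containing the proposed directive to Python's standard urllib.robotparser, which to my knowledge simply skips keys it does not recognise; the crawler name ExampleBot and the paths are made up for illustration. It shows only that one existing parser keeps working, not that every implementation behaves the same way.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
user-agent: *
disallow: /private
model-training: disallow
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# The unknown model-training line should not affect ordinary access checks.
can_fetch_blog = parser.can_fetch("ExampleBot", "/blog/post")
can_fetch_private = parser.can_fetch("ExampleBot", "/private/notes")
print(can_fetch_blog, can_fetch_private)
```

A parser that is unaware of model-training still applies the existing allow and disallow rules, which is the behaviour the additional directive relies on.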