With the rise of LLMs and their need for vast amounts of training data, there is also a need for a way to indicate whether data may be used to train such models. This article proposes two ways for websites to opt out of being crawled for training purposes.
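One of the simplest opt-out mechanisms is a robots.txt entry addressed at known training crawlers. As an illustrative sketch (the specific bot names shown are real published crawler tokens, but the full list of relevant crawlers changes over time and would need to be kept current):

```
# Sketch: disallow crawlers known to gather LLM training data.
# GPTBot (OpenAI) and CCBot (Common Crawl) are documented tokens;
# this list is illustrative, not exhaustive.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Like any robots.txt rule, this is purely advisory: it only works if the crawler chooses to read and honor it.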
This is done while being fully aware of the limitations of relying on the developers of AI models to respect such a convention. Server-side blocks could achieve this goal more effectively, but they too rely on the convention of bots identifying themselves via a user-agent string, and it is increasingly difficult to maintain a comprehensive list of user agents used to crawl training data. Establishing a voluntary self-commitment can at least provide a convenient way of dealing with good-faith actors, while at the same time providing a way of identifying bad-faith actors.
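A server-side block of the kind mentioned above could look like the following nginx sketch, which rejects requests whose User-Agent header matches a hypothetical, manually maintained list of training crawlers (the bot names are examples; the pattern would need ongoing curation):

```
# Sketch (nginx, inside a server block): refuse requests from
# self-identified training crawlers. The list is illustrative and
# must be maintained by hand as new crawlers appear.
if ($http_user_agent ~* "(GPTBot|CCBot)") {
    return 403;
}
```

This only stops bots that identify themselves honestly, which is exactly the limitation the article describes: a crawler that spoofs a browser user-agent passes straight through.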