Article · 03.05.2024

How to protect your content against the LLM-scraper

Generative AI and the large language models (LLMs) powering services such as ChatGPT are having a field day. The exception for text and data mining introduced by the DSM-Directive means that you as a rights holder must now actively opt out if you want to avoid your works being used to train the models.

Author: Mathias Bartholdy, Senior Attorney (L)

What to do?

From the DSM-directive (2019/790), we know that the rights holder must "expressly" reserve the use of the work for text and data mining in an "appropriate manner". The directive elaborates that if the content is publicly available online, it may be appropriate to do so with "machine-readable means, including metadata and terms and conditions of a website or a service". It can also be done by contractual agreements or a unilateral declaration.

And that is it. In the absence of standards in the area (which may come), this is what we know so far.

To translate this into concrete advice, it helps to look at it from the LLM provider's point of view. Providers must adapt their models and scrapers to the exception under the DSM-directive and must have a policy in place for doing so as early as one year after the AI Act is published in the Official Journal of the EU (for the interested, see Articles 53(1)(c)+113(b) of the AI Act).

And publication is right around the corner.

We don't yet know much about how LLM providers will approach compliance with the exception, but we can safely assume that they have little interest in going beyond what is strictly required. Their interest is first and foremost to collect as much data as possible. We therefore expect LLM providers to lean on the sparse guidance in the legislation and to interpret it as narrowly as they dare.

When an LLM provider reads the directive, it will probably pay attention to "meta tags" and "terms and conditions" and configure its scraper to look for just that. When scraping a website, the first thing a scraper typically does is download the website's sitemap. From the sitemap, the scraper gets a list of all pages and subpages that exist on the domain. In that list, the LLM provider will (probably) look for a page called something like "terms and conditions" and check whether it contains a reservation against text and data mining. The provider will also (probably) make sure that the scraper scans all the meta tags of the pages it wants to scrape.
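To make the assumed scraper behaviour more concrete, here is a minimal Python sketch of the sitemap step. It is only an illustration of what a provider might do, not a description of any real scraper: the keyword list, the function name `find_terms_pages` and the sample sitemap are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical keywords a scraper might use to spot a terms page.
TERMS_KEYWORDS = ("terms", "conditions")


def find_terms_pages(sitemap_xml: str) -> list[str]:
    """Return the sitemap URLs that look like a terms-and-conditions page."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    # Keep only URLs containing one of the assumed keywords.
    return [u for u in urls if any(k in u.lower() for k in TERMS_KEYWORDS)]


# Illustrative sitemap; example.com and the paths are placeholders.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/articles/llm-scraping</loc></url>
  <url><loc>https://example.com/terms-and-conditions</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(find_terms_pages(SAMPLE_SITEMAP))
```

The point of the sketch is that a page only needs to be listed in the sitemap to be found this way; it does not need to be linked from the rest of the site.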

How to opt out of text and data mining?

My best advice is:

• Create a subpage called "terms and conditions" on your website. It doesn't necessarily have to be reachable via a link. As a minimum, it just needs to be present in the sitemap. Write something like "We expressly reserve all content on our website against use for text and data mining, cf. article 4(3) of the DSM-directive (2019/790)".

• On all pages and subpages with content you do not want scraped, make sure to add a meta tag to the underlying HTML code, e.g. <meta name="text-and-data-mining" content="no">, or add something like the following to the existing description meta tag: <meta name="description" content="[current description of the (sub-)site]. No text and data mining, DSM-directive 2019/790 article 4(3)">
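If you want to verify that your pages actually carry one of the two reservations suggested above, a small check along the following lines may help. It is a sketch under stated assumptions: the meta name "text-and-data-mining" and the phrase "No text and data mining" are the examples from this article, not an established standard, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Both values are this article's examples, not an established standard.
TDM_META_NAME = "text-and-data-mining"
TDM_PHRASE = "no text and data mining"


class TdmMetaParser(HTMLParser):
    """Look for a text-and-data-mining reservation in a page's <meta> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.reserved = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when an attribute has no value.
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        content = attr.get("content", "").lower()
        if name == TDM_META_NAME and content == "no":
            self.reserved = True
        elif name == "description" and TDM_PHRASE in content:
            self.reserved = True


def has_tdm_reservation(html: str) -> bool:
    """Return True if the HTML contains either suggested reservation."""
    parser = TdmMetaParser()
    parser.feed(html)
    return parser.reserved


if __name__ == "__main__":
    page = '<html><head><meta name="text-and-data-mining" content="no"></head></html>'
    print(has_tdm_reservation(page))
```

Running the check across every URL in your sitemap is one way to catch pages or templates where the tag was left out.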

Robots.txt has been suggested by several people in the industry as a solution, but as the standard stands today, it does not seem to be sufficient on its own. The legislator has simply not mentioned the solution explicitly, and we do not yet know whether the courts will consider it sufficient when the legislator has proposed other solutions.
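For readers unfamiliar with robots.txt, the sketch below shows what such a rule looks like and how a well-behaved crawler would read it, using Python's standard `urllib.robotparser` module. The crawler name "ExampleAIBot" and the function `may_fetch` are hypothetical and used only for illustration. Nothing in the sketch changes the legal assessment above.

```python
from urllib.robotparser import RobotFileParser

# "ExampleAIBot" is a hypothetical crawler name used for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/llm-scraping"
    print(may_fetch(SAMPLE_ROBOTS_TXT, "ExampleAIBot", url))
    print(may_fetch(SAMPLE_ROBOTS_TXT, "SomeOtherBot", url))
```

Note that a robots.txt rule only works if the crawler chooses to honour it and if you know which user agent names to block.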

Over time, standards will emerge, and it will be exciting to follow how the courts interpret the rules. In particular, we are waiting to learn more about the threshold for when you as a website owner have made a proper opt-out. Until then, we recommend that you follow the recommendations above if you as a website owner want to opt out of your content being used by LLMs.