How To Protect Your Content Against LLM Scrapers

Patrade


Generative AI and the large language models (LLMs) powering services such as ChatGPT are having a field day. The recent introduction of an exception for text and data mining in the DSM Directive means that you, as a rights holder, must now actively do something to avoid your works being used to train the models.

What to do?

From the DSM Directive (2019/790), we know that the rights holder must "expressly" reserve the use of the work for text and data mining in an "appropriate manner". The directive elaborates that if the content is publicly available online, it may be appropriate to do so with "machine-readable means, including metadata and terms and conditions of a website or a service". The reservation can also be made by contractual agreement or a unilateral declaration.

And that is it. In the absence of standards in the area (which may come), this is what we know so far.

To translate this into concrete advice, it helps to look at it from the LLM provider's point of view. Providers must adapt their models and scrapers to the exception under the DSM Directive and have a policy for how they do so as early as one year after the AI Act is published in the Official Journal of the EU (for the interested, see Articles 53(1)(c) and 113(b) of the AI Act).

And publication is right around the corner.

We don't yet know much about LLM providers' approaches to complying with the exception, but we can safely assume that they have little interest in going beyond what is absolutely required. Their interest is first and foremost to collect as much data as possible. We therefore expect LLM providers to lean on the sparse guidance in the legislation and interpret it as narrowly as they dare.

When an LLM provider reads the directive, it will probably latch on to "metadata" and "terms and conditions" and configure its scraper to look for exactly that. When scraping a website, the first thing a scraper typically does is download the site's sitemap, which gives it a list of all pages and subpages on the domain. In that list, the provider will (probably) look for a page called something like "terms and conditions" and check whether a reservation has been made for text and data mining. The provider will also (probably) make sure the scraper scans all the meta tags of the pages it wants to scrape.
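To make this concrete, here is a minimal sketch of the kind of opt-out check described above: read the sitemap, look for a terms page containing an express text-and-data-mining reservation, and scan the front page's meta tags. How the providers' scrapers actually behave is not public, so the URLs, tag names and keywords below are illustrative assumptions, not a known implementation.

```python
# A minimal sketch of the opt-out check described above. How real LLM
# scrapers behave is not public; the URLs, tag names and keywords here
# are illustrative assumptions, not a known implementation.
import re
import urllib.request
from html.parser import HTMLParser


def fetch(url: str) -> str:
    """Download a page as text (error handling omitted for brevity)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


class MetaTagCollector(HTMLParser):
    """Collect the attributes of every <meta> tag on a page."""

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta.append(dict(attrs))


def site_opts_out(domain: str) -> bool:
    """Rough heuristic: does this site appear to reserve text and data mining?"""
    # 1. Read the sitemap to get the list of pages on the domain.
    sitemap = fetch(f"https://{domain}/sitemap.xml")
    urls = re.findall(r"<loc>(.*?)</loc>", sitemap)

    # 2. Look for a "terms and conditions"-style page with an express reservation.
    for url in (u for u in urls if "terms" in u.lower()):
        text = fetch(url).lower()
        if "text and data mining" in text and "reserve" in text:
            return True

    # 3. Scan the front page's meta tags for an opt-out signal.
    collector = MetaTagCollector()
    collector.feed(fetch(f"https://{domain}/"))
    for attrs in collector.meta:
        content = (attrs.get("content") or "").lower()
        if "text and data mining" in content or "noai" in content:
            return True

    return False


if __name__ == "__main__":
    print(site_opts_out("example.com"))  # hypothetical domain
```

A rights holder can run the same kind of check against their own domain to confirm that the opt-out signals are actually discoverable.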

How to opt out of text and data mining?

My best advice is:

  • Create a subpage called "terms and conditions" on your website. It does not necessarily have to be reachable via a link; as a minimum, it just needs to be present in the sitemap. Write something like: "The use of all content on this website for the purposes of text and data mining is expressly reserved, cf. Article 4(3) of the DSM Directive (2019/790)."
  • On all pages and subpages with content you do not want scraped, make sure to add a meta tag in the underlying HTML code that signals the reservation, or add the reservation wording above to the existing description meta tag (see the sketch below).
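As a sketch of how the meta tag advice could be applied to a static website, the script below stamps a reservation meta tag into every HTML page. The tag names are only candidate signals (the first follows the W3C TDM Reservation Protocol proposal, the second is our own illustrative name); no particular tag has been endorsed by the legislator or the courts, so treat the names and wording as illustrative.

```python
# A minimal sketch: insert candidate reservation meta tags into static HTML
# pages. "tdm-reservation" follows the W3C TDM Reservation Protocol proposal;
# "tdm-reservation-text" is a hypothetical name carrying the reservation
# wording in human-readable form. Adapt both to your own site setup.
from pathlib import Path

RESERVATION = (
    "The use of all content on this website for the purposes of text and "
    "data mining is expressly reserved, cf. Article 4(3) of the DSM "
    "Directive (2019/790)."
)

META_TAGS = (
    '  <meta name="tdm-reservation" content="1">\n'
    f'  <meta name="tdm-reservation-text" content="{RESERVATION}">\n'
)


def add_opt_out(html_dir: str) -> None:
    """Insert the reservation meta tags right after <head> in each .html file."""
    for path in Path(html_dir).rglob("*.html"):
        html = path.read_text(encoding="utf-8")
        if "tdm-reservation" in html:
            continue  # already tagged
        # Assumes a plain <head> tag; adapt if your templates use attributes.
        html = html.replace("<head>", "<head>\n" + META_TAGS, 1)
        path.write_text(html, encoding="utf-8")


if __name__ == "__main__":
    add_opt_out("./site")  # hypothetical folder containing your static pages
```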

Robots.txt has been suggested by several in the industry as a solution, but as the standard stands today, it does not seem sufficient on its own. The legislator has simply not mentioned it explicitly, and we do not yet know whether the courts will consider it sufficient when the legislator has pointed to other solutions.
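If you want belt and braces, a robots.txt entry for the AI-training crawlers you know of can still be added alongside the measures above. The sketch below assumes commonly cited user-agent names such as GPTBot, CCBot and Google-Extended; these change over time and should be verified against each provider's current documentation.

```python
# A sketch of a supplementary robots.txt aimed at commonly cited AI-training
# crawlers. The user-agent names change over time and should be checked
# against each provider's current documentation; as noted above, robots.txt
# alone is probably not a sufficient opt-out.
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended"]

robots_txt = "\n".join(f"User-agent: {bot}\nDisallow: /\n" for bot in AI_CRAWLERS)

with open("robots.txt", "w", encoding="utf-8") as f:
    f.write(robots_txt)

print(robots_txt)
```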

Over time, standards will emerge, and it will be interesting to follow how the courts interpret the rules. In particular, we are waiting for clarity on the threshold for when a website owner has made a proper opt-out. Until then, we recommend that you follow the above recommendations if you, as a website owner, want to opt out of your content being used by LLMs.

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.
