Want to learn more about drafting, negotiating, and understanding intellectual property and technology contracts and have 10 minutes to spare? Grab your morning coffee or afternoon tea and dig into our Tech Contract Quick Bytes—small servings of technical contract insights expertly prepared by our seasoned attorneys. This month, we're discussing generative AI and web scraping.
Shrewd businesses often leverage cutting-edge technology to be more efficient or to offer new products or services. But if companies fail to conduct legal risk assessments before using innovative technology, the anticipated benefits can be quickly outweighed by legal consequences.
Generative artificial intelligence (GenAI) fits that bill, presenting seemingly limitless opportunities but also significant legal risk. Companies should conduct a GenAI legal compliance assessment before launching a GenAI program, particularly with respect to any data retrieval processes used by the model.
GenAI models are trained on terabytes of data and rely primarily on web scraping to retrieve the vast amounts of data they need. Many web scraping companies assume that if data is publicly available, it is fair game, but that assumption is flawed and easily challenged. Unauthorized web scraping can lead to claims of copyright infringement, breach of contract, violation of privacy rights, and violation of the Computer Fraud and Abuse Act (CFAA), to name a few.
- Copyright Infringement. Content or databases of information available on or through websites may be protected by copyright, even if the content or database is not housed behind a paywall. The Copyright Act gives a copyright owner the exclusive rights to reproduce, distribute, and prepare derivative works of the copyrighted work (among others). Without a license from the copyright owner, using a GenAI program to scrape copyright-protected content (e.g., a news article, poem, or art) from a website and make further use of it could lead to a claim of copyright infringement.
- Breach of Contract. Websites typically include legal terms that website operators seek to enforce against users of their site. To protect their rights in the content and data available on the site, operators often include language in these agreements that explicitly prohibits web scraping or the use of similar technologies. To the extent these agreements are deemed enforceable, a user who employs a GenAI program to retrieve data from the site may be subject to a claim of breach of contract.
- Violation of Privacy Rights. As more U.S. states pass privacy laws that expand users' rights in their personal data, data scraping for GenAI models has become an increasing concern. Where personal information is obtained, federal and state laws and regulations may require notification, consent, and the ability to opt out of collection or use of the data, depending on the age of the person from whom the data was collected, the type of data collected, and where the data was collected from. If those laws and regulations were not followed when the data was initially collected, and the data is later published on a site, legal liability may result and could extend to the entity using the GenAI model.
- Computer Fraud and Abuse Act. Data scraping under the CFAA is an evolving legal issue that the courts continue to assess. While recent case law has made the CFAA largely inapplicable to data that is accessible to the public, scraping data maintained behind authentication or paywalls (i.e., where users either log in or pay to access the data) may trigger liability under the CFAA, because technical barriers were erected to prevent unauthorized access.
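For teams that want to build this assessment into a data retrieval process, the four risk areas above can be sketched as a simple pre-scrape screening step. The Python below is a minimal illustration only: `SourceProfile`, its field names, and `flag_risks` are hypothetical names made up for this sketch, and the answers recorded in each field would need to come from a human review of the site and its terms. The output flags issues for attorney review; it does not determine whether liability exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceProfile:
    """Facts about a candidate data source, recorded before any scraping.

    All field names are hypothetical and exist only for this sketch.
    """
    has_copyrighted_content: bool   # e.g., news articles, poems, art
    has_license: bool               # license from the copyright owner
    terms_prohibit_scraping: bool   # site terms bar scraping or similar tools
    contains_personal_data: bool    # personal information about individuals
    behind_login_or_paywall: bool   # users must log in or pay for access


def flag_risks(source: SourceProfile) -> list[str]:
    """Return the risk areas from the list above that merit legal review."""
    risks = []
    if source.has_copyrighted_content and not source.has_license:
        risks.append("copyright infringement")
    if source.terms_prohibit_scraping:
        risks.append("breach of contract")
    if source.contains_personal_data:
        risks.append("violation of privacy rights")
    if source.behind_login_or_paywall:
        risks.append("CFAA")
    return risks


# Example: a public news site with no license whose terms prohibit scraping.
news_site = SourceProfile(
    has_copyrighted_content=True,
    has_license=False,
    terms_prohibit_scraping=True,
    contains_personal_data=False,
    behind_login_or_paywall=False,
)
print(flag_risks(news_site))
```

An empty result means only that none of these four screening questions was answered "yes"; as noted above, publicly available data is not automatically fair game.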
The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.