The use of generative AI models like ChatGPT, DALL-E, or Stable Diffusion has increased tremendously in recent years. Based on user instructions, these models can generate creative content such as text, images, or music. They can do so because they have “learned” from large datasets how to create such content. A significant portion of these datasets is protected by copyright, which leads to substantial legal challenges.
Technological Foundations
Generative AI models are based on machine learning, particularly deep artificial neural networks (ANNs), which are trained to recognize complex patterns in large datasets. These models use learning processes such as supervised, unsupervised, and reinforcement learning to enhance their capabilities. A key aspect is the pre-training and fine-tuning of the models: The base model is initially trained on a general dataset (pre-training) and then adapted to more specific tasks or styles (fine-tuning). This allows the models to be used flexibly for various applications.
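The two-stage idea of pre-training and fine-tuning can be illustrated with a deliberately tiny sketch. It is not a neural network: it assumes a simple word-pair counting model, and both corpora are invented for illustration. What it shows is only the principle that the same model is first shaped by general data and then shifted by a narrower dataset.

```python
from collections import Counter, defaultdict


def train(model, text):
    """Update word-pair counts in place: how often each word follows another."""
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word)
    return followers.most_common(1)[0][0] if followers else None


# Hypothetical corpora, invented for illustration only.
general_corpus = (
    "the court heard the case . the court heard the witness . "
    "the court heard the case"
)
legal_corpus = (
    "the court dismissed the claim . the court dismissed the appeal . "
    "the court dismissed the claim . the court dismissed the claim"
)

model = defaultdict(Counter)

train(model, general_corpus)            # pre-training on general data
before = predict_next(model, "court")

train(model, legal_corpus)              # fine-tuning: same model, narrower data
after = predict_next(model, "court")

print(before, "->", after)              # heard -> dismissed
```

The model object is never replaced between the two stages. Fine-tuning only adds to what pre-training already left in it, which is why traces of both datasets remain in the final model.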
Copyright Aspects
According to a recent study by Dornis and Stober, numerous copyright-relevant actions occur during the training of generative AI models. These include:
- Collection, preparation, and storage of training data: Copyrighted works are reproduced when the corpora that serve as the basis for AI training are created.
- Training of generative AI models: During the training process, especially during pre-training and fine-tuning, reproductions of the works occur “inside” the model. Even if the data is not explicitly stored, it can still be memorized by the model, which counts as reproduction under copyright law.
- Use of generative AI models: Users of generative AI systems produce new content through the models, and this content may in turn be based on the protected training data. Where it is, this constitutes a use of copyrighted works.
- Public accessibility: When generative AI models are made available for use, either through user applications or as downloads, the works that were used for training and reproduced within the model are made available to the public.
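The point about memorization can be made concrete with a toy example. The sketch below is an assumption-laden simplification, not the mechanism of any real generative model: it uses a small lookup model that stores only “context -> next word” statistics, and the “protected work” is an invented sentence. It shows how a model that never stores the text as such can still return it verbatim.

```python
from collections import defaultdict

ORDER = 3  # context length in words; an arbitrary choice for this sketch


def fit(text, order=ORDER):
    """Store only 'context -> next word' counts, not the text itself."""
    words = text.split()
    table = defaultdict(dict)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        nxt = words[i + order]
        table[context][nxt] = table[context].get(nxt, 0) + 1
    return table


def generate(table, prompt, max_words=50, order=ORDER):
    """Greedily continue the prompt with the most frequent next word."""
    out = prompt.split()
    for _ in range(max_words):
        options = table.get(tuple(out[-order:]))
        if not options:
            break
        out.append(max(options, key=options.get))
    return " ".join(out)


# Hypothetical "protected work", invented for illustration.
work = (
    "It was a quiet morning in the harbour town when "
    "the first ship of the season arrived"
)

table = fit(work)
regenerated = generate(table, "It was a")

print(regenerated == work)  # True: the work is reproduced word for word
```

Large neural models differ from this sketch in many ways, and whether and how much they memorize depends on the model and the data. The example only illustrates why “not explicitly stored” and “not reproducible” are not the same thing.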
Legal Limitations and Challenges
The current copyright limitations only cover the actions involved in the training of generative AI models in a few, often practically irrelevant cases. The study particularly emphasizes that the limitation for text and data mining (TDM) does not apply. Generative AI models utilize the training data more comprehensively than TDM, as they use not only semantic but also syntactic information and represent these in a vector space. Thus, according to the study, there is a comprehensive reproduction of content that goes beyond what TDM would cover.
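The difference between using only the content of a text and also using its structure can be sketched as follows. The mapping is an assumption made for illustration, not the study's definition: word counts stand in for “semantic” information, and word order stands in for “syntactic” information. Both sentences are invented.

```python
from collections import Counter


def bag_of_words(text):
    """Order-free vector: which words occur, and how often."""
    return Counter(text.lower().split())


def bigram_vector(text):
    """Order-aware vector: which word pairs occur in sequence."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


a = "the publisher sued the developer"
b = "the developer sued the publisher"

same_words = bag_of_words(a) == bag_of_words(b)
same_order = bigram_vector(a) == bigram_vector(b)

print(same_words, same_order)  # True False
```

An analysis that keeps only the first kind of vector cannot tell the two sentences apart. A representation that also keeps the second kind retains how the text was actually written, which is the sort of information the study regards as going beyond TDM.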
DSM Directive
The DSM Directive (Directive (EU) 2019/790), which forms the European legal basis for the TDM limitations, was not designed with creative and productive AI systems in mind and therefore, according to the study, does not cover them. Similarly, the EU AI Regulation (AI Act) does not take these specific differences into account, leading to legal gray areas.
Relevant Copyright Limitations and Their Application
German copyright law contains various limitation provisions that allow the use of copyrighted works under certain conditions. In the context of training generative AI models, the following limitations are particularly relevant:
- § 44a UrhG – Temporary Acts of Reproduction: This limitation allows temporary reproductions that are transient or incidental and form an integral and essential part of a technical process, provided they have no independent economic significance. According to the study by Dornis and Stober, this limitation applies to the training of AI models only to a limited extent, as the reproductions are not merely transient but often of a more long-term nature and go beyond what is technically necessary.
- § 60d UrhG – Text and Data Mining (TDM) for Scientific Research: § 60d UrhG allows reproductions of works for the purpose of text and data mining in non-commercial scientific research. However, this limitation is barely relevant for generative AI models, since the commercial use of such models is not covered by § 60d. The study also highlights that generative models do not only extract semantic information but also utilize syntactic structures, which goes beyond the scope of TDM.
- §§ 60a to 60c UrhG – Uses for Teaching, Science, and Institutions: These limitations allow certain uses of copyrighted works for educational and scientific purposes. However, they are limited to non-commercial contexts and do not directly affect the training of generative AI models, as most models are also used commercially.
- § 44b UrhG – Text and Data Mining (General Limitation): § 44b UrhG permits reproductions of lawfully accessible works for text and data mining, including for commercial purposes, unless the rightsholder has reserved this use (for works accessible online, the reservation must be machine-readable). The copies must be deleted once they are no longer needed for the mining. The study assesses this limitation as particularly relevant, but only partially applicable to generative AI models. The main reason is that the reproductions that occur during AI training are often not just temporary but remain permanently stored within the model, thus going beyond the scope of § 44b. The models often memorize the structure and content of the training data, which amounts to long-term use rather than a fleeting technical necessity.
As a result, many of these reproductions fall into legal gray areas or even clearly outside the statutory limitations, which creates significant legal uncertainty.
Applicable Law and International Jurisdiction
The study emphasizes that making AI models publicly accessible for use by German users—e.g., through the OpenAI website for ChatGPT—may trigger the application of German copyright law and the jurisdiction of German courts. Since the training data is protected by copyright and reproduced “inside” the models, this constitutes a relevant use under copyright law.
Conclusion and Outlook
The use of generative AI models brings significant legal uncertainties, particularly with regard to copyright infringements during the training and application of these models. The study, which addresses the German-speaking legal sphere, shows that the current legal framework is insufficient to adequately address the challenges posed by rapid technological development.
Copyright in the training of AI with third-party data is the dominant issue and currently accounts for the majority of inquiries I receive. It is expected that this issue will intensify in the coming years, making clear legal rules urgently necessary, both to protect the rights of copyright holders and to promote innovation in the field of AI.
The authors specifically conclude that the current copyright limitations, particularly § 44b UrhG, are not sufficient to justify the extensive reproductions and uses of copyrighted works by generative AI models. While some limitations in German copyright law, such as § 44a UrhG and § 60d UrhG, allow for short-term and specific uses, the specific requirements and long-term storage of the models remain unaddressed by these provisions.