Can LLMs be effectively trained without using copyrighted data?

Do current LLMs actually reproduce copyrighted data?

Yes, and the amount varies widely by model. A 2024 systematic analysis tested instruction-finetuned LLMs in a realistic end-user scenario, using a 160-character threshold (borrowed from German copyright law) to flag potential infringements [3]. The best performers (GPT-4, GPT-3.5, Alpaca, and Luminous) produced the fewest potential violations, with Alpaca and Luminous showing the lowest absolute numbers. Other models generated far more potentially copyrighted text. So even without copyright-specific training some models are already fairly compliant, but none is perfectly safe.
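As a rough illustration of a character-threshold check like the one described above, the sketch below flags a model output when it shares a sufficiently long verbatim span with a reference text. Only the 160-character figure comes from the study; the function names, the use of Python's `difflib`, and the choice to treat "at least 160 characters" as the cutoff are assumptions made for this example, and the study's actual matching procedure may differ.

```python
from difflib import SequenceMatcher

# Threshold named in the study; treating it as "at least 160" is an assumption.
THRESHOLD_CHARS = 160


def longest_shared_span(output: str, reference: str) -> str:
    """Return the longest contiguous substring that appears in both texts."""
    # autojunk=False disables difflib's heuristic that ignores frequent
    # characters, so the result is a true longest common substring.
    matcher = SequenceMatcher(None, output, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(output), 0, len(reference))
    return output[match.a : match.a + match.size]


def is_potential_violation(
    output: str, reference: str, threshold: int = THRESHOLD_CHARS
) -> bool:
    """Flag the output if it reproduces `threshold` or more characters verbatim."""
    return len(longest_shared_span(output, reference)) >= threshold
```

A check like this only catches verbatim reproduction against texts you already hold; paraphrases and works outside the reference set would pass unnoticed.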

Can we train LLMs to avoid copyrighted data from the start?

A 2024 paper argues, with a mathematical proof, that it is possible in principle. The researchers show that LLM training can be viewed as a softmax regression problem, and they design a modified version of that regression that explicitly prevents the model from outputting copyrighted data [1]. The method is theoretical and has not yet been implemented in a real LLM, but it demonstrates that the core optimization process can be altered to block copyright reproduction at the mathematical level, rather than only through a filter applied after training.
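To make the idea of changing the objective itself more concrete, here is a toy sketch. It is not the paper's construction: the hinge-style penalty, the parameters `gamma` and `tau`, the finite-difference gradient, and the three-token example are all assumptions chosen for readability. It fits a tiny softmax regression f(x) = softmax(Ax) to a target distribution, and optionally adds a penalty whenever the prediction comes within `tau` of a protected distribution.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(A, x):
    """f(x) = softmax(A x) for a matrix A given as a list of rows."""
    return softmax([sum(a * xi for a, xi in zip(row, x)) for row in A])


def half_sq_dist(p, q):
    return 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def objective(A, x, b, c, gamma, tau):
    """Fit term plus a hinge penalty (an assumption) near the protected c."""
    f = predict(A, x)
    penalty = max(0.0, tau - half_sq_dist(f, c))
    return half_sq_dist(f, b) + gamma * penalty


def numeric_grad(fn, x, eps=1e-6):
    """Central finite differences; fine for a toy, too slow for real models."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad


def fit(A, b, c, gamma, tau, lr=0.5, steps=2000):
    """Plain gradient descent from x = 0."""
    x = [0.0] * len(A[0])
    for _ in range(steps):
        g = numeric_grad(lambda v: objective(A, v, b, c, gamma, tau), x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


# Worst case for illustration: the training target IS the protected distribution.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
protected = [0.8, 0.1, 0.1]
target = protected

plain = predict(A, fit(A, target, protected, gamma=0.0, tau=0.05))
guarded = predict(A, fit(A, target, protected, gamma=5.0, tau=0.05))
print("plain distance to protected:  ", half_sq_dist(plain, protected))
print("guarded distance to protected:", half_sq_dist(guarded, protected))
```

With `gamma=0.0` the model simply reproduces the protected distribution, while with the penalty switched on the optimizer is pushed back whenever the prediction gets closer than roughly `tau`. The trade-off is visible in the toy as well: the guarded model also fits the training target less exactly.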

Is training on copyrighted data even illegal?

Not necessarily; the law is still catching up. A 2025 legal analysis concluded that the act of training an AI model does not itself constitute 'use' of a copyrighted work under Russian law, because training does not reproduce the protected expression or give a human perceptible access to it [2]. However, the same analysis notes that EU, US, and Japanese laws vary, and many jurisdictions are creating exceptions for text and data mining. The key legal issue is not training itself, but whether the model can later reproduce copyrighted content. This means that even if training on copyrighted data is legal, models that memorize and output it can still violate copyright.

Sources used in this answer

How to Protect Copyright Data in Optimization of Large Language Models?

Proposes a theoretical method to modify softmax regression during LLM training to prevent the model from generating copyrighted data, though not yet implemented.

2024 · Timothy Chu, Zhao Song, Chiwun Yang · AAAI

Reproducing or data mining: The copyright law dilemma of AI training

Concludes that AI model training does not legally constitute 'use' of copyrighted works under Russian law, but recommends exceptions for temporary copies needed for text and data mining.

2025 · A. A. Nikiforov · Digital Law Journal

LLMs and Memorization: On Quality and Specificity of Copyright Compliance

Found that among popular LLMs, Alpaca, GPT-4, GPT-3.5, and Luminous produce the fewest potential copyright violations when tested with a 160-character threshold in realistic scenarios.

2024 · Felix B. Mueller, Rebekka Görge, Anna K. Bernzen, Janna C. Pirk, Maximilian Poretschkin · Proceedings of the AAAI/ACM Conference on AI Ethics and Society

MCP-enabled LLM for meta-optics inverse design: leveraging differentiable solver without LLM expertise

Demonstrates a framework (MCP) that lets LLMs access specialized code templates for meta-optics design, achieving high success rates with structured prompting. The work is unrelated to copyright, but it shows that LLMs can be guided to avoid certain outputs.

2025 · Yi Huang, Bowen Zheng, Yunxi Dong, Hong Tang, Huan Zhao, Rakibul Hasan Shawon, Sensong An, Hualiang Zhang · Nanophotonics (Berlin, Germany)

Few-shot training LLMs for project-specific code-summarization

Shows that few-shot training with GPT-3 Codex significantly improves code summarization using project-specific data, indicating LLMs can learn from very limited non-copyrighted examples.

2022 · Toufique Ahmed, Premkumar T. Devanbu · ASE
