Do current LLMs actually reproduce copyrighted data?
Yes, and the amount varies widely by model. A 2024 systematic analysis tested instruction-finetuned LLMs in a realistic end-user scenario, using a 160-character threshold (borrowed from German copyright law) to flag potential infringements [3]. The best performers (GPT-4, GPT-3.5, Alpaca, and Luminous) produced very few potential violations, with Alpaca and Luminous showing the lowest absolute numbers. Other models reproduced far more potentially copyrighted text. Under this test, some current models are already fairly compliant, but none is perfectly safe.
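To make the 160-character threshold concrete, here is a minimal Python sketch of how such a check could work. It is a simplification, not the procedure from [3]: it looks only for the longest exact run of characters shared between a model output and a protected reference text, and it assumes that a run of at least 160 characters counts as a flag. The function names are hypothetical.

```python
from difflib import SequenceMatcher

# Threshold reported in [3]; treating it as inclusive (>=) is an assumption.
THRESHOLD = 160


def longest_shared_run(output: str, reference: str) -> int:
    """Return the length of the longest exact substring shared by both texts."""
    # autojunk=False disables difflib's heuristic that ignores frequent
    # characters in long inputs, which would otherwise hide real matches.
    matcher = SequenceMatcher(None, output, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(output), 0, len(reference))
    return match.size


def is_potential_violation(output: str, reference: str,
                           threshold: int = THRESHOLD) -> bool:
    """Flag an output that shares a run of `threshold` or more characters."""
    return longest_shared_run(output, reference) >= threshold
```

A check like this only flags candidates for review. A short quotation stays below the threshold, while a long verbatim passage exceeds it, and a lightly paraphrased passage would slip past an exact-match test entirely.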
Can we train LLMs to avoid copyrighted data from the start?
A 2024 paper shows, in theory, that it is possible. The researchers observe that LLM training can be viewed as a softmax regression problem, and they design a modified version of that regression that explicitly prevents the model from outputting copyrighted data [1]. The method is theoretical and has not yet been implemented in a real LLM, but it demonstrates that the core optimization process can be altered to block copyright reproduction at the mathematical level, not just through a filter applied after training.
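The general idea of changing the objective itself can be illustrated with a small Python sketch. This is not the formulation from [1]: the penalty term, the weight `gamma`, and all function names are assumptions chosen for illustration. The sketch fits a softmax regression to permitted targets while adding a cost that grows as the prediction on copyrighted inputs approaches the copyrighted target.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(A, x):
    """Softmax regression output: softmax(A x), with A given as a list of rows."""
    return softmax([sum(a * xi for a, xi in zip(row, x)) for row in A])


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def copyright_aware_loss(x, A_ok, b_ok, A_c, b_c, gamma=0.1, eps=1e-8):
    """Illustrative objective (an assumption, not the loss from [1]).

    fit:       how far the prediction is from the permitted target b_ok
    closeness: how far the prediction is from the copyrighted target b_c
    The gamma / closeness term becomes large when the model reproduces b_c.
    """
    fit = sq_dist(predict(A_ok, x), b_ok)
    closeness = sq_dist(predict(A_c, x), b_c)
    return fit + gamma / (closeness + eps)
```

Minimizing an objective of this shape trades off accuracy on permitted data against distance from the protected targets, which is the sense in which the constraint lives inside the optimization instead of in a post-hoc filter.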
Is training on copyrighted data even illegal?
Not necessarily; the law is still catching up. A 2025 legal analysis concluded that the act of training an AI model does not itself constitute 'use' of a copyrighted work under Russian law, because training neither reproduces the protected expression nor makes it perceptible to a human [2]. However, the same analysis notes that the law differs across the EU, the US, and Japan, and that many jurisdictions are creating exceptions for text and data mining. The key legal issue is not training itself, but whether the model can later reproduce copyrighted content. Even where training on copyrighted data is legal, a model that memorizes and outputs that data can still violate copyright.
Sources used in this answer
[1] How to Protect Copyright Data in Optimization of Large Language Models?
Proposes a theoretical method to modify softmax regression during LLM training to prevent the model from generating copyrighted data, though not yet implemented.
[2] Reproducing or data mining: The copyright law dilemma of AI training
Concludes that AI model training does not legally constitute 'use' of copyrighted works under Russian law, but recommends exceptions for temporary copies needed for text and data mining.
[3] LLMs and Memorization: On Quality and Specificity of Copyright Compliance
Found that among popular LLMs, Alpaca, GPT-4, GPT-3.5, and Luminous produce the fewest potential copyright violations when tested with a 160-character threshold in realistic scenarios.
[4] MCP-enabled LLM for meta-optics inverse design: leveraging differentiable solver without LLM expertise.
Demonstrates a framework (MCP) that lets LLMs access specialized code templates for meta-optics design, achieving high success rates with structured prompting; unrelated to copyright but shows LLMs can be guided to avoid certain outputs.
[5] Few-shot training LLMs for project-specific code-summarization
Shows that few-shot training with GPT-3 Codex significantly improves code summarization using project-specific data, indicating LLMs can learn from very limited non-copyrighted examples.
