Can LLMs be effectively trained without using copyrighted data?

In principle, LLMs can be trained without copyrighted data, but current models still memorize and reproduce copyrighted text. New methods show promise for compliance.

Direct answer

In theory, yes: it is possible to train large language models (LLMs) without using copyrighted data. In practice, current models still memorize and reproduce copyrighted text. A 2024 study found that popular LLMs like GPT-4 and Alpaca produce fewer than 1% potentially copyright-infringing outputs in realistic use, while other models produce far more [3]. A theoretical method from 2024 shows how to modify the core training math (softmax regression) so that the model is prevented from generating copyrighted data, though it has not yet been implemented in practice [1].

5 sources cited

This article was generated with WisPaper-powered search and paper analysis.

Do current LLMs actually reproduce copyrighted data?

Yes, and the amount varies widely by model. A 2024 systematic analysis tested instruction-finetuned LLMs in a realistic end-user scenario, using a 160-character threshold (borrowed from German copyright law) to flag potential infringements [3]. The best performers (GPT-4, GPT-3.5, Alpaca, and Luminous) produced very few violations, with Alpaca and Luminous showing the lowest absolute numbers. Other models generated far more potentially copyrighted text. So even without training aimed specifically at copyright compliance, some models are already fairly compliant, but none are perfectly safe.
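
The 160-character test described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline: the function names, the use of exact (verbatim) matching, and the choice to flag overlaps of 160 characters or more are all assumptions, and the study may normalize or match text differently [3].

```python
from difflib import SequenceMatcher

# Threshold reported in the study [3]; treating ">=" as the cutoff is an assumption.
THRESHOLD_CHARS = 160


def longest_verbatim_overlap(output: str, protected: str) -> int:
    """Return the length of the longest substring shared by both texts."""
    # autojunk=False disables difflib's heuristic that skips very frequent characters.
    matcher = SequenceMatcher(None, output, protected, autojunk=False)
    match = matcher.find_longest_match(0, len(output), 0, len(protected))
    return match.size


def is_potential_violation(output: str, protected: str,
                           threshold: int = THRESHOLD_CHARS) -> bool:
    """Flag a model output that copies at least `threshold` characters verbatim."""
    return longest_verbatim_overlap(output, protected) >= threshold


if __name__ == "__main__":
    # Hypothetical texts, used only to exercise the check.
    protected_text = " ".join(f"word{i}" for i in range(60))
    model_output = "Here is the passage: " + protected_text[:200]
    print(is_potential_violation(model_output, protected_text))
```

A character-count threshold like this only catches verbatim copying; paraphrased reproduction would need a different kind of comparison.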

Can we train LLMs to avoid copyrighted data from the start?

A 2024 paper provides a mathematical proof that it is possible in principle. The researchers showed that LLM training can be viewed as a softmax regression problem, and they designed a modified version of that regression that explicitly prevents the model from outputting copyrighted data [1]. The method is theoretical and has not been implemented in a real LLM yet, but it demonstrates that the core optimization process can be altered to block copyright reproduction at the mathematical level, not just with a filter applied after training.
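
To make the idea of changing the objective itself more concrete, here is a toy Python sketch. It is not the formulation from [1]: the inverse-distance penalty, the `gamma` and `eps` parameters, and all function names are assumptions chosen purely for illustration. It compares an ordinary softmax regression loss with a variant that becomes expensive whenever the model's output distribution lands close to a protected target.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(A, x):
    """f(x) = softmax(A x), with the matrix A given as a list of rows."""
    return softmax([sum(a * xi for a, xi in zip(row, x)) for row in A])


def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def fit_loss(A, x, b):
    """Ordinary softmax regression loss: 0.5 * ||softmax(A x) - b||^2."""
    return 0.5 * sq_dist(predict(A, x), b)


def guarded_loss(A, x, b, b_protected, gamma=0.01, eps=1e-9):
    """Hypothetical variant: fit b, but pay a penalty for landing near b_protected."""
    closeness = 0.5 * sq_dist(predict(A, x), b_protected)
    return fit_loss(A, x, b) + gamma / (closeness + eps)


if __name__ == "__main__":
    A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    b = [0.7, 0.2, 0.1]            # allowed target distribution
    b_protected = [0.1, 0.2, 0.7]  # stand-in for a protected output
    x_allowed = [math.log(v) for v in b]
    x_copying = [math.log(v) for v in b_protected]
    print(guarded_loss(A, x_allowed, b, b_protected))
    print(guarded_loss(A, x_copying, b, b_protected))
```

With these toy inputs, parameters that reproduce the protected distribution incur a far larger loss than parameters that fit the allowed target, so an optimizer minimizing the guarded objective would be steered away from the protected output during training instead of relying on a filter afterwards.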

Is training on copyrighted data even illegal?

Not necessarily; the law is still catching up. A 2025 legal analysis concluded that the act of training an AI model does not itself constitute 'use' of a copyrighted work under Russian law, because training does not reproduce the protected expression or give a human perceptible access to it [2]. However, the same analysis notes that EU, US, and Japanese laws vary, and many jurisdictions are creating exceptions for text and data mining. The key legal issue is not training itself, but whether the model can later reproduce copyrighted content. So even where training on copyrighted data is legal, a model that memorizes and outputs that data can still violate copyright.

Sources used in this answer

1

How to Protect Copyright Data in Optimization of Large Language Models?

Proposes a theoretical method to modify softmax regression during LLM training to prevent the model from generating copyrighted data, though not yet implemented.

2

Reproducing or data mining: The copyright law dilemma of AI training

Concludes that AI model training does not legally constitute 'use' of copyrighted works under Russian law, but recommends exceptions for temporary copies needed for text and data mining.

3

LLMs and Memorization: On Quality and Specificity of Copyright Compliance

Found that among popular LLMs, Alpaca, GPT-4, GPT-3.5, and Luminous produce the fewest potential copyright violations when tested with a 160-character threshold in realistic scenarios.

4

MCP-enabled LLM for meta-optics inverse design: leveraging differentiable solver without LLM expertise

Demonstrates a framework (MCP) that lets LLMs access specialized code templates for meta-optics design, achieving high success rates with structured prompting. The work is unrelated to copyright but shows that LLMs can be guided to avoid certain outputs.

5

Few-shot training LLMs for project-specific code-summarization

Shows that few-shot training with GPT-3 Codex significantly improves code summarization using project-specific data, indicating LLMs can learn from very limited non-copyrighted examples.