Do large language models truly understand code or just memorize patterns?

How much do LLMs just copy code they've seen before?

A lot more than you might think. Researchers extracted 20,000 outputs from a code model (each 512 tokens long) and found over 40,125 code snippets that were memorized verbatim from the training data [4]. That averages out to roughly two memorized snippets per output (40,125 / 20,000 ≈ 2.0). The study also built a taxonomy of what gets copied, ranging from API usage patterns to entire functions, and found that larger models memorize more than smaller ones and that longer outputs increase the risk. Crucially, the more often a code snippet appeared in the training data, the more likely it was to appear in generated outputs, suggesting that deduplicating training data could reduce memorization [4].

This matters because memorized code can contain security vulnerabilities, sensitive information, or code with restrictive licenses. If you're using an LLM to generate code for a commercial product, you could inadvertently incorporate copyrighted or buggy code. The study recommends using metrics to detect memorization and removing duplicates from training sets to mitigate the problem [4].
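To make the two mitigations concrete, here is a minimal sketch of flagging verbatim overlap and dropping duplicate training files. It is not the study's method: whitespace tokenization, the window size `n`, and the helper names `build_index`, `memorized_spans`, and `deduplicate` are all assumptions chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-token window as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(training_files, n):
    """Count how often each n-gram occurs across the training files.

    Assumption: whitespace splitting stands in for a real code tokenizer.
    """
    index = Counter()
    for text in training_files:
        index.update(ngrams(text.split(), n))
    return index


def memorized_spans(output, index, n):
    """Return (span text, training count) for each output window seen in training."""
    hits = []
    for gram in ngrams(output.split(), n):
        if gram in index:
            hits.append((" ".join(gram), index[gram]))
    return hits


def deduplicate(training_files):
    """Drop exact duplicates (after whitespace normalization), keeping the first copy."""
    seen, unique = set(), []
    for text in training_files:
        key = " ".join(text.split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


if __name__ == "__main__":
    training = ["def add(a, b): return a + b", "def add(a, b): return a + b", "x = 1"]
    index = build_index(training, n=4)
    for span, count in memorized_spans("print(1)\ndef add(a, b): return a + b", index, n=4):
        print(f"seen {count}x in training: {span}")
    print(len(deduplicate(training)), "unique training files")
```

The training count attached to each hit mirrors the finding that frequently repeated snippets are the ones most likely to resurface. A real pipeline would also need near-duplicate detection, which this exact-match sketch does not attempt.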

What evidence shows that LLMs actually understand code?

The strongest evidence comes from tasks that require planning and domain-specific reasoning. When LLMs are prompted to first plan out solution steps before writing code, their Pass@1 (the percentage of problems solved correctly on the first try) improves by up to 25.4% relative to generating code directly [1]. This planning phase is more than regurgitating a pattern: it involves decomposing a complex intent into logical steps, which is commonly taken as a sign of understanding. The same study found that human evaluators rated the planned code higher in correctness, readability, and robustness [1].
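The plan-then-code idea can be sketched as two model calls, with the plan from the first call fed into the second. This is an illustration, not the paper's implementation: the prompt wording is invented, and `complete` is a hypothetical callable that stands in for whatever text-completion client you use.

```python
from typing import Callable, Tuple


def self_planning_generate(intent: str, complete: Callable[[str], str]) -> Tuple[str, str]:
    """Generate code in two phases: plan first, then implement the plan.

    `complete` is a hypothetical function mapping a prompt to model text.
    The prompt templates below are assumptions, not the study's prompts.
    """
    # Phase 1: ask only for a numbered list of solution steps.
    plan_prompt = (
        "Break the following task into short numbered steps. "
        "Do not write code yet.\n\n"
        f"Task: {intent}\nSteps:"
    )
    plan = complete(plan_prompt).strip()

    # Phase 2: ask for code that follows the plan step by step.
    code_prompt = (
        f"Task: {intent}\n\n"
        f"Plan:\n{plan}\n\n"
        "Write code that implements the plan step by step."
    )
    code = complete(code_prompt)
    return plan, code
```

Passing the model call in as a parameter keeps the sketch independent of any vendor API and makes it easy to compare against a single direct-generation call on the same tasks.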

Further evidence comes from specialized domains like bioinformatics. On the BioCoder benchmark, GPT-4 achieved about 50% Pass@K (the percentage of problems solved within K attempts), while smaller models topped out at 25% [2]. The key insight: successful models needed both a long context window (over 2,600 tokens) to understand cross-file dependencies and domain-specific knowledge of bioinformatics algorithms. General coding ability alone wasn't enough; the models had to grasp the biological logic behind the code [2]. This suggests they're not just matching surface patterns but reasoning about the problem domain.
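For readers who want to see how a Pass@K figure is typically computed, the sketch below uses the unbiased estimator that is widely used in code-generation evaluation: draw n samples per problem, count the c that pass the tests, and compute 1 - C(n-c, k) / C(n, k). Whether the studies cited here used exactly this estimator is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: samples generated for the problem
    c: samples among those n that passed the tests
    k: attempt budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failing samples means every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical results: one problem half-solved, one never solved.
    print(benchmark_pass_at_k([(10, 5), (10, 0)], k=1))
```

With k = 1 the estimator reduces to the plain fraction of passing samples, which is why Pass@1 reads as "solved on the first try".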

Another study built an IDE plugin that lets developers ask an LLM to explain code, describe API calls, or define domain terms without writing prompts. In a user study with 32 participants, using the plugin helped complete code-understanding tasks faster than web search [5]. The LLM could explain what a piece of code does, which arguably takes more than pattern matching, since it has to infer intent and map code to concepts.

When does memorization look like understanding—and why does it matter?

The line blurs when the task is similar to something in the training data. A benchmark called SWE-QA-Pro was specifically designed to prevent LLMs from "cheating" via memorization by using long-tail, obscure repositories that the models are unlikely to have seen [3]. When tested on these novel codebases, direct answering (which relies on memorized knowledge) performed poorly: Claude Sonnet 4.5 scored about 13 points lower on direct answers than on agentic workflows that required exploring the codebase [3]. This gap suggests that when memorized knowledge is unavailable, answering straight from the model's weights falls well short, and that some of what looks like understanding on familiar code may be recall.

The practical takeaway: if you're working on a common problem (like sorting an array or querying a database), the LLM will likely produce correct code through memorization of patterns it has seen thousands of times. But if you're dealing with a novel library, an unusual algorithm, or a private codebase, the model's understanding is much weaker. The SWE-QA-Pro study showed that even small open models (Qwen3-8B) could be trained to surpass GPT-4o by 2.3 points on this benchmark when given agentic training that teaches them to explore code rather than rely on memory [3]. This suggests that genuine understanding can be improved, but it's not the default behavior.

Sources used in this answer

Self-Planning Code Generation with Large Language Models

Self-planning code generation improved Pass@1 by up to 25.4% over direct generation, showing LLMs can decompose complex intents into logical steps before coding [1].

2024 · Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, Wenpin Jiao · ACM Trans. Softw. Eng. Methodol.

BioCoder: a benchmark for bioinformatics code generation with large language models

On the BioCoder benchmark, GPT-4 achieved ~50% Pass@K versus ~25% for smaller models, suggesting that domain-specific knowledge beyond pattern matching is required [2].

2024 · Xiangru Tang, Bill Qian, Rick Gao, Jiakang Chen, Xinyun Chen, Mark B Gerstein · Bioinformatics (Oxford, England)

SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

SWE-QA-Pro found a ~13-point gap between direct answering and agentic workflows for Claude Sonnet 4.5 on unfamiliar repositories, suggesting that direct answers lean heavily on memorized knowledge [3].

2026 · Songcheng Cai, Zhiheng Lyu, Yuansheng Ni, Xiangchao Chen, Baichuan Zhou, Shenzhe Zhu, Yi Lu, Haozhe Wang, Chi Ruan, Benjamin Schneider, Weixu Zhang, Xiang Li, Andy Zheng, Yuyu Zhang, Ping Nie, Wenhu Chen · arXiv (Cornell University)

Unveiling Memorization in Code Models

Extracting 20,000 outputs from a code model yielded over 40,125 memorized snippets; larger models and longer outputs increased memorization rates [4].

2024 · Zhou Yang, Zhipeng Zhao, Chenyu Wang, Jieke Shi, Dongsun Kim, DongGyun Han, David Lo · ICSE

Using an LLM to Help With Code Understanding

An LLM-based IDE plugin helped 32 participants understand code faster than web search, demonstrating practical understanding of code intent and API usage [5].

2024 · Daye Nam, Andrew Macvean, Vincent J. Hellendoorn, Bogdan Vasilescu, Brad A. Myers · ICSE
