This paper introduces Synthetic Mixed Training, a novel data augmentation strategy that enables LLMs to effectively internalize domain-specific knowledge during fine-tuning. By combining synthetic Question-Answer (QA) pairs with synthetic documents, the method breaks the performance plateau of traditional synthetic data scaling, ultimately outperforming Retrieval-Augmented Generation (RAG) by 2.6% to 4.4% across multiple benchmarks.
TL;DR
For years, Retrieval-Augmented Generation (RAG) has been the "gold standard" for providing LLMs with new knowledge, as fine-tuning often plateaus or fails to internalize facts reliably. This paper introduces Synthetic Mixed Training, a recipe that finally breaks the RAG ceiling. By mixing synthetic QAs with "Focal Rewriting" documents, the authors demonstrate that LLMs can internalize knowledge with a log-linear scaling curve, outperforming RAG by up to 4.4% on 8B-parameter models.
Background: The Parametric Knowledge Bottleneck
Language models often lack specific facts from niche domains or recent events. While RAG bridges this gap by fetching context at inference time, internalizing this knowledge into model parameters (Parametric Knowledge) is preferred for speed and deeper reasoning. However, "naive" synthetic data—simply rephrasing documents or generating QAs—quickly hits a performance wall. Even using a 70B "Teacher" model to generate data for an 8B "Student" has historically shown diminishing returns, leaving RAG as the undisputed champion.
Why Naive Scaling Fails
The authors diagnose two distinct failure modes in naive synthetic scaling:
- QA vs. Documents: Synthetic QAs are efficient at teaching recall behavior (how to think), but scale poorly for factual density. Documents provide the facts but lack the instructional signal for extraction.
- Lack of Diversity: Standard rephrasing (WRAP or Active Reading) often results in documents that are stylistically different but topically redundant. The model stops learning because it keeps seeing the same facts in different clothes.
Methodology: The "Mixed" and "Focal" Recipe
The paper introduces two surgical interventions to fix the scaling curve:
1. Synthetic Mixed Training
Instead of choosing between QAs and documents, the authors use a 1:1 mixing ratio.
- Insight: QAs teach the model "how to use knowledge" (transferable behavior), while documents provide the raw "fact storage."
- Synergy: The behavioral signal from QAs helps the model actually access the facts it learns from the document-based tokens.
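As a rough sketch (not the authors' released code), the 1:1 mix can be built by interleaving the two synthetic streams example-by-example; whether the paper balances at the example or token level is an assumption here:

```python
def mixed_training_set(qa_examples, doc_examples):
    """Interleave synthetic QA pairs and synthetic documents 1:1.

    zip() truncates to the shorter stream, so the final mix stays
    balanced. Example-level (rather than token-level) balancing is an
    assumption of this sketch, not a detail confirmed by the paper.
    """
    mixed = []
    for qa, doc in zip(qa_examples, doc_examples):
        mixed.append({"type": "qa", "text": qa})
        mixed.append({"type": "doc", "text": doc})
    return mixed

# Toy usage with two hypothetical streams:
qas = ["Q: Who wrote report X? A: Alice.", "Q: When? A: 2021."]
docs = ["Alice wrote report X in 2021.", "Report X covers topic Y."]
mix = mixed_training_set(qas, docs)
```

The interleaving matters less than the ratio: the point is that every batch carries both the behavioral signal (QA) and the factual payload (documents).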

2. Focal Rewriting
To solve the diversity problem, the authors propose Focal Rewriting. Instead of asking an LLM to "rewrite this document," they provide a specific question and command: "Focus on the question {query}. Rewrite this document so it is most useful for answering it."
- This forces the generator to highlight different facets of the source text, significantly increasing the semantic and lexical diversity of the training set.
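A minimal sketch of how such a focal-rewriting prompt could be assembled; the exact wording and the helper names below are assumptions for illustration, not the authors' released prompts:

```python
def focal_rewrite_prompt(document: str, query: str) -> str:
    """Build a focal-rewriting prompt: the generator must rewrite the
    source document so it best serves one specific question, pushing
    each rewrite toward a different facet of the text."""
    return (
        f"Focus on the question: {query}\n"
        "Rewrite the following document so it is most useful for "
        "answering that question.\n\n"
        f"Document:\n{document}"
    )

def focal_rewrites(document: str, queries: list[str]) -> list[str]:
    """One prompt per query; sending these to a teacher LLM yields
    topically distinct rewrites of the same source document."""
    return [focal_rewrite_prompt(document, q) for q in queries]

# Toy usage with a fictional document and two queries:
prompts = focal_rewrites(
    "ACME's Q3 revenue rose 12%, driven mainly by cloud sales.",
    ["What drove revenue growth?", "By how much did revenue rise?"],
)
```

Each query acts as a lens on the source text, which is what drives the diversity gain over a generic "rewrite this document" instruction.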

Experiments and Results
The team tested their recipe on Llama 3.1 and Qwen3 models (1.7B to 14B) across high-difficulty benchmarks: QuALITY (long-document comprehension), LongHealth (medical), and FinanceBench (finance).
Key Results:
- Log-Linear Scaling: Unlike previous methods that flattened at 100M tokens, this recipe continues to improve up to 700M tokens.
- Beating RAG: The trained Llama 8B model achieved 68.2% on QuALITY, surpassing RAG's 65.3%.
- Complementary Strength: The best performance was achieved by using the trained model within a RAG pipeline, resulting in a massive 9.1% relative gain.
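The log-linear claim can be made concrete: under such a law, accuracy grows as a + b·log(tokens), so every 10x increase in synthetic tokens buys a fixed number of accuracy points. The coefficients below are purely hypothetical placeholders for illustration, not values fitted to the paper's curves:

```python
import math

def log_linear_accuracy(tokens: float, a: float, b: float) -> float:
    """Accuracy under a log-linear scaling law: a + b * log10(tokens)."""
    return a + b * math.log10(tokens)

# Hypothetical coefficients (NOT from the paper): intercept a, and b
# points of accuracy gained per 10x increase in synthetic tokens.
a, b = 20.0, 6.0
gain_per_decade = (
    log_linear_accuracy(1e9, a, b) - log_linear_accuracy(1e8, a, b)
)
```

A plateauing method breaks this law (the per-decade gain shrinks toward zero), which is exactly what the paper reports naive synthetic scaling does past ~100M tokens.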

Industry Value & Conclusion
This work provides a clear engineering playbook for companies with proprietary data:
- Don't just rephrase: Use "Focal Rewriting" to squeeze every drop of information out of your source docs.
- Mix formats: A pure QA or pure document diet will lead to stagnation.
- Bigger is better for learning: Larger models (14B) are significantly more "token-efficient" at internalizing knowledge than smaller ones (1.7B).
The "RAG ceiling" is no longer an insurmountable barrier. With the right synthetic mixture, our models can finally "know" the data they represent.
