Can instruction tuning actually make hallucinations worse?
Yes, if the fine-tuning data includes facts the model did not learn during pre-training. A controlled study on closed-book question answering found that when fine-tuning introduced new factual knowledge, the model learned those examples much more slowly than examples consistent with its existing knowledge. As it eventually did learn them, its tendency to hallucinate increased linearly [4]. This suggests that instruction tuning is better at teaching the model how to use what it already knows than at injecting new facts, and that forcing new facts in can backfire.
The same study showed that examples consistent with the model's pre-existing knowledge were learned quickly and didn't increase hallucinations, supporting the view that instruction tuning works best when it reinforces existing knowledge rather than adding new information [4].
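One practical reading of this finding is to check, before fine-tuning, which training examples the base model can already answer. The sketch below is only an illustration of that idea, not the procedure used in [4]: `split_by_prior_knowledge`, `answer_fn`, and the toy lookup model are hypothetical names, and exact string matching is a simplifying assumption.

```python
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def split_by_prior_knowledge(
    examples: Iterable[Example],
    answer_fn: Callable[[str], str],
) -> Tuple[List[Example], List[Example]]:
    """Split examples into those the base model already answers correctly
    ("known") and those it does not ("unknown")."""
    known: List[Example] = []
    unknown: List[Example] = []
    for question, gold in examples:
        predicted = answer_fn(question)
        # Assumption: case-insensitive exact match is a good enough check.
        if predicted.strip().lower() == gold.strip().lower():
            known.append((question, gold))
        else:
            unknown.append((question, gold))
    return known, unknown


# Toy stand-in for the base model's closed-book answers (hypothetical).
_toy_model = {"capital of france?": "Paris"}


def toy_answer(question: str) -> str:
    return _toy_model.get(question.lower(), "I don't know")


known, unknown = split_by_prior_knowledge(
    [("Capital of France?", "paris"), ("Capital of Atlantis?", "Poseidonia")],
    toy_answer,
)
print(len(known), "known;", len(unknown), "unknown")
```

In a real setup `answer_fn` would wrap a call to the base model, and the "unknown" bucket is the one to treat with caution when building the tuning set.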
What makes instruction tuning actually reduce hallucinations?
The most effective approaches pair instruction tuning with a separate verification or correction step. In a medical feature extraction task, a two-phase framework used instruction tuning to teach the model the task, then added a second phase that penalized overconfident wrong answers. This reduced hallucinations by 89.9% (from 3,081 to 311 hallucinated features) and missing features by 88.9% on a private test set of nearly 2,000 patient notes [1]. The framework achieved F1-scores of 0.968–0.983 on the full dataset, outperforming standard in-context learning approaches.
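The summary of [1] does not spell out how the second phase penalizes overconfidence, so the following is a minimal sketch of one common form of confidence regularization: cross-entropy with an entropy bonus. The function name, the `beta` weight, and the example probabilities are assumptions made for illustration.

```python
import math
from typing import Sequence


def confidence_penalized_loss(
    probs: Sequence[float], target: int, beta: float = 0.1
) -> float:
    """Cross-entropy minus beta times the entropy of the prediction.

    Low-entropy (very confident) predictions receive a smaller entropy
    bonus, so confident mistakes are discouraged more than hedged ones.
    """
    eps = 1e-12  # guards against log(0)
    cross_entropy = -math.log(probs[target] + eps)
    entropy = -sum(p * math.log(p + eps) for p in probs)
    return cross_entropy - beta * entropy


# Both predictions are wrong about class 1; the first is far more confident.
confident_wrong = confidence_penalized_loss([0.98, 0.01, 0.01], target=1)
hedged_wrong = confidence_penalized_loss([0.50, 0.25, 0.25], target=1)
print(confident_wrong > hedged_wrong)
```

Setting `beta` to 0 recovers plain cross-entropy, which makes it easy to compare a run with and without the penalty.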
Similarly, a biomedical information extraction system used an external verifier that was instruction-tuned on both correct and incorrect examples. This verifier first identified missing entities and relations, then filtered out wrong ones, boosting F1 scores by up to 20% over in-context learning alone [2]. In the mental health domain, a lightweight type-verification component checked the outputs of an instruction-tuned LLM and fed corrections back, significantly improving extraction accuracy while keeping computational costs low [6].
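The two verifier steps described for [2] (recover what was missed, then filter what is wrong) can be sketched as a small pipeline. This is an assumed structure, not the system's actual code: `verify_extractions`, `propose_missing`, and `is_supported` are hypothetical, and the toy callables below stand in for calls to an instruction-tuned verifier model.

```python
from typing import Callable, Iterable, List, Set, Tuple

Relation = Tuple[str, str, str]  # (head entity, relation type, tail entity)


def verify_extractions(
    text: str,
    extracted: Iterable[Relation],
    propose_missing: Callable[[str, Set[Relation]], Iterable[Relation]],
    is_supported: Callable[[str, Relation], bool],
) -> List[Relation]:
    """Step 1: add relations the extractor missed.
    Step 2: drop relations the verifier does not accept."""
    candidates = set(extracted)
    candidates.update(propose_missing(text, candidates))
    return sorted(r for r in candidates if is_supported(text, r))


# Toy stand-ins for the verifier (hypothetical).
def toy_propose_missing(text: str, found: Set[Relation]) -> List[Relation]:
    extra = ("ibuprofen", "treats", "fever")
    return [] if extra in found else [extra]


def toy_is_supported(text: str, relation: Relation) -> bool:
    # Accept a relation only if it appears literally in the text.
    return " ".join(relation) in text.lower()


note = "Aspirin treats headache. Ibuprofen treats fever."
raw = [("aspirin", "treats", "headache"), ("aspirin", "treats", "fever")]
verified = verify_extractions(note, raw, toy_propose_missing, toy_is_supported)
print(verified)
```

Running the recovery step before the filtering step means that newly proposed relations also pass through the same check as the extractor's own output.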
A prompting-based framework for detecting self-contradictions in instruction-tuned models achieved an F1 score of around 80% when applied to ChatGPT, and found that 17.7% of ChatGPT's sentences contained self-contradictions [3]. The mitigation algorithm iteratively removed contradictory information without needing external knowledge, showing that many errors can be caught even without retrieval.
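As a loose illustration of iterative self-contradiction removal, the sketch below re-generates each sentence and drops it when the two versions disagree. It is a simplification under stated assumptions rather than the algorithm from [3]: `resample` and `contradicts` are hypothetical callables that would be LLM prompts in practice, and the toy versions here are deterministic stand-ins.

```python
from typing import Callable, List


def remove_self_contradictions(
    sentences: List[str],
    resample: Callable[[List[str], str], str],
    contradicts: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> List[str]:
    """Repeatedly drop sentences that conflict with a re-generated
    alternative, until a full pass removes nothing."""
    current = list(sentences)
    for _ in range(max_rounds):
        kept: List[str] = []
        for sentence in current:
            # Ask for a second version of this sentence given the kept context.
            alternative = resample(kept, sentence)
            if not contradicts(sentence, alternative):
                kept.append(sentence)
        if len(kept) == len(current):
            break  # stable: nothing was removed in this pass
        current = kept
    return current


# Toy stand-ins (hypothetical): the "model" is unsure about one date.
def toy_resample(context: List[str], sentence: str) -> str:
    if sentence == "The bridge opened in 1932.":
        return "The bridge opened in 1937."
    return sentence


def toy_contradicts(a: str, b: str) -> bool:
    return a != b


cleaned = remove_self_contradictions(
    ["Sydney is in Australia.", "The bridge opened in 1932."],
    toy_resample,
    toy_contradicts,
)
print(cleaned)
```

No external knowledge source is consulted at any point; the only signal is whether the model agrees with itself.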
So, should you use instruction tuning to make your LLM more truthful?
Yes, but only if you add a verification or confidence-penalty mechanism. Instruction tuning alone is not a reliable fix for hallucinations, and it can even introduce new ones if the training data contains unfamiliar facts [4]. The evidence consistently shows that the best results come from a two-stage approach: first, instruction-tune the model to follow the task format, then add a separate component (verifier, confidence penalty, or type-checker) that catches and corrects errors [1][2][6].
A 2026 review of fact-checking methods concluded that instruction tuning, multi-agent reasoning, and retrieval-augmented generation (RAG) all help, but domain-specific fine-tuning and validated external evidence are critical for factual consistency [5]. In short, instruction tuning is a useful tool but not a magic bullet: you need to pair it with something that checks the model's work.
Sources used in this answer
[1] Medical Feature Extraction From Clinical Examination Notes: Development and Evaluation of a Two-Phase Large Language Model Framework
A two-phase framework combining instruction tuning with confidence-regularization reduced hallucinations by 89.9% (from 3,081 to 311 features) and missing features by 88.9% in medical text extraction, achieving F1-scores of 0.968–0.983.
[2] Towards Instruction-Tuned Verification for Improving Biomedical Information Extraction with Large Language Models
An external verifier instruction-tuned on positive and negative examples improved biomedical NER and RE F1 scores by up to 20% over in-context learning alone.
[3] Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
Self-contradictions occurred in 17.7% of ChatGPT's sentences; a prompting-based detector achieved ~80% F1 score and could mitigate contradictions without external knowledge.
[4] Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
Fine-tuning LLMs on new factual knowledge increased hallucination rates linearly as the model learned those examples, while examples consistent with pre-existing knowledge did not increase hallucinations.
[5] Hallucination to truth: a review of fact-checking and factuality evaluation in large language models
A 2026 review found that instruction tuning, multi-agent reasoning, and RAG improve factual consistency, but domain-specific fine-tuning and validated external evidence remain essential.
[6] Improving unified information extraction in Chinese mental health domain with instruction-tuned LLMs and type-verification component
Instruction-tuned LLMs combined with a lightweight type-verification component significantly improved extraction accuracy in Chinese mental health texts while reducing computational demands.
