How much does fine-tuning help open-source models catch up?
Fine-tuning on domain-specific data can erase the performance gap entirely and sometimes reverse it. In a study of medical billing code extraction from 499,601 radiology reports, a fine-tuned 4-billion-parameter open-source model (MediPhi-Instruct 4B) achieved an F1-score of 87.79%. On a real-world sample of 500 reports, it outperformed every proprietary model tested, including GPT-5, GPT-4.1, and Gemini 2.5 Flash: its F1 of 70.32% beat Gemini 2.5 Flash's 58.22% by a statistically significant margin [1]. This shows that a small, specialized open-source model can beat a much larger general-purpose proprietary one when trained on the right data.
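For readers unfamiliar with how an F1-score is computed when each report can carry several billing codes, here is a minimal Python sketch. It assumes micro-averaging over label sets; the study's exact averaging method is not stated here, and the function name and the label names "A" to "D" are hypothetical placeholders, not data from [1].

```python
from typing import Iterable, Set


def micro_f1(gold: Iterable[Set[str]], predicted: Iterable[Set[str]]) -> float:
    """Micro-averaged F1 over per-report label sets (assumed averaging scheme)."""
    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, predicted):
        tp += len(gold_codes & pred_codes)   # codes correctly predicted
        fp += len(pred_codes - gold_codes)   # codes predicted but not in gold
        fn += len(gold_codes - pred_codes)   # gold codes that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for three reports (illustration only).
gold = [{"A"}, {"B", "C"}, {"D"}]
pred = [{"A"}, {"B"}, {"D", "E"}]
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.2f}")  # 3 hits, 1 false positive, 1 miss
```

Micro-averaging pools every code decision across reports, so frequent codes dominate the score; macro-averaging would weight each code equally instead.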
The same pattern holds for medical evidence summarization. Fine-tuning open-source models like LongT5 on 8,161 pairs of systematic reviews and summaries brought their performance close to GPT-3.5's zero-shot results, and smaller fine-tuned models sometimes even outperformed larger zero-shot proprietary ones [7]. In ophthalmology question-answering, adding a retrieval-augmented generation (RAG) pipeline boosted open-source Llama-3's accuracy by 23.85 percentage points, nearly matching GPT-4-turbo's performance [4]. These results make a clear case: fine-tuning and RAG are powerful equalizers.
Where do proprietary models still hold a clear advantage?
On broad, zero-shot benchmarks that test general reasoning, coding, and multimodal understanding, proprietary models still maintain a lead. In gastroenterology clinical reasoning using board-style multiple-choice questions, the best proprietary model (o1-preview) scored 82.0% accuracy, while the best open-source model (Llama3.3-70b) reached only 65.7%, a gap of 16.3 percentage points [2]. Similarly, on a benchmark requiring models to implement novel machine learning research code from 2024-2025 papers, Gemini-2.5-Pro-Preview (proprietary) led with a 37.3% success rate, while the best open-source models lagged behind [6].
In multidimensional student skill assessment, proprietary models GPT-4o and Claude 3.7 Sonnet achieved 84.0% and 88.0% accuracy respectively, significantly outperforming open-source alternatives [9]. And in multimodal understanding, the open-source InternVL 1.5 achieved state-of-the-art results on 8 of 18 benchmarks, but still didn't surpass GPT-4V on all tasks [3]. The pattern is consistent: for tasks requiring broad knowledge, complex reasoning, or handling of diverse inputs without task-specific tuning, proprietary models still have an edge.
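Gaps like the gastroenterology one above are easy to misread, because an absolute difference in percentage points is not the same as a relative improvement. The short Python sketch below works through the two accuracies reported in [2]; the variable names are illustrative only.

```python
# Accuracies reported for the gastroenterology benchmark [2].
proprietary_acc = 82.0   # o1-preview, percent
open_source_acc = 65.7   # Llama3.3-70b, percent

# Absolute gap, in percentage points.
gap_points = round(proprietary_acc - open_source_acc, 1)

# Relative gap, as a percentage of the open-source score.
relative_gap = (proprietary_acc - open_source_acc) / open_source_acc * 100

print(f"absolute gap: {gap_points} percentage points")
print(f"relative gap: {relative_gap:.1f}% higher than the open-source score")
```

The same distinction matters when reading any "boosted accuracy by X" figure: it is worth checking whether X is in percentage points or is a relative change.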
What advantages do open-source models offer beyond raw performance?
Open-source models provide critical advantages in privacy, customization, and cost that proprietary models struggle to match. In a hospital setting, a locally deployed open-source LLM with RAG achieved 92.3% top-10 retrieval accuracy on administrative documents, all while keeping sensitive patient data on-premises [8]. Cloud-hosted proprietary models make this much harder, because sending patient data to an external service can conflict with data privacy regulations. Similarly, in radiology report simplification, the open-source Llama-3-70b was rated non-inferior to leading proprietary models in 4 out of 5 quality categories, while offering full transparency and the ability to run locally [5].
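As a rough illustration of what a "top-10 retrieval accuracy" figure measures, the Python sketch below counts a query as a hit when its relevant document appears among the first k retrieved results. It assumes one relevant document per query, which may differ from the evaluation setup in [8]; the function name and document IDs are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_ids: Sequence[Sequence[str]],
    relevant_ids: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of queries whose relevant document is in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_ids)


# Hypothetical ranked results for four queries (illustration only).
ranked = [
    ["d3", "d1", "d7"],
    ["d2", "d9", "d4"],
    ["d5", "d6", "d8"],
    ["d1", "d2", "d3"],
]
relevant = ["d1", "d4", "d0", "d3"]

print(top_k_accuracy(ranked, relevant, k=2))
print(top_k_accuracy(ranked, relevant, k=3))
```

Raising k can only keep or increase the score, so a top-10 figure should be read alongside how many results a user would realistically inspect.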
Quantization techniques further amplify these advantages. In ophthalmology QA, 4-bit quantization of open-source models proved as effective as 8-bit while requiring half the computational resources, making them viable in resource-constrained environments [4]. The BioMistral model, fine-tuned from Mistral on biomedical data, achieved competitive performance against proprietary counterparts while being freely available for customization [11]. And DeepSeek LLM 67B Chat, an open-source model, was shown to surpass GPT-3.5 in open-ended evaluations [10]. These findings show that for many real-world applications—especially in healthcare, education, and resource-limited settings—open-source models are not just catching up, but are already the practical choice.
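To make the 4-bit versus 8-bit comparison concrete, the following Python sketch estimates the memory needed to store model weights alone. It is a back-of-the-envelope calculation under stated assumptions: it ignores activations, the KV cache, and the small overhead of quantization scales, and the 70-billion-parameter size is chosen only because a 70B model is mentioned above, not because [4] reports these numbers.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes), weights only."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Assumed model size for illustration: 70 billion parameters.
params = 70e9

mem_16bit = weight_memory_gb(params, 16)
mem_8bit = weight_memory_gb(params, 8)
mem_4bit = weight_memory_gb(params, 4)

for bits, mem in [(16, mem_16bit), (8, mem_8bit), (4, mem_4bit)]:
    print(f"{bits:>2}-bit weights: about {mem:.0f} GB")
```

Under these assumptions, moving from 8-bit to 4-bit halves weight storage, which is the main reason lower-bit quantization can bring a large model within reach of a single on-premises GPU server.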
Sources used in this answer
[1] Comparison of proprietary and fine-tuned large language models for multi-label classification of billing codes from radiology reports
A fine-tuned 4B open-source model outperformed GPT-5 and Gemini 2.5 Flash in medical billing code extraction, achieving 70.32% F1 vs. 58.22% on real-world radiology reports.
[2] Benchmarking proprietary and open-source language and vision-language models for gastroenterology clinical reasoning
Proprietary models (o1-preview, 82.0%) outperformed open-source models (Llama3.3-70b, 65.7%) in gastroenterology clinical reasoning by over 16 percentage points.
[3] How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites
Open-source InternVL 1.5 achieved state-of-the-art results on 8 of 18 multimodal benchmarks, narrowing the gap with proprietary models like GPT-4V.
[4] Advancing Question-Answering in Ophthalmology With Retrieval-Augmented Generation: Benchmarking Open-Source and Proprietary Large Language Models
Adding RAG boosted open-source Llama-3's accuracy by 23.85% in ophthalmology QA, nearly matching GPT-4-turbo's performance.
[5] Performance of open-source and proprietary large language models in generating patient-friendly radiology chest CT reports
Open-source Llama-3-70b was rated non-inferior to leading proprietary models in 4 of 5 quality categories for generating patient-friendly radiology reports.
[6] ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
On novel ML research code implementation, the best proprietary model (Gemini-2.5-Pro-Preview) achieved 37.3% success, with open-source models trailing behind.
[7] Closing the gap between open source and commercial large language models for medical evidence summarization
Fine-tuning open-source LongT5 on medical summaries brought its performance close to GPT-3.5 zero-shot, with smaller fine-tuned models sometimes outperforming larger zero-shot ones.
[8] Evaluation of Chunking and Embedding Strategies for Local Document Retrieval Using an Open-Source LLM in a Hospital
A locally deployed open-source RAG system achieved 92.3% top-10 retrieval accuracy on hospital administrative documents, enabling privacy-preserving information retrieval.
[9] Assessing Multiple Student Skill Dimensions Using Large Language Models
Proprietary models GPT-4o (84.0%) and Claude 3.7 Sonnet (88.0%) significantly outperformed open-source models in multidimensional student skill assessment.
[10] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Open-source DeepSeek LLM 67B Chat surpassed GPT-3.5 in open-ended evaluations, demonstrating strong performance in code, math, and reasoning.
[11] BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains
Open-source BioMistral, fine-tuned on biomedical data, achieved competitive performance against proprietary counterparts on 10 medical QA tasks.
