Are open-source LLMs catching up to proprietary models in performance?

How much does fine-tuning help open-source models catch up?

Fine-tuning on domain-specific data can close the performance gap and sometimes reverse it. In a study of medical billing code extraction from 499,601 radiology reports, a fine-tuned 4-billion-parameter open-source model (MediPhi-Instruct 4B) achieved an F1-score of 87.79%. On a real-world sample of 500 reports, it outperformed every proprietary model tested, including GPT-5, GPT-4.1, and Gemini 2.5 Flash: its F1 of 70.32% beat Gemini 2.5 Flash's 58.22% by a statistically significant margin [1]. This shows that a small, specialized open-source model can beat a much larger general-purpose proprietary one when trained on the right data.
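For readers unfamiliar with how an F1-score is computed for a multi-label task such as billing code extraction, here is a minimal sketch. It assumes micro-averaging over all (report, code) pairs; the study may use a different averaging scheme, and the function name and the billing codes below are hypothetical, not taken from the paper.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F1 over all (report, code) pairs.

    Assumption: micro-averaging; the cited study may average differently.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # codes predicted and present in the gold labels
        fp += len(p - g)  # codes predicted but absent from the gold labels
        fn += len(g - p)  # gold codes the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical billing codes for three reports (illustrative only).
gold = [{"A1", "B2"}, {"C3"}, {"A1"}]
pred = [{"A1"}, {"C3", "D4"}, {"A1"}]
score = micro_f1(gold, pred)
print(f"micro F1 = {score:.2f}")
```

In this toy example there are three correct codes, one spurious code, and one missed code, so precision and recall are both 3/4 and the F1-score is 0.75.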

The same pattern holds for medical evidence summarization. Fine-tuning open-source models like LongT5 on 8,161 pairs of systematic reviews and summaries brought their performance close to GPT-3.5's zero-shot results, and smaller fine-tuned models sometimes even outperformed larger zero-shot proprietary ones [7]. In ophthalmology question-answering, adding a retrieval-augmented generation (RAG) pipeline boosted open-source Llama-3's accuracy by 23.85 percentage points, nearly matching GPT-4-turbo's performance [4]. These results make a clear case: fine-tuning and RAG are powerful equalizers.

Where do proprietary models still hold a clear advantage?

On broad, zero-shot benchmarks that test general reasoning, coding, and multimodal understanding, proprietary models still maintain a lead. In gastroenterology clinical reasoning using board-style multiple-choice questions, the best proprietary model (o1-preview) scored 82.0% accuracy, while the best open-source model (Llama3.3-70b) reached only 65.7%—a gap of over 16 percentage points [2]. Similarly, on a benchmark requiring models to implement novel machine learning research code from 2024-2025 papers, Gemini-2.5-Pro-Preview (proprietary) led with a 37.3% success rate, while the best open-source models lagged behind [6].

In multidimensional student skill assessment, proprietary models GPT-4o and Claude 3.7 Sonnet achieved 84.0% and 88.0% accuracy respectively, significantly outperforming open-source alternatives [9]. And in multimodal understanding, the open-source InternVL 1.5 achieved state-of-the-art results on 8 of 18 benchmarks, but still didn't surpass GPT-4V on all tasks [3]. The pattern is consistent: for tasks requiring broad knowledge, complex reasoning, or handling of diverse inputs without task-specific tuning, proprietary models still have an edge.

What advantages do open-source models offer beyond raw performance?

Open-source models provide advantages in privacy, customization, and cost that cloud-hosted proprietary models struggle to match. In a hospital setting, a locally deployed open-source LLM with RAG achieved 92.3% top-10 retrieval accuracy on administrative documents, all while keeping sensitive patient data on-premises [8]. That is hard to achieve with cloud-based proprietary models, because data privacy regulations can restrict sending such data to an external provider. Similarly, in radiology report simplification, the open-source Llama-3-70b was rated non-inferior to leading proprietary models in 4 out of 5 quality categories, while offering full transparency and the ability to run locally [5].
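A metric like "top-10 retrieval accuracy" can be sketched as follows. This is an illustration under stated assumptions, not the study's evaluation code: it assumes each query has exactly one relevant document and counts a hit when that document appears among the first k ranked results. The function name and document IDs are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_ids: Sequence[Sequence[str]],
    relevant_ids: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of queries whose relevant document is in the first k results.

    Assumption: one relevant document per query.
    """
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one relevant id per query")
    if not ranked_ids:
        return 0.0
    hits = sum(
        1 for ranked, rel in zip(ranked_ids, relevant_ids) if rel in ranked[:k]
    )
    return hits / len(ranked_ids)


# Hypothetical retrieval results for three queries (illustrative only).
rankings = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = ["d1", "d4", "d0"]
acc_at_2 = top_k_accuracy(rankings, relevant, k=2)
acc_at_3 = top_k_accuracy(rankings, relevant, k=3)
print(acc_at_2, acc_at_3)
```

Raising k can only keep the score the same or increase it, which is why the cutoff should always be reported alongside the accuracy figure.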

Quantization techniques further amplify these advantages. In ophthalmology QA, 4-bit quantization of open-source models proved as effective as 8-bit while requiring half the computational resources, making them viable in resource-constrained environments [4]. The BioMistral model, fine-tuned from Mistral on biomedical data, achieved competitive performance against proprietary counterparts while being freely available for customization [11]. And DeepSeek LLM 67B Chat, an open-source model, was shown to surpass GPT-3.5 in open-ended evaluations [10]. These findings show that for many real-world applications—especially in healthcare, education, and resource-limited settings—open-source models are not just catching up, but are already the practical choice.
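To give a rough sense of why lower-bit quantization matters on limited hardware, the sketch below estimates the memory needed to store model weights alone at different bit widths. It is a back-of-the-envelope calculation under simplifying assumptions: it ignores quantization metadata (scales, zero points), activations, and the KV cache, and the 70-billion-parameter size is a hypothetical example rather than a figure from the cited study.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for model weights alone, in gigabytes (10^9 bytes).

    Assumption: every weight uses exactly bits_per_weight bits; real
    quantization formats add some overhead for scales and zero points.
    """
    return num_params * bits_per_weight / 8 / 1e9


params = 70e9  # hypothetical 70-billion-parameter model
mem_16 = weight_memory_gb(params, 16)
mem_8 = weight_memory_gb(params, 8)
mem_4 = weight_memory_gb(params, 4)
print(f"16-bit: {mem_16:.0f} GB, 8-bit: {mem_8:.0f} GB, 4-bit: {mem_4:.0f} GB")
```

Under these assumptions, moving from 8-bit to 4-bit halves the weight footprint, which is consistent in spirit with the resource savings described above, although actual compute and memory use depend on the runtime and hardware.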

Sources used in this answer

[1] Comparison of proprietary and fine-tuned large language models for multi-label classification of billing codes from radiology reports

A fine-tuned 4B open-source model outperformed GPT-5 and Gemini 2.5 Flash in medical billing code extraction, achieving 70.32% F1 vs. Gemini 2.5 Flash's 58.22% on real-world radiology reports.

2026 · Kamyar Arzideh, Henning Schäfer, Ahmad Idrissi-Yaghir, Bahadir Eryilmaz, Sina Warmer, Eva Maria Hartmann, Katarzyna Borys, Cynthia Sabrina Schmidt, Johannes Haubold, Lale Umutlu, Michael Forsting, Felix Nensa, René Hosch · European radiology

[2] Benchmarking proprietary and open-source language and vision-language models for gastroenterology clinical reasoning

Proprietary models (o1-preview, 82.0%) outperformed open-source models (Llama3.3-70b, 65.7%) in gastroenterology clinical reasoning by over 16 percentage points.

2025 · Seyed Amir Ahmad Safavi-Naini, Shuhaib Ali, Omer Shahab, Zahra Shahhoseini, Thomas Savage, Sara Rafiee, Jamil S Samaan, Reem Al Shabeeb, Farah Ladak, Jamie O Yang, Juan Echavarria, Sumbal Babar, Aasma Shaukat, Samuel Margolis, Nicholas P Tatonetti, Girish Nadkarni, Bara El Kurdi, Ali Soroush · NPJ digital medicine

[3] How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Open-source InternVL 1.5 achieved state-of-the-art results on 8 of 18 multimodal benchmarks, narrowing the gap with proprietary models like GPT-4V.

2024 · Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang · Sci. China Inf. Sci.

[4] Advancing Question-Answering in Ophthalmology With Retrieval-Augmented Generation: Benchmarking Open-Source and Proprietary Large Language Models

Adding RAG boosted open-source Llama-3's accuracy by 23.85% in ophthalmology QA, nearly matching GPT-4-turbo's performance.

2025 · Quang Nguyen, Duy-Anh Nguyen, Khang Dang, Siyin Liu, Sophia Y Wang, William A Woof, Peter B M Thomas, Praveen J Patel, Konstantinos Balaskas, Johan H Thygesen, Honghan Wu, Nikolas Pontikos · Translational vision science & technology

[5] Performance of open-source and proprietary large language models in generating patient-friendly radiology chest CT reports

Open-source Llama-3-70b was rated non-inferior to leading proprietary models in 4 of 5 quality categories for generating patient-friendly radiology reports.

2025 · Philipp Prucker, Felix Busch, Felix Dorfner, Christian J Mertens, Nadine Bayerl, Marcus R Makowski, Keno K Bressem, Lisa C Adams · Clinical imaging

[6] ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

On novel ML research code implementation, the best proprietary model (Gemini-2.5-Pro-Preview) achieved 37.3% success, with open-source models trailing behind.

2025 · Tianyu Hua, Harper Hua, Violet Xiang, Benjamin Klieger, Sang T. Truong, Weixin Liang, Fan-Yun Sun, Nick Haber · arXiv.org

[7] Closing the gap between open source and commercial large language models for medical evidence summarization

Fine-tuning open-source LongT5 on medical summaries brought its performance close to GPT-3.5 zero-shot, with smaller fine-tuned models sometimes outperforming larger zero-shot ones.

2024 · Gongbo Zhang, Qiao Jin, Yiliang Zhou, Song Wang, Betina Idnay, Yiming Luo, Elizabeth Park, Jordan G. Nestor, Matthew E. Spotnitz, Ali Soroush, Thomas Campion, Zhiyong Lu, Chunhua Weng, Yifan Peng · NPJ digital medicine

[8] Evaluation of Chunking and Embedding Strategies for Local Document Retrieval Using an Open-Source LLM in a Hospital

A locally deployed open-source RAG system achieved 92.3% top-10 retrieval accuracy on hospital administrative documents, enabling privacy-preserving information retrieval.

2025 · Jan Bossenz, Carlo Günzl, Fabian Berns, Annemarie Weise, Christian Jäger, Jan Kirchhoff, Jan Christoph, Christoph Demus · Studies in health technology and informatics

[9] Assessing Multiple Student Skill Dimensions Using Large Language Models

Proprietary models GPT-4o (84.0%) and Claude 3.7 Sonnet (88.0%) significantly outperformed open-source models in multidimensional student skill assessment.

2025 · H. K. Yuen, Hang Man, Ronnie Cheung · International Conference on Teaching, Assessment, and Learning for Engineering

[10] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Open-source DeepSeek LLM 67B Chat surpassed GPT-3.5 in open-ended evaluations, demonstrating strong performance in code, math, and reasoning.

2024 · Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, C. Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wen-Hui Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y. K. Li, W. Liang, Fangyun Lin, A. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, X. Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Z. Ren, C. Ruan, Zhangli Sha, Zhihong Shao, Jun-Mei Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, M. Tang, Bing-Li Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Yu Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yi Xiong, Hanwei Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yu-mei You, Shuiping Yu, Xin-yuan Yu, Bo Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghu Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou · arXiv.org

[11] BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains

Open-source BioMistral, fine-tuned on biomedical data, achieved competitive performance against proprietary counterparts on 10 medical QA tasks.

2024 · Yanis Labrak, Adrien Bazoge, Emmanuel Morin, P. Gourraud, Mickael Rouvier, Richard Dufour · ACL (Findings)
