How much can you actually shrink an LLM without breaking it?
The headline numbers are striking. One study compressed a 13.5 GB teacher model down to 264 MB — a 51.1× reduction — while the student still achieved 96.2% of the teacher's performance on industrial IoT tasks like predictive maintenance and fault diagnosis [8]. That means a model that originally required a server-grade GPU can run on a Raspberry Pi or NVIDIA Jetson Nano, with inference latency cut by 24.7×, enabling real-time processing on devices that cost under $100.
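The size figures above can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 1000 MB); the 4 GB device RAM and the 50% usable-memory fraction are illustrative assumptions, not values from [8].

```python
def compression_ratio(teacher_mb: float, student_mb: float) -> float:
    """Return how many times smaller the student is than the teacher."""
    if student_mb <= 0:
        raise ValueError("student size must be positive")
    return teacher_mb / student_mb


def fits_in_memory(model_mb: float, device_ram_mb: float,
                   usable_fraction: float = 0.5) -> bool:
    """Rough check: does the model fit in the share of RAM we allow it?

    usable_fraction is a hypothetical safety margin for the OS and activations.
    """
    return model_mb <= device_ram_mb * usable_fraction


teacher_mb = 13.5 * 1000  # 13.5 GB teacher, decimal units assumed
student_mb = 264.0        # 264 MB student

ratio = compression_ratio(teacher_mb, student_mb)
performance_drop = 100.0 - 96.2  # student keeps 96.2% of teacher performance

print(f"compression: {ratio:.1f}x")              # compression: 51.1x
print(f"performance drop: {performance_drop:.1f}%")

device_ram_mb = 4000.0  # assumed 4 GB board, for illustration only
print(fits_in_memory(teacher_mb, device_ram_mb))  # False
print(fits_in_memory(student_mb, device_ram_mb))  # True
```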
Even more dramatic compression is possible for vision-language models. EdgeVL shrank a CLIP-style model by up to 93-fold while improving open-vocabulary classification accuracy by up to 15.4% on multiple datasets [6]. The trick was combining dual-modality knowledge distillation with quantization-aware contrastive learning, so the student did not just mimic outputs but also preserved feature quality after quantization.
For pure language tasks, a Korean language model was shrunk from 432 MB (110 million parameters) to just 18 MB (4 million parameters), yet it retained over 97% of the teacher's performance across six NLP benchmarks [7]. The source reports this as an 8.15× reduction, although the file sizes quoted here work out to 24× (432 / 18), so the two figures presumably refer to different measurements. On sentiment classification, the tiny student actually beat the teacher, reaching 89.72% accuracy. This suggests that for some tasks a distilled model may generalize better, possibly by avoiding overfitting.
What's the catch? When does distillation fail to deliver?
The glowing numbers above come from carefully controlled studies. In practice, distillation's effectiveness depends heavily on the task, the teacher-student architecture gap, and the deployment environment. A 2025 survey of edge deployment techniques found that while knowledge distillation can achieve a 4000× parameter reduction with 'comparable performance,' the word 'comparable' hides significant variation — on some tasks, performance degradation is real and measurable [1].
One major issue is 'knowledge mismatch' between teacher and student. A study on road-side traffic perception found that standard distillation often leaves the student model with redundant knowledge while missing critical information [3]. Their solution, a bidirectional knowledge interaction mechanism with a PID control algorithm, improved mean average precision (mAP50) by only 1.17 percentage points over the baseline student model (from 77.84% to 79.01%). That is a modest gain, suggesting that even a purpose-built distillation scheme yields limited improvement on complex visual tasks.
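For reference, "standard" distillation usually means the response-based loss popularized by Hinton et al.: a temperature-softened KL term between teacher and student outputs, blended with ordinary cross-entropy on the true label. The sketch below is a minimal pure-Python version of that generic loss; the temperature, the `alpha` weight, and the example logits are illustrative assumptions, and it is not the bidirectional method of [3].

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term with hard-label cross-entropy.

    The soft term is scaled by temperature**2 so that its gradient
    magnitude stays comparable as the temperature changes.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = temperature ** 2 * kl_divergence(soft_teacher, soft_student)
    hard_term = -math.log(softmax(student_logits)[label] + 1e-12)
    return alpha * soft_term + (1 - alpha) * hard_term


teacher = [4.0, 1.0, 0.0]        # hypothetical teacher logits
close_student = [3.5, 1.2, 0.1]  # roughly agrees with the teacher
far_student = [0.0, 1.0, 4.0]    # disagrees with the teacher

print(distillation_loss(close_student, teacher, label=0))
print(distillation_loss(far_student, teacher, label=0))
```

A student that only matches the teacher's final outputs, as here, receives no signal about which intermediate knowledge matters, which is one way to read the mismatch problem described above.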
Another limitation: decoder-only models (like GPT-style architectures) are much harder to compress than encoder models. An industry-focused study found that while encoder models could be aggressively distilled for commercial edge applications, decoder models 'are resistant to a comparable degree of compression' [9]. This matters because most modern LLMs for text generation are decoder-only. The paper's solution — multistage low-rank fine-tuning — helped but required careful slicing of the decoder, and the compression gains were less dramatic than for encoders.
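The "low-rank" idea behind such methods can be illustrated with simple parameter counting: a dense matrix of shape `d_out × d_in` (or an update to it) is expressed as the product of two thin matrices with inner dimension `rank`. The sketch below shows only that arithmetic; the 4096 × 4096 layer and rank 64 are hypothetical values, and how [9] applies low-rank fine-tuning across its stages is not described here.

```python
def dense_params(d_out: int, d_in: int) -> int:
    """Parameters in a full d_out x d_in weight matrix."""
    return d_out * d_in


def low_rank_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a rank-r product A (d_out x r) @ B (r x d_in)."""
    if rank <= 0 or rank > min(d_out, d_in):
        raise ValueError("rank must be between 1 and min(d_out, d_in)")
    return rank * (d_out + d_in)


d_out, d_in, rank = 4096, 4096, 64  # hypothetical layer shape and rank

full = dense_params(d_out, d_in)
thin = low_rank_params(d_out, d_in, rank)

print(full)         # 16777216
print(thin)         # 524288
print(full / thin)  # 32.0
```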
Security is another hidden cost. A 2025 study on federated learning showed that distilled models on edge devices are more vulnerable to backdoor attacks: the distillation process can inadvertently help attackers embed malicious patterns, with attack success rates up to 75.4% higher than those of existing attack methods [2]. Even under defense mechanisms, the attack success rate remained above 90% in some cases. So if you're deploying a distilled model in a security-sensitive application (e.g., autonomous driving or healthcare), you need additional safeguards.
What actually makes distillation work on a phone or a Raspberry Pi?
The best results come from combining distillation with other compression tricks, not using it alone. A skin disease diagnosis system used multi-teacher knowledge distillation (MTAKD) and achieved a student model that was 49.8× smaller and 352× faster than the teacher, while maintaining 87.53% accuracy on the ISIC 2019 dataset [4]. The key innovation was using 'dynamic teacher agreement' — weighting knowledge from multiple teacher models based on how much they agreed on each input — which improved accuracy by 0.75% over the best existing framework.
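One way to picture agreement-based weighting is to give each teacher a per-input weight that shrinks as its prediction drifts from the teachers' consensus. The sketch below is an illustration of that general idea only: the `exp(-KL)` scoring rule, the function names, and the example probabilities are assumptions, not the formula used by MTAKD [4].

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def agreement_weights(teacher_probs):
    """Weight each teacher by how close it is to the mean prediction."""
    n_teachers = len(teacher_probs)
    n_classes = len(teacher_probs[0])
    consensus = [sum(t[c] for t in teacher_probs) / n_teachers
                 for c in range(n_classes)]
    # Hypothetical scoring rule: teachers far from consensus score lower.
    scores = [math.exp(-kl_divergence(t, consensus)) for t in teacher_probs]
    total = sum(scores)
    return [s / total for s in scores]


def blended_target(teacher_probs):
    """Soft target for the student: agreement-weighted teacher average."""
    weights = agreement_weights(teacher_probs)
    n_classes = len(teacher_probs[0])
    return [sum(w * t[c] for w, t in zip(weights, teacher_probs))
            for c in range(n_classes)]


teachers = [
    [0.8, 0.1, 0.1],  # teacher A
    [0.7, 0.2, 0.1],  # teacher B, agrees with A
    [0.1, 0.1, 0.8],  # teacher C, the outlier on this input
]
print(agreement_weights(teachers))
print(blended_target(teachers))
```

Because the weights are recomputed for every input, a teacher that is an outlier on one image can still dominate on another where the others disagree among themselves.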
Quantization is the most common partner. Edge-LLM, a collaborative framework for serving LLMs on edge devices, combined adaptive quantization with a cache mechanism and a value-density-first scheduling algorithm [5]. The result: 17× faster overall computation, 63% fewer task timeouts, and 43% less GPU overhead. Separately, the survey in [1] reports that quantization and pruning can reduce memory footprint by up to 75% with minimal accuracy loss.
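To make the memory arithmetic concrete: storing weights as 8-bit integers instead of 32-bit floats uses one byte per weight instead of four, which is a 75% saving. The sketch below shows plain symmetric per-tensor int8 quantization as a generic illustration. It is not Edge-LLM's adaptive scheme, the example weights are made up, and whether the "up to 75%" figure was derived this way is not stated in the sources.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


def memory_saving(bits_before: int, bits_after: int) -> float:
    """Fraction of weight memory saved by lowering the bit width."""
    return 1 - bits_after / bits_before


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical float32 weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print(codes)
print(restored)
print(memory_saving(32, 8))  # 0.75
```

The round-trip error per weight is bounded by half the scale, which is why tensors with a few large outliers tend to lose more precision under a single shared scale.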
Hardware-aware co-optimization takes this further. The EdgeDistill framework introduced a 'Hardware-Perception Adaptive Quantization-Distillation' module that simultaneously performs mixed-precision quantization and knowledge distillation in a single training pass [8]. This means the student model is compressed and knowledge-enhanced at the same time, tailored to the specific memory and latency constraints of the target device (e.g., Jetson Nano vs. Raspberry Pi). The result was a 51.1× compression with only 3.8% performance drop — a practical sweet spot.
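The "tailored to the device" part can be pictured as a bit-width allocation problem: keep sensitive layers at higher precision and lower the rest until the model fits the memory budget. The sketch below is a simple greedy heuristic written for illustration. The layer sizes, sensitivity scores, and budget are hypothetical, it covers only the bit allocation and not the joint distillation training, and it is not the algorithm used by EdgeDistill [8].

```python
def model_bytes(layers, bits):
    """Total weight memory given each layer's assigned bit width."""
    return sum(n_params * bits[name] / 8 for name, n_params, _ in layers)


def assign_bit_widths(layers, budget_bytes, choices=(8, 4, 2)):
    """Greedily lower precision, least sensitive layers first.

    layers: list of (name, n_params, sensitivity) tuples, where a higher
    sensitivity means the layer tolerates quantization worse (assumed given).
    """
    bits = {name: choices[0] for name, _, _ in layers}
    by_sensitivity = sorted(layers, key=lambda layer: layer[2])
    for level in choices[1:]:
        for name, _, _ in by_sensitivity:
            if model_bytes(layers, bits) <= budget_bytes:
                return bits
            bits[name] = level
    if model_bytes(layers, bits) > budget_bytes:
        raise ValueError("budget cannot be met even at the lowest precision")
    return bits


layers = [
    ("embed", 1_000_000, 0.9),  # hypothetical: most sensitive
    ("attn", 2_000_000, 0.5),
    ("mlp", 4_000_000, 0.1),    # hypothetical: least sensitive
]
plan = assign_bit_widths(layers, budget_bytes=4_500_000)
print(plan)                       # {'embed': 8, 'attn': 4, 'mlp': 4}
print(model_bytes(layers, plan))  # 4000000.0
```

A different target device would simply pass a different `budget_bytes`, which is the sense in which the resulting precision plan is hardware-specific.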
For real-time applications like machine health prognosis, online distillation — where the student learns continuously from the teacher during deployment — can be more effective than one-shot distillation. A 2024 study showed that simple student networks could match complex teacher networks on fault prediction tasks after online distillation, using response-based, feature-based, and relation-based knowledge transfer modules [10]. The adaptive mutual learning strategy they used accounted for the inherent differences between simple and complex networks, preventing the student from being overwhelmed.
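The three kinds of knowledge transfer named above differ in what the student is asked to match: final predictions (response), intermediate activations (feature), or the geometry between samples in a batch (relation). The sketch below shows one simple loss for each, assuming a regression-style prognosis output and student and teacher features of equal width. The function names, the equal loss weights, and the toy data are illustrative assumptions, not the modules from [10].

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def response_loss(student_preds, teacher_preds):
    """Match the teacher's final predictions, one scalar per sample."""
    return mse(student_preds, teacher_preds)


def feature_loss(student_feats, teacher_feats):
    """Match intermediate feature vectors element by element."""
    flat_s = [v for feat in student_feats for v in feat]
    flat_t = [v for feat in teacher_feats for v in feat]
    return mse(flat_s, flat_t)


def pairwise_distances(batch):
    """Euclidean distance between every pair of samples in the batch."""
    return [math.dist(a, b) for i, a in enumerate(batch) for b in batch[i + 1:]]


def relation_loss(student_feats, teacher_feats):
    """Match the distances between samples rather than the features."""
    return mse(pairwise_distances(student_feats),
               pairwise_distances(teacher_feats))


teacher_feats = [[0.0, 1.0], [2.0, 0.0], [1.0, 3.0]]
# Hypothetical student whose features are the teacher's shifted by 1.0.
student_feats = [[v + 1.0 for v in feat] for feat in teacher_feats]

print(feature_loss(student_feats, teacher_feats))   # 1.0
print(relation_loss(student_feats, teacher_feats))  # 0.0

total = (response_loss([0.9, 0.4, 0.1], [1.0, 0.5, 0.0])
         + feature_loss(student_feats, teacher_feats)
         + relation_loss(student_feats, teacher_feats))
print(total)
```

The shifted student illustrates the difference: its features are all wrong by a constant, so the feature loss is nonzero, but the distances between samples are preserved, so the relation loss is zero.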
Sources used in this answer
[1] Edge intelligence unleashed: a survey on deploying large language models in resource-constrained environments
A 2025 survey found that knowledge distillation can achieve 4000× parameter reduction with comparable performance, while quantization and pruning reduce memory footprint by up to 75% with minimal accuracy loss.
[2] LBKD: Rethinking Federated Backdoors for Low-Altitude Economy via LLMs and Bidirectional Knowledge Distillation
LBKD showed that distilled edge models are more vulnerable to backdoor attacks, with attack success rates up to 75.4% higher than existing methods, and remaining above 90% even under defense mechanisms.
[3] An acquisitive knowledge distillation method for deployment at the edge of the road
KAKD improved traffic perception mAP50 by only 1.17% over the base YOLOv8n model, showing that standard distillation suffers from knowledge mismatch and limited gains on complex visual tasks.
[4] MTAKD: multi-teacher agreement knowledge distillation for edge AI skin disease diagnosis
MTAKD achieved a student model 49.8× smaller and 352× faster than the teacher, with 87.53% accuracy on skin disease diagnosis, 0.75% above the best existing framework.
[5] Edge-LLM: A Collaborative Framework for Large Language Model Serving in Edge Computing
Edge-LLM combined adaptive quantization with scheduling to achieve 17× faster computation, 63% fewer timeouts, and 43% less GPU overhead on edge devices.
[6] Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities
EdgeVL achieved up to 93-fold model size reduction and 15.4% accuracy improvements on vision-language tasks by combining dual-modality distillation with quantization-aware contrastive learning.
[7] Lightweight Pre-Trained Korean Language Model Based on Knowledge Distillation and Low-Rank Factorization
A Korean language model was shrunk 8.15× (from 432 MB to 18 MB) while retaining 97.4% of teacher performance, and even surpassed the teacher on sentiment classification (89.72% vs. teacher's score).
[8] EdgeDistill: A Knowledge Distillation Approach for Deploying Large Language Models on Resource-Constrained Edge Devices in Industrial IoT
EdgeDistill compressed a 13.5 GB LLM to 264 MB (51.1× reduction) while achieving 96.2% of teacher performance and 24.7× faster inference on Jetson Nano and Raspberry Pi.
[9] Efficiently Distilling LLMs for Edge Applications
Decoder-only models are resistant to compression compared to encoders; MLFS achieved high-quality encoder models for edge but required careful slicing for decoders.
[10] Online Knowledge Distillation for Machine Health Prognosis Considering Edge Deployment
Online knowledge distillation with adaptive mutual learning enabled simple student networks to match complex teacher networks on machine health prognosis tasks, enabling edge deployment.
