A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling

[Nature Communications 2025] CheXficient: Beyond Aggressive Scaling with Intelligent Data Curation for Chest X-ray Foundation Models

Abstract

CheXficient is a chest X-ray (CXR) foundation model that breaks the "scale-at-all-costs" paradigm by employing a prototype-driven online data curation strategy. By training on only 22.7% of a 1.23M image-report corpus and using less than 27.3% of the compute budget, it achieves SOTA results across 20 benchmarks, outperforming models trained on 50x more data.

Executive Summary

TL;DR: Researchers from Stanford University challenge the "bigger is better" dogma in medical AI with CheXficient. This foundation model uses just 22.7% of the available data and less than 27.3% of the compute budget, yet outperforms models trained on tens of millions of images. By selectively curating informative samples rather than ingesting entire datasets wholesale, CheXficient shows that data quality and diversity can matter more than sheer volume for diagnostic accuracy.

Background Placement: This work represents a strategic shift from data-intensive to data-efficient AI. It sits at the intersection of active learning and foundation models, providing a blueprint for institutions to build SOTA medical AI without industrial-scale supercomputing clusters.

Problem & Motivation: The "Normal" Noise Floor

Most medical datasets follow a long-tail distribution. In chest X-rays, "Normal" findings and common pathologies dominate the data, while rare but life-threatening conditions (the "long tail") are barely represented.

Redundancy: Training repeatedly on thousands of near-identical "Normal" scans yields diminishing returns.
Inefficiency: Models like MedGemma utilize thousands of TPU-hours. This "scale-at-all-costs" approach is environmentally and financially unsustainable for most clinical researchers.

The authors' insight: Not all data points are created equal. A rare case of pulmonary fibrosis is arguably worth a hundred "Normal" scans for a model's representation learning.

Methodology: The Prototype-Driven Curator

CheXficient replaces random mini-batching with an Online Data Curator. Here is how the "intelligence" works:

Multimodal Embedding: Images and reports are mapped into a shared latent space using DINOv2 and BioClinicalBERT.
Prototype Evolution: The model maintains $K$ "prototypes" (centroids) that represent common clusters in the data.
The Selection Logic:
- Distance-based Prioritization: Samples farthest from any prototype (outliers/rare cases) are kept.
- Diversity-based Under-sampling: For samples near prototypes (redundant cases), the model uses Farthest Point Sampling (FPS) to keep only a diverse handful, discarding the rest.
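
The selection logic above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`curate_batch`, `farthest_point_sampling`, `update_prototypes`), the fractions `far_frac` and `near_keep`, and the exponential-moving-average prototype update are all hypothetical choices made for the example. The summary does not specify how prototypes are updated or what thresholds are used.

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    n = len(points)
    k = min(k, n)
    if k == 0:
        return np.empty(0, dtype=int)
    rng = np.random.default_rng(seed)
    first = int(rng.integers(n))
    chosen = [first]
    # Distance from every point to its nearest chosen point so far.
    min_dist = np.linalg.norm(points - points[first], axis=1)
    min_dist[first] = -1.0  # never re-select a chosen point
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
        min_dist[nxt] = -1.0
    return np.array(chosen)

def curate_batch(emb, prototypes, far_frac=0.1, near_keep=0.1):
    """Return sorted indices of the samples kept from one candidate batch."""
    # Distance of every sample to its nearest prototype.
    d = np.linalg.norm(emb[:, None, :] - prototypes[None, :, :], axis=2)
    nearest = d.min(axis=1)
    order = np.argsort(-nearest)          # farthest from any prototype first
    n_far = int(round(far_frac * len(emb)))
    far_idx = order[:n_far]               # distance-based prioritization: always kept
    near_idx = order[n_far:]              # redundant region: under-sample with FPS
    n_near = int(round(near_keep * len(near_idx)))
    kept_near = near_idx[farthest_point_sampling(emb[near_idx], n_near)]
    return np.sort(np.concatenate([far_idx, kept_near]))

def update_prototypes(prototypes, emb, momentum=0.9):
    """Assumed EMA update: move each prototype toward the mean of its assigned samples."""
    d = np.linalg.norm(emb[:, None, :] - prototypes[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    new = prototypes.copy()
    for j in range(len(prototypes)):
        members = emb[assign == j]
        if len(members):
            new[j] = momentum * prototypes[j] + (1 - momentum) * members.mean(axis=0)
    return new

# Toy usage with random vectors standing in for joint image-report embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 8))
prototypes = rng.normal(size=(4, 8))
kept = curate_batch(embeddings, prototypes)
prototypes = update_prototypes(prototypes, embeddings)
print(f"kept {len(kept)} of {len(embeddings)} samples")
```

In this sketch the outliers are kept unconditionally, while FPS spreads the remaining picks across the dense regions so that near-duplicates are not all retained.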

Figure: CheXficient architecture and strategy

Experiments & Results: Efficient Dominance

The model was tested across 20 benchmarks, including zero-shot classification, segmentation, and report generation.

1. Data & Compute Efficiency

CheXficient matched the performance of the full-data model (CheXfull) using only 280K samples, saving roughly 73% to 82% of H100 GPU-hours. At matched compute budgets, CheXficient consistently outperformed random sampling.
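
As a quick sanity check, the headline numbers are mutually consistent. The short script below uses only figures quoted in this summary (a 1.23M-pair corpus, a 22.7% data fraction, and 73% to 82% GPU-hour savings); the variable names are illustrative.

```python
corpus_size = 1_230_000   # 1.23M image-report pairs, as quoted in the abstract
data_fraction = 0.227     # 22.7% of the corpus

curated = corpus_size * data_fraction
print(f"curated subset: about {curated:,.0f} pairs")  # close to the quoted 280K

# Saving 73%-82% of GPU-hours means using 18%-27% of the full budget.
saved_low, saved_high = 0.73, 0.82
used_high, used_low = 1 - saved_low, 1 - saved_high
print(f"compute used: {used_low:.0%} to {used_high:.0%} of the full-data run")
```

The upper end of that range, about 27%, lines up with the "less than 27.3% of the compute budget" figure in the abstract.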

2. SOTA Comparison

Despite its small training footprint, CheXficient outperformed models trained on much larger corpora. For instance:

Report Generation: Outperformed RadFM (16M pairs) and Libra (1.2M pairs) on the ReXGradient-160K benchmark.
Zero-shot Classification: On both seen and unseen datasets, it maintained high AUROC scores, often surpassing models built with roughly 50x more data.
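
For readers unfamiliar with the metric, the sketch below shows one common way zero-shot classification is scored and evaluated with AUROC. It assumes a CLIP-style setup in which an image embedding is compared against a positive and a negative text-prompt embedding; the summary does not describe CheXficient's exact scoring protocol, so `zero_shot_scores`, the prompt pair, and the `temperature` value are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_scores(image_emb, pos_prompt_emb, neg_prompt_emb, temperature=0.07):
    """Softmax over cosine similarities to a positive and a negative prompt."""
    img = l2_normalize(image_emb)
    pos = l2_normalize(pos_prompt_emb)
    neg = l2_normalize(neg_prompt_emb)
    logits = np.stack([img @ neg, img @ pos], axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 1]                            # probability of the finding

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Toy usage: random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
images = rng.normal(size=(6, 16))
pos_prompt, neg_prompt = rng.normal(size=16), rng.normal(size=16)
scores = zero_shot_scores(images, pos_prompt, neg_prompt)
print(auroc([0, 1, 0, 1, 1, 0], scores))
```

No labeled training examples for the target finding are needed in this setup, which is what makes the evaluation "zero-shot".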

Figure: Performance benchmarks

3. Taming the Long Tail

Analysis of the feature space showed that the curated subset systematically captures the "low-density" regions of the data distribution. By over-representing rare diseases in the curated subset, CheXficient showed superior performance on conditions like Aortic Enlargement and Pulmonary Fibrosis compared to standard scaling.

Figure: Long-tail generalization

Critical Analysis & Conclusion

Takeaway: CheXficient demonstrates that the bottleneck for medical AI is not the quantity of data but its informativeness. The success of this prototype-based selection suggests that a "small-data" foundation model is not only possible but potentially more robust against bias.

Limitations:

Extreme Rarities: For diseases representing <0.1% of the data, curation alone isn't enough; targeted data acquisition is still needed.
Architecture: The study focused on ViT-Base; scaling the backbone (e.g., to ViT-Large) might offer even more gains when paired with curation.

Future Outlook: This methodology can be extended to 3D volumes (CT/MRI) and multi-modal tasks like Visual Question Answering (VQA). For the broader community, it signals the end of the "data-brute" era and the beginning of "curated intelligence."
