How can AI be used to remove duplicate records from literature?
AI can remove duplicate records from a literature collection by using natural language processing and machine learning to detect redundant documents. Automating this screening step is technically straightforward and saves substantial manual effort during research.
The core idea is to compare textual features such as titles, abstracts, and keywords using algorithms like TF-IDF or neural embeddings. Reliable results require standardized metadata and careful text preprocessing. The approach applies to journal articles, conference papers, and preprints, with care taken to avoid false positives among similar but distinct studies (for example, a conference paper and its extended journal version). Precision and recall should be monitored against a manually checked sample during implementation.
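As a concrete illustration of the TF-IDF comparison mentioned above, the following is a minimal sketch in Python using scikit-learn. The sample records, the combined title-plus-abstract field, and the 0.9 similarity threshold are illustrative assumptions, not values prescribed by any particular tool.

```python
# Minimal sketch: pairwise duplicate detection with TF-IDF + cosine similarity.
# Record contents and the 0.9 threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    {"title": "Deep learning for protein folding", "abstract": "We apply neural networks to structure prediction."},
    {"title": "Deep Learning for Protein Folding", "abstract": "We apply neural networks to structure prediction."},
    {"title": "Graph methods in citation analysis", "abstract": "A survey of graph-based bibliometric techniques."},
]

# Combine title and abstract into one normalized text field per record.
texts = [f"{r['title']} {r['abstract']}".lower() for r in records]

# Build TF-IDF vectors and compute all pairwise cosine similarities.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
sims = cosine_similarity(tfidf)

# Flag pairs above the assumed threshold as candidate duplicates for review.
THRESHOLD = 0.9
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if sims[i, j] >= THRESHOLD:
            print(f"Candidate duplicate: record {i} vs record {j} (sim={sims[i, j]:.2f})")
```

Flagged pairs should be treated as candidates rather than confirmed duplicates; the threshold is what you tune against precision and recall on a manually checked sample.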
Implementation begins with preprocessing the raw bibliographic data, including cleaning and normalization of titles, abstracts, and identifiers. Next, a similarity algorithm such as MinHash or BERT embeddings is applied to compute document similarities (see the sketch below). Threshold-based clustering then groups near-identical records, and manual verification resolves edge cases before the curated dataset is exported. This workflow can reduce manual screening time, often by roughly 70-80%, in systematic reviews and bibliometric studies, noticeably accelerating evidence synthesis.
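The sketch below shows one way to implement the MinHash step of this workflow using the datasketch library. The record texts, the word-shingling scheme, and the 0.8 Jaccard threshold are assumptions chosen for illustration; a BERT-embedding variant would follow the same structure with vector similarity in place of MinHash.

```python
# Minimal sketch: near-duplicate grouping with MinHash + LSH (datasketch).
# Record contents, shingle size, and the 0.8 threshold are assumptions.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word 3-shingles of the text."""
    tokens = text.lower().split()
    shingles = {" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))}
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf-8"))
    return m

records = {
    "rec1": "Deep learning for protein folding: a systematic evaluation of neural models",
    "rec2": "Deep Learning for Protein Folding - A Systematic Evaluation of Neural Models",
    "rec3": "Graph-based methods for citation network analysis",
}

# Index all signatures in an LSH structure keyed by record ID.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {}
for key, text in records.items():
    signatures[key] = minhash_of(text)
    lsh.insert(key, signatures[key])

# Query each record for near-duplicates above the threshold; the resulting
# groups go to manual verification before the curated dataset is exported.
for key, sig in signatures.items():
    matches = [k for k in lsh.query(sig) if k != key]
    if matches:
        print(f"{key} likely duplicates: {matches}")
```

MinHash with LSH scales to large collections because it avoids comparing every pair of records directly; only records that collide in the index are examined, which is why it is often preferred over exhaustive pairwise similarity for big bibliographic datasets.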
