Annotating and indexing scientific articles with rare diseases

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Around 30 million people in Europe are affected by a rare (or orphan) disease, defined as a condition occurring in fewer than 1 in 2,000 individuals. The primary challenge is to automatically and efficiently identify scientific articles and guidelines that address a particular rare disease. We present a novel methodology to annotate and index scientific text with taxonomical concepts describing rare diseases from the OrphaNet taxonomy. The task is complicated by several technical challenges, including the lack of sufficiently large, human-annotated datasets for supervised training and the polysemy, synonymy, and surface-form variation of rare disease names, which can hinder any annotation engine.

Results: We introduce a framework that operationalizes OrphaNet for large-scale literature annotation by integrating the TERMite engine with curated synonym expansion, label normalization (including deprecated and renamed concepts), and fuzzy matching. On benchmark datasets, the approach achieves 92% precision, 75% recall, and an F1 score of 83%, outperforming a string-matching baseline. Applying the pipeline to Scopus produces disease-specific corpora suitable for bibliometric and scientometric analyses (e.g., institution, country, and subject-area profiles). These outputs power the Rare Diseases Monitor dashboard for exploring national and global research activity.

Conclusion: To our knowledge, this is the first systematic, scalable semantic framework for annotating and indexing rare disease literature. By operationalizing OrphaNet in an automated, reproducible pipeline and addressing data scarcity and lexical variability, the work advances biomedical semantics for rare diseases and enables disease-centric monitoring, evaluation, and discovery across the research landscape.
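
The dictionary-based steps named in the Results (synonym expansion, label normalization including deprecated labels, and fuzzy matching) can be illustrated with a minimal Python sketch. It is not the TERMite engine and not the authors' pipeline: the concept identifiers (`EX:0001`, `EX:0002`), the synonym lists, the deprecated-label mapping, the 0.9 similarity threshold, and the rule that short labels must match exactly are all assumptions chosen for illustration, and the fuzzy matcher is Python's standard `difflib.SequenceMatcher` rather than whatever the published system uses.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical toy vocabulary: identifiers and synonym lists are placeholders,
# not real OrphaNet records.
CONCEPTS = {
    "EX:0001": ["Marfan syndrome", "Marfan's syndrome", "MFS"],
    "EX:0002": ["cystic fibrosis", "mucoviscidosis"],
}
# Deprecated or renamed labels mapped to a current concept (illustrative only).
DEPRECATED = {"fibrocystic disease of pancreas": "EX:0002"}


def normalize(text):
    """Lowercase, strip accents, replace punctuation with spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def build_index(concepts, deprecated):
    """Map every normalized surface form (synonym or old label) to a concept."""
    index = {}
    for concept_id, labels in concepts.items():
        for label in labels:
            index[normalize(label)] = concept_id
    for label, concept_id in deprecated.items():
        index[normalize(label)] = concept_id
    return index


def annotate(text, index, threshold=0.9, min_fuzzy_len=8):
    """Return the set of concept ids whose labels occur in the text.

    Token windows are compared with labels of the same token count. Short
    labels (such as acronyms) must match exactly, which limits false hits
    from ambiguous abbreviations; longer labels may match approximately.
    """
    tokens = normalize(text).split()
    by_length = {}
    for label, concept_id in index.items():
        by_length.setdefault(len(label.split()), []).append((label, concept_id))

    hits = set()
    for n, entries in by_length.items():
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            for label, concept_id in entries:
                if window == label:
                    hits.add(concept_id)
                elif len(label) >= min_fuzzy_len:
                    score = SequenceMatcher(None, window, label).ratio()
                    if score >= threshold:
                        hits.add(concept_id)
    return hits


index = build_index(CONCEPTS, DEPRECATED)
print(annotate("A case of cystic fibrossis.", index))
```

In this sketch the misspelled mention "cystic fibrossis" is caught by the fuzzy comparison, while the acronym "MFS" is only accepted as an exact token. A production system would also need context-based disambiguation for polysemous names, which this example does not attempt.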

Original language: English
Article number: 3
Journal: Journal of Biomedical Semantics
Volume: 17
Issue number: 1
State: Published - Dec 2026

Keywords

  • Annotation
  • Bibliographic databases
  • Health sciences
  • Indexing
  • Natural language processing
  • Rare diseases
  • Research applications
  • Scientometrics
