How different is different? Systematically identifying distribution shifts and their impacts in NER datasets

Xue Li, Paul Groth

Research output: Contribution to journal › Article › peer-review


Abstract

When processing natural language, we are frequently confronted with the problem of distribution shift. For example, a model trained on a news corpus exhibits reduced performance when subsequently applied to legal text. While this problem is well-known, there has to this point been no systematic study of detecting shifts and investigating the impact shifts have on model performance for NLP tasks. Therefore, in this paper, we detect and measure two types of distribution shift, across three different representations, for 12 benchmark Named Entity Recognition datasets. We show that both input shift and label shift can lead to dramatic performance degradation. For example, fine-tuning on a wide-spectrum dataset (OntoNotes) and testing on an email dataset (CEREC) that shares labels leads to a 63-point drop in F1 performance. Overall, our results indicate that the measurement of distribution shift can provide guidance on the amount of data needed for fine-tuning and whether or not a model can be used “off-the-shelf” without subsequent fine-tuning. Finally, our results show that shift measurement can play an important role in NLP model pipeline definition.
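
The abstract does not spell out the exact shift measures used in the paper; as a minimal sketch of what "measuring label shift" between two NER datasets could look like, the snippet below computes the Jensen-Shannon divergence between their entity-label distributions. The tag sequences, dataset roles, and choice of divergence are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch (not the paper's exact method): quantify label shift
# between a source and a target NER dataset as the Jensen-Shannon divergence
# between their entity-label distributions.
import math
from collections import Counter


def label_distribution(tag_sequences):
    """Normalized frequency of entity labels, ignoring the 'O' tag."""
    counts = Counter(tag for seq in tag_sequences for tag in seq if tag != "O")
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}

    def kl(a, b):
        return sum(a[x] * math.log2(a[x] / b[x]) for x in a if a[x] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical toy tag sequences standing in for a source and a target corpus.
source_tags = [["B-PER", "I-PER", "O", "B-ORG"], ["O", "B-GPE", "O"]]
target_tags = [["B-PRODUCT", "O", "O"], ["B-ORG", "O", "B-PER"]]

shift = js_divergence(label_distribution(source_tags),
                      label_distribution(target_tags))
print(f"Estimated label shift (JS divergence): {shift:.3f}")
```

A value near 0 would indicate that the two datasets use entity labels with similar frequencies, while larger values would flag a label shift that may call for fine-tuning rather than off-the-shelf use.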

Original language: English
Pages (from-to): 1111-1150
Number of pages: 40
Journal: Language Resources and Evaluation
Volume: 59
Issue number: 2
DOIs
State: Published - Jun 2025
Externally published: Yes

Keywords

  • Distribution shift
  • Named entity recognition
