TY - JOUR
T1 - How different is different? Systematically identifying distribution shifts and their impacts in NER datasets
AU - Li, Xue
AU - Groth, Paul
N1 - Publisher Copyright:
© The Author(s) 2024.
PY - 2025/6
Y1 - 2025/6
AB - When processing natural language, we are frequently confronted with the problem of distribution shift. For example, a model trained on a news corpus exhibits reduced performance when subsequently applied to legal text. While this problem is well known, there has, to this point, been no systematic study of detecting such shifts and investigating their impact on model performance for NLP tasks. Therefore, in this paper, we detect and measure two types of distribution shift, across three different representations, for 12 benchmark Named Entity Recognition datasets. We show that both input shift and label shift can lead to dramatic performance degradation. For example, fine-tuning on a wide-spectrum dataset (OntoNotes) and testing on an email dataset (CEREC) that shares labels leads to a 63-point drop in F1 performance. Overall, our results indicate that the measurement of distribution shift can provide guidance on the amount of data needed for fine-tuning and on whether or not a model can be used “off-the-shelf” without subsequent fine-tuning. Finally, our results show that shift measurement can play an important role in NLP model pipeline definition.
KW - Distribution shift
KW - Named entity recognition
UR - http://www.scopus.com/inward/record.url?scp=85198915099&partnerID=8YFLogxK
U2 - 10.1007/s10579-024-09754-8
DO - 10.1007/s10579-024-09754-8
M3 - Article
AN - SCOPUS:85198915099
SN - 1574-020X
VL - 59
SP - 1111
EP - 1150
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
IS - 2
ER -