TY - GEN
T1 - A Deep Dive Into Cross-Dataset Entity Matching with Large and Small Language Models
AU - Zhang, Zeyu
AU - Groth, Paul
AU - Calixto, Iacer
AU - Schelter, Sebastian
N1 - Publisher Copyright: © 2025 OpenProceedings.org. All rights reserved.
PY - 2025/3/10
Y1 - 2025/3/10
AB - Entity matching (EM) is the problem of determining whether two data records refer to the same real-world entity. A particularly challenging scenario is cross-dataset entity matching, where the matcher has to work with an unseen target dataset for which no labelled examples are available. Cross-dataset EM is crucial in scenarios where a high level of automation is required, and where it is unlikely or impractical to force a domain expert to manually label training data. Recently, approaches based on language models have become popular for EM, and often promise impressive transfer capabilities. However, there is a lack of a comprehensive and systematic study of the cross-dataset EM capabilities of these recent approaches. It is unclear which categories of language models are actually applicable in a cross-dataset EM setting, how well current EM approaches perform when they are evaluated systematically under a cross-dataset setting, and what the relationship between the prediction quality and deployment cost of various large language model-based EM approaches is. We address these open questions with the first comprehensive and systematic study on cross-dataset entity matching, where we evaluate eight matchers on 11 benchmark datasets, cover a wide variety of model sizes and transfer learning approaches, and also explore and quantify the relation between prediction quality and deployment cost of the matching approaches. We find that fine-tuned small models can perform on par with prompted large models, that data-centric approaches outperform model-centric approaches, and that approaches using well-performing small models can be deployed at orders of magnitude lower cost than comparably performing approaches with large commercial models.
UR - http://www.scopus.com/inward/record.url?scp=105007872589&partnerID=8YFLogxK
U2 - 10.48786/edbt.2025.75
DO - 10.48786/edbt.2025.75
M3 - Conference contribution
AN - SCOPUS:105007872589
T3 - Advances in Database Technology - EDBT
SP - 922
EP - 934
BT - Advances in Database Technology - EDBT
PB - OpenProceedings.org
T2 - 28th International Conference on Extending Database Technology, EDBT 2025
Y2 - 25 March 2025 through 28 March 2025
ER -