A Deep Dive Into Cross-Dataset Entity Matching with Large and Small Language Models

Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Entity matching (EM) is the problem of determining whether two data records refer to the same real-world entity. A particularly challenging scenario is cross-dataset entity matching, where the matcher has to work with an unseen target dataset for which no labelled examples are available. Cross-dataset EM is crucial in scenarios where a high level of automation is required, and where it is unlikely or impractical to force a domain expert to manually label training data. Recently, approaches based on language models have become popular for EM, and often promise impressive transfer capabilities. However, there is a lack of a comprehensive and systematic study of the cross-dataset EM capabilities of these recent approaches. It is unclear which categories of language models are actually applicable in a cross-dataset EM setting, how well current EM approaches perform when they are evaluated systematically under a cross-dataset setting, and what the relationship between the prediction quality and deployment cost of various large language model-based EM approaches is. We address these open questions with the first comprehensive and systematic study on cross-dataset entity matching, where we evaluate eight matchers on 11 benchmark datasets, cover a wide variety of model sizes and transfer learning approaches, and also explore and quantify the relation between prediction quality and deployment cost of the matching approaches. We find that fine-tuned small models can perform on par with prompted large models, that data-centric approaches outperform model-centric approaches, and that approaches using well-performing small models can be deployed at a cost that is orders of magnitude lower than that of comparably performing approaches with large commercial models.

Original language: English
Title of host publication: Advances in Database Technology - EDBT
Publisher: OpenProceedings.org
Pages: 922-934
Number of pages: 13
Edition: 3
ISBN (Electronic): 9783893180981, 9783893180998
State: Published - Mar 10 2025
Externally published: Yes
Event: 28th International Conference on Extending Database Technology, EDBT 2025 - Barcelona, Spain
Duration: Mar 25 2025 - Mar 28 2025

Publication series

Name: Advances in Database Technology - EDBT
Number: 3
Volume: 28
ISSN (Electronic): 2367-2005

Conference

Conference: 28th International Conference on Extending Database Technology, EDBT 2025
Country/Territory: Spain
City: Barcelona
Period: 03/25/25 - 03/28/25
