Directions Towards Efficient and Automated Data Wrangling with Large Language Models

Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Data integration and cleaning have long been a key focus of the data management community. Recent research indicates the potential of large language models (LLMs) for such tasks. However, scaling and automating data wrangling with LLMs for real-world use cases poses additional challenges. Manual prompt engineering, for example, is expensive and hard to operationalise, while full fine-tuning of LLMs incurs high compute and storage costs. Following up on previous work, we evaluate parameter-efficient fine-tuning (PEFT) methods for efficiently automating data wrangling with LLMs. We conduct a study of four popular PEFT methods on differently sized LLMs for ten benchmark tasks, where we find that PEFT methods achieve performance on par with full fine-tuning, and that we can leverage small LLMs with negligible performance loss. However, even though such PEFT methods are parameter-efficient, they still incur high compute costs at training time and require labeled training data. We explore a zero-shot setting to further reduce deployment costs, and propose our vision for ZEROMATCH, a novel approach to zero-shot entity matching. It is based on maintaining a large number of pretrained LLM variants from different domains and intelligently selecting an appropriate variant at inference time.

Original language: English
Title of host publication: Proceedings - 2024 IEEE 40th International Conference on Data Engineering Workshops, ICDEW 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 301-304
Number of pages: 4
ISBN (Electronic): 9798350317152
State: Published - 2024
Externally published: Yes
Event: 40th IEEE International Conference on Data Engineering Workshops, ICDEW 2024 - Utrecht, Netherlands
Duration: May 13 2024 → May 16 2024

Publication series

Name: Proceedings - 2024 IEEE 40th International Conference on Data Engineering Workshops, ICDEW 2024

Conference

Conference: 40th IEEE International Conference on Data Engineering Workshops, ICDEW 2024
Country/Territory: Netherlands
City: Utrecht
Period: 05/13/24 → 05/16/24

Keywords

  • Data wrangling
  • Entity matching
  • Large language models
  • Parameter-efficient fine-tuning
