TY - GEN
T1 - Directions Towards Efficient and Automated Data Wrangling with Large Language Models
AU - Zhang, Zeyu
AU - Groth, Paul
AU - Calixto, Iacer
AU - Schelter, Sebastian
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Data integration and cleaning have long been a key focus of the data management community. Recent research indicates the potential of large language models (LLMs) for such tasks. However, scaling and automating data wrangling with LLMs for real-world use cases poses additional challenges. Manual prompt engineering, for example, is expensive and hard to operationalise, while full fine-tuning of LLMs incurs high compute and storage costs. Following up on previous work, we evaluate parameter-efficient fine-tuning (PEFT) methods for efficiently automating data wrangling with LLMs. We conduct a study of four popular PEFT methods on differently sized LLMs across ten benchmark tasks, where we find that PEFT methods achieve performance on par with full fine-tuning and that we can leverage small LLMs with negligible performance loss. However, even though such PEFT methods are parameter-efficient, they still incur high compute costs at training time and require labeled training data. We explore a zero-shot setting to further reduce deployment costs, and propose our vision for ZEROMATCH, a novel approach to zero-shot entity matching. It is based on maintaining a large number of pretrained LLM variants from different domains and intelligently selecting an appropriate variant at inference time.
KW - Data Wrangling
KW - Entity matching
KW - Large language models
KW - Parameter-efficient fine-tuning
UR - http://www.scopus.com/inward/record.url?scp=85197358159&partnerID=8YFLogxK
U2 - 10.1109/ICDEW61823.2024.00044
DO - 10.1109/ICDEW61823.2024.00044
M3 - Conference contribution
AN - SCOPUS:85197358159
T3 - Proceedings - 2024 IEEE 40th International Conference on Data Engineering Workshops, ICDEW 2024
SP - 301
EP - 304
BT - Proceedings - 2024 IEEE 40th International Conference on Data Engineering Workshops, ICDEW 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 40th IEEE International Conference on Data Engineering Workshops, ICDEW 2024
Y2 - 13 May 2024 through 16 May 2024
ER -