mlidea: Interactively Improving ML Data Preparation Code via “Shadow Pipelines”

Stefan Grafberger, Paul Groth, Sebastian Schelter

Research output: Contribution to journal › Conference article › peer-review

Abstract

Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. To address this challenge, we propose to assist data scientists with automatically derived interactive suggestions for pipeline improvements during this development cycle. We demonstrate mlidea, a library to generate interactive suggestions with so-called shadow pipelines, hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. Our system uses incremental view maintenance to enable data scientists to quickly iterate on their code and to ensure low-latency maintenance of the shadow pipelines. We demonstrate how our system improves code for various domains with three interactive shadow pipelines: fixing mislabeled rows, enhancing robustness against data quality problems, and improving pipeline performance on data slices with subpar predictions.
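The shadow-pipeline idea described in the abstract can be illustrated with a minimal, self-contained sketch. This is not mlidea's API: every name below (`fit_centroids`, `suspicious_rows`, `shadow_suggestion`, the toy data) is hypothetical, the "pipeline" is a one-dimensional nearest-centroid classifier, and the mislabel detector is a simple leave-one-out nearest-neighbour heuristic chosen for illustration. The sketch also reruns the variant from scratch, whereas the system described above relies on incremental view maintenance.

```python
from statistics import mean

def fit_centroids(rows):
    """Toy 'pipeline': a nearest-centroid model, i.e. the mean feature per label."""
    labels = {label for _, label in rows}
    return {l: mean(x for x, y in rows if y == l) for l in labels}

def accuracy(model, rows):
    """Fraction of rows whose label matches the nearest centroid."""
    predict = lambda x: min(model, key=lambda l: abs(x - model[l]))
    return sum(predict(x) == y for x, y in rows) / len(rows)

def suspicious_rows(rows):
    """Flag rows whose label disagrees with their nearest other row (illustrative heuristic)."""
    flagged = []
    for i, (x, y) in enumerate(rows):
        others = [r for j, r in enumerate(rows) if j != i]
        _, nearest_label = min(others, key=lambda r: abs(r[0] - x))
        if nearest_label != y:
            flagged.append(i)
    return flagged

def shadow_suggestion(train, valid):
    """Run the original pipeline and a hidden variant; suggest the variant only if it helps."""
    baseline = accuracy(fit_centroids(train), valid)
    flagged = suspicious_rows(train)
    cleaned = [r for i, r in enumerate(train) if i not in flagged]
    shadow = accuracy(fit_centroids(cleaned), valid)
    if shadow > baseline:
        return {"drop_rows": flagged, "baseline": baseline, "shadow": shadow}
    return None

# Toy data: (feature, label); the last training row is deliberately mislabeled.
TRAIN = [(1.0, 0), (2.0, 0), (3.0, 0), (9.0, 1), (10.0, 1), (11.0, 1), (40.0, 0)]
VALID = [(1.5, 0), (2.5, 0), (9.5, 1), (10.5, 1)]

print(shadow_suggestion(TRAIN, VALID))
# {'drop_rows': [6], 'baseline': 0.5, 'shadow': 1.0}
```

The returned dictionary plays the role of a suggestion with an explanation: it names the modification (rows to drop) and reports the validation accuracy before and after, so the user can decide whether to adopt the change.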

Original language: English
Pages (from-to): 5359-5362
Number of pages: 4
Journal: Proceedings of the VLDB Endowment
Volume: 18
Issue number: 12
State: Published - 2025
Event: 51st International Conference on Very Large Data Bases, VLDB 2025 - London, United Kingdom
Duration: Sep 1 2025 to Sep 5 2025
