Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"

Stefan Grafberger, Paul Groth, Sebastian Schelter

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Data scientists develop ML pipelines iteratively: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improvements. We discuss our vision of generating these suggestions with so-called shadow pipelines: hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. We envision applying optimisations based on incremental view maintenance to ensure low-latency computation and maintenance of the shadow pipelines. We conduct preliminary experiments to showcase the feasibility of our envisioned approach and the potential benefits of our proposed optimisations.

Original language: English
Title of host publication: Proceedings of the 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024 - In conjunction with the 2024 ACM SIGMOD/PODS Conference
Publisher: Association for Computing Machinery, Inc
Pages: 7-11
Number of pages: 5
ISBN (Electronic): 9798400706110
State: Published - Jun 9 2024
Event: 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024 - Santiago, Chile
Duration: Jun 9 2024 → Jun 9 2024

Publication series

Name: Proceedings of the 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024 - In conjunction with the 2024 ACM SIGMOD/PODS Conference

Conference

Conference: 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024
Country/Territory: Chile
City: Santiago
Period: 06/9/24 → 06/9/24
