Automated Data Skew Profiler

  • Mishra, Harsh (CoI)
  • Jayant, S (CoI)

Description

The objective of the project is to analyze how differently skewed data distributions affect the most common database operations, namely NORMALIZE, DENORMALIZE, JOIN, SORT, TABLE, and PROJECT, by running a set of queries and analyzing their runtimes. It also aims to estimate the effective performance skew of a set of queries from the data skew of the dataset on a multi-computing cluster. The project aims to automate skew prediction by analyzing the execution graphs of a job on the HPCC Systems cluster and predicting the probable performance skew for a given set of queries using a Random Forest Regressor model.
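
To make the notion of data skew concrete, the following is a minimal sketch, not the project's actual code. It assumes skew is measured as the relative deviation of the largest and smallest partition from the mean partition size, and that rows are assigned to nodes by key modulo the node count. The names `distribute` and `partition_skew` are hypothetical and chosen only for illustration.

```python
from statistics import mean


def distribute(keys, n_nodes):
    """Count how many rows land on each node when integer keys are
    assigned by key modulo n_nodes (an assumed partitioning scheme)."""
    counts = [0] * n_nodes
    for key in keys:
        counts[key % n_nodes] += 1
    return counts


def partition_skew(row_counts):
    """Return (max_skew, min_skew): the deviation of the largest and
    smallest partition from the mean, as a fraction of the mean
    (an assumed definition of skew)."""
    avg = mean(row_counts)
    if avg == 0:
        return 0.0, 0.0
    return (max(row_counts) - avg) / avg, (min(row_counts) - avg) / avg


# Uniform keys spread evenly across 4 nodes.
uniform_counts = distribute(range(400), 4)

# One "hot" key (0) repeated 700 times plus 300 distinct keys.
skewed_counts = distribute([0] * 700 + list(range(1, 301)), 4)

uniform_skew = partition_skew(uniform_counts)
skewed_skew = partition_skew(skewed_counts)
```

In this sketch the uniform dataset yields zero skew, while the hot key pushes one node far above the mean partition size. Per-partition measurements of this kind are one plausible input feature for a regressor that predicts performance skew.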
Status: Finished
Effective start/end date: 01/1/18 to 12/31/18