Automated Data Skew Profiler

  • Mishra, Harsh (CoI)
  • Jayant, S (CoI)


The objective of the project is to analyze how differently skewed data distributions affect the most common database operations, namely NORMALIZE, DENORMALIZE, JOIN, SORT, TABLE, and PROJECT, by running a set of queries and analyzing their runtimes. It also seeks to estimate the effective performance skew of a set of queries from the data skew of the dataset on a multi-computing cluster. The project aims to automate skew prediction by analyzing the execution graphs of a job on the HPCC Systems cluster and predicting the probable performance skew for a given set of queries using a Random Forest Regressor model.
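As a rough illustration of what "data skew" can mean on a cluster, the following Python sketch distributes integer keys across nodes and reports how far the fullest and emptiest nodes deviate from the average. It is not the project's code. The function names, the modulo-based partitioning, and the skew formula (deviation from the per-node average, as a percentage) are all assumptions made for this example.

```python
from collections import Counter


def partition_counts(keys, num_nodes):
    """Count how many records land on each node.

    Hypothetical partitioning rule: an integer key goes to node
    (key % num_nodes). A real cluster would use its own hash function.
    """
    counts = Counter(key % num_nodes for key in keys)
    return [counts.get(node, 0) for node in range(num_nodes)]


def skew_percent(counts):
    """Return (max_skew, min_skew) as percentages of the per-node average.

    Assumed definition: (count - average) * 100 / average, evaluated for
    the fullest and the emptiest node. (0.0, 0.0) means a perfectly even
    distribution.
    """
    average = sum(counts) / len(counts)
    if average == 0:
        return (0.0, 0.0)
    max_skew = (max(counts) - average) * 100 / average
    min_skew = (min(counts) - average) * 100 / average
    return (max_skew, min_skew)


if __name__ == "__main__":
    uniform_keys = list(range(400))
    # One "hot" key (0) dominates the dataset.
    skewed_keys = [0] * 700 + [1, 2, 3] * 100

    for label, keys in (("uniform", uniform_keys), ("skewed", skewed_keys)):
        counts = partition_counts(keys, num_nodes=4)
        print(label, counts, skew_percent(counts))
```

Per-node figures like these, collected for each operation in a job, are the kind of feature that a regression model could plausibly take as input when predicting performance skew. Which features the project actually uses is not stated here.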
Effective start/end date: 01/1/18 to 12/31/18