Data Skew Profiling using HPCC Systems

Harsh Mishra, S Jayant, Arjuna Chala, Dan Camper, G Shobha, Jyoti Shetty

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

Abstract

Over the last few decades, the volume of data available for analysis has increased tremendously across domains. Processing power has scaled up as well, but it is well known that data volumes have grown far faster than the processing capabilities of modern processors. This age of large data volumes came to be known as the era of big data. Its natural consequence was the distribution of data across multiple nodes, which facilitates not only storage but also parallel processing. Distributing data among machines, however, raised a fundamental problem in big data and distributed computing: the impact of data skew. We worked on a project to profile data skew on a multi-computing cluster, and this paper summarizes our efforts and findings. We use HPCC Systems, a modern big data management and analysis tool. We analyze the impact of differently skewed data distributions on the most common database operations, namely NORMALIZE, DENORMALIZE, JOIN, SORT, TABLE, and PROJECT, by running a set of queries and analyzing their runtimes.

Original language: American English
Title of host publication: ICBDE 2019 - 2019 International Conference on Big Data and Education
Pages: 66-69
Number of pages: 4
ISBN (Electronic): 9781450361866
DOI: https://doi.org/10.1145/3322134.3322142
State: Published - Mar 30 2019

Publication series

Name: ACM International Conference Proceeding Series

Fingerprint

Processing
Cluster computing
Distributed computer systems
Information management
Big data

Keywords

  • Big data
  • Clusters
  • HPCC
  • Multicomputing
  • Skew

Cite this

Mishra, H., Jayant, S., Chala, A., Camper, D., Shobha, G., & Shetty, J. (2019). Data Skew Profiling using HPCC Systems. In ICBDE 2019 - 2019 International Conference on Big Data and Education (pp. 66-69). (ACM International Conference Proceeding Series). https://doi.org/10.1145/3322134.3322142


Scopus record: http://www.scopus.com/inward/record.url?scp=85066091948&partnerID=8YFLogxK