Massively Scalable Parallel KMeans on the HPCC Systems Platform

Lili Xu, Amy Apon, Flavio Villanustre, Roger Dev, Arjuna Chala

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Clustering algorithms are an important part of unsupervised machine learning. With Big Data, applying clustering algorithms such as KMeans has become a challenge due to the significantly larger volume of data and the computational complexity of the standard approach, Lloyd's algorithm. This work tackles this challenge by transforming the classic KMeans clustering algorithm to be highly scalable and able to operate on Big Data. We leverage the distributed computing environment of the HPCC Systems platform. The presented KMeans algorithm adopts a hybrid parallelism method to achieve a massively scalable parallel KMeans. Our approach can save a significant amount of time for researchers and machine learning practitioners who train hundreds of models on a daily basis. The performance is evaluated with datasets and clusters of different sizes, and the results show significant scalability of the parallel KMeans algorithm.
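
For readers unfamiliar with Lloyd's algorithm, the following is a minimal Python sketch of how one iteration can be split across data partitions. It is an illustration only and not the hybrid parallelism method of the paper, which runs on the HPCC Systems platform. The function names (`nearest`, `partial_stats`, `lloyd_step`, `kmeans`) and the in-process loop over partitions are assumptions made for this example. In a distributed setting, each partition's statistics would presumably be computed on a separate node before being combined.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def partial_stats(partition: Sequence[Point], centroids: Sequence[Point]):
    """Per-cluster coordinate sums and counts for one partition (hypothetical helper)."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        k = nearest(point, centroids)
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    return sums, counts


def lloyd_step(partitions: Sequence[Sequence[Point]], centroids: Sequence[Point]) -> List[Point]:
    """One Lloyd iteration: combine per-partition statistics into new centroids."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for partition in partitions:  # independent work; run sequentially here
        sums, counts = partial_stats(partition, centroids)
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for d in range(dim):
                total_sums[k][d] += sums[k][d]
    # An empty cluster keeps its previous centroid.
    return [
        tuple(s / total_counts[k] for s in total_sums[k]) if total_counts[k] else tuple(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(partitions, centroids, max_iter: int = 100, tol: float = 1e-9) -> List[Point]:
    """Repeat Lloyd steps until no centroid moves more than `tol` (squared distance)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        updated = lloyd_step(partitions, centroids)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(u, c)) for u, c in zip(updated, centroids)
        )
        centroids = updated
        if shift <= tol:
            break
    return centroids
```

Because only the per-cluster sums and counts leave each partition, the amount of data combined per iteration depends on the number of clusters and dimensions, not on the number of points.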

Original language: English
Title of host publication: CSITSS 2019 - 2019 4th International Conference on Computational Systems and Information Technology for Sustainable Solution, Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728126197
DOIs
State: Published - Dec 2019
Event: 4th International Conference on Computational Systems and Information Technology for Sustainable Solution, CSITSS 2019 - Bengaluru, India
Duration: Dec 20 2019 → Dec 21 2019

Publication series

Name: CSITSS 2019 - 2019 4th International Conference on Computational Systems and Information Technology for Sustainable Solution, Proceedings

Conference

Conference: 4th International Conference on Computational Systems and Information Technology for Sustainable Solution, CSITSS 2019
Country/Territory: India
City: Bengaluru
Period: 12/20/19 → 12/21/19

Keywords

  • HPCC Systems
  • High Performance Computing
  • Hybrid Parallelism
  • Machine Learning
  • Scalable KMeans
