TY - JOUR
T1 - Massively scalable density based clustering (DBSCAN) on the HPCC Systems big data platform
AU - Yatish, H. R.
AU - Phal, Shubham Milind
AU - Hukkeri, Tanmay Sanjay
AU - Xu, Lili
AU - Shobha, G.
AU - Shetty, Jyoti
AU - Chala, Arjuna
N1 - Publisher Copyright: © 2021, Institute of Advanced Engineering and Science. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Dealing with large samples of unlabeled data is a key challenge in today’s world, especially in applications such as traffic pattern analysis and disaster management. DBSCAN (density-based spatial clustering of applications with noise) is a well-known density-based clustering algorithm. Its key strengths are its ability to detect outliers and to handle arbitrarily shaped clusters. However, because the algorithm is fundamentally sequential, it becomes expensive and time consuming on very large datasets. This paper presents a novel implementation of a parallel and distributed DBSCAN algorithm on the HPCC Systems platform. The implementation seeks to fully parallelize the algorithm by using the distributed architecture of HPCC Systems and by performing a tree-based union to merge local clusters. The proposed approach was tested on both synthetic and standard datasets (MFCCs Data Set) and found to be completely accurate. Additionally, when compared against a single-node setup, a significant decrease in computation time was observed with no impact on accuracy. The parallelized algorithm ran eight times faster for larger numbers of data points and took exponentially less time as the number of data points increased.
KW - Big data
KW - Data mining
KW - Density based clustering
KW - Distributed computing
KW - HPCC Systems
KW - Machine learning
UR - http://www.scopus.com/inward/record.url?scp=85103135577&partnerID=8YFLogxK
U2 - 10.11591/ijai.v10.i1.pp207-214
DO - 10.11591/ijai.v10.i1.pp207-214
M3 - Article
AN - SCOPUS:85103135577
SN - 2089-4872
VL - 10
SP - 207
EP - 214
JO - IAES International Journal of Artificial Intelligence
JF - IAES International Journal of Artificial Intelligence
IS - 1
ER -