This project implemented the DBSCAN clustering algorithm on a multi-node HPCC cluster, leveraging several of ECL’s existing paradigms for distributed computing. The high-dimensional data was first sprayed onto Thor. Next, local clustering was performed on each HPCC node, and the results were stored in a record structure. Finally, the local clusters across nodes were merged using a tree-based union-find data structure. An ECL interface was created to abstract the implementation and to let users choose among several distance metrics. The algorithm was compared against the standard implementations provided by Python machine learning packages such as scikit-learn. The results showed a significant speedup with no loss of accuracy.
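The merge step can be illustrated with a minimal Python sketch of a tree-based union-find. It is not the project's ECL code. The names `UnionFind` and `merge_local_clusters`, and the `(node_id, local_cluster_id, point_id)` row layout, are hypothetical. The sketch also assumes that two local clusters belong together whenever both report the same point; a full DBSCAN merge would typically restrict this to shared core points.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen items as their own singleton set.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        # Path halving: point each visited item at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def merge_local_clusters(local_results):
    """Map each point to a global cluster representative.

    local_results: iterable of (node_id, local_cluster_id, point_id) rows
    (an assumed layout, for illustration only).
    """
    uf = UnionFind()
    first_seen = {}  # point_id -> first (node, cluster) key that reported it
    for node_id, cluster_id, point_id in local_results:
        key = (node_id, cluster_id)
        uf.find(key)  # make sure the local cluster is registered
        if point_id in first_seen:
            # Same point reported by another local cluster: merge the two.
            uf.union(first_seen[point_id], key)
        else:
            first_seen[point_id] = key
    return {point: uf.find(key) for point, key in first_seen.items()}


rows = [
    (0, 0, "a"), (0, 0, "b"),  # node 0, local cluster 0
    (1, 0, "b"), (1, 0, "c"),  # node 1, local cluster 0 shares point "b"
    (1, 1, "d"),               # node 1, local cluster 1 stays separate
]
merged = merge_local_clusters(rows)
```

Here points `a`, `b`, and `c` end up with the same representative because the two local clusters share `b`, while `d` keeps its own. Union by size together with path halving keeps the trees shallow, so each merge runs in near-constant amortized time.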
Effective start/end date: 01/1/19 → 12/31/19