A parallel and distributed stochastic gradient descent implementation using commodity clusters

Robert Kennedy

Research output: Contribution to journal › Article

Abstract

Deep Learning is an increasingly important subdomain of artificial intelligence, which benefits from training on Big Data. The size and complexity of the model combined with the size of the training dataset makes the training process very computationally and temporally expensive. Accelerating the training process of Deep Learning using cluster computers faces many challenges ranging from distributed optimizers to the large communication overhead specific to systems with off-the-shelf networking components. In this paper, we present a novel distributed and parallel implementation of stochastic gradient descent (SGD) on a distributed cluster of commodity computers. We use high-performance computing cluster (HPCC) systems as the underlying cluster environment for the implementation. We overview how the HPCC systems platform provides the environment for distributed and parallel Deep Learning, how it provides a facility to work with third-party open source libraries such as TensorFlow, and detail our use of third-party libraries and HPCC functionality for implementation. We provide experimental results that validate our work and show that our implementation can scale with respect to both dataset size and the number of compute nodes in the cluster.
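A common pattern for the kind of distributed and parallel SGD the abstract describes is data-parallel training, where each compute node computes gradients on its own shard of the training data and the results are aggregated into one global model update. The sketch below is a minimal, single-process Python/NumPy simulation of that pattern for a linear least-squares model; it is a generic illustration only, not the paper's HPCC Systems/ECL and TensorFlow implementation, and the worker count, learning rate, loss, and full-shard (rather than mini-batch) gradient steps are all assumptions made for brevity.

import numpy as np

def local_gradient(w, X_shard, y_shard):
    # Mean-squared-error gradient for a linear model on one worker's data shard.
    preds = X_shard @ w
    return 2.0 * X_shard.T @ (preds - y_shard) / len(y_shard)

def distributed_sgd(X, y, n_workers=4, lr=0.01, steps=50):
    # Each "worker" holds a horizontal slice of the data; after every step the
    # workers' gradients are averaged and one global update is applied
    # (synchronous aggregation). A true SGD would sample mini-batches per step.
    w = np.zeros(X.shape[1])
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    for _ in range(steps):
        grads = [local_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
        w -= lr * np.mean(grads, axis=0)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
    y = X @ true_w + 0.1 * rng.normal(size=1000)
    print(distributed_sgd(X, y))  # should approach true_w

In a real cluster the averaging step is where the communication overhead mentioned in the abstract arises, since every node must exchange gradients (or model parameters) over commodity networking hardware at each synchronization point.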
Original language: American English
Article number: 16
Journal: Journal of Big Data
Volume: 6
Issue number: 1
DOI: 10.1186/s40537-019-0179-2
State: Published - Dec 1 2019

Fingerprint

Cluster computing
Artificial intelligence
Communication
Deep learning
Gradient
Commodities

Keywords

  • Big data
  • Cluster computer
  • Deep learning
  • HPCC systems
  • Neural network
  • Parallel and distributed processing
  • Parallel stochastic gradient descent

Cite this

@article{27e4426d9c024c339dfad113d68794fa,
title = "A parallel and distributed stochastic gradient descent implementation using commodity clusters",
abstract = "Deep Learning is an increasingly important subdomain of artificial intelligence, which benefits from training on Big Data. The size and complexity of the model combined with the size of the training dataset makes the training process very computationally and temporally expensive. Accelerating the training process of Deep Learning using cluster computers faces many challenges ranging from distributed optimizers to the large communication overhead specific to systems with off-the-shelf networking components. In this paper, we present a novel distributed and parallel implementation of stochastic gradient descent (SGD) on a distributed cluster of commodity computers. We use high-performance computing cluster (HPCC) systems as the underlying cluster environment for the implementation. We overview how the HPCC systems platform provides the environment for distributed and parallel Deep Learning, how it provides a facility to work with third-party open source libraries such as TensorFlow, and detail our use of third-party libraries and HPCC functionality for implementation. We provide experimental results that validate our work and show that our implementation can scale with respect to both dataset size and the number of compute nodes in the cluster.",
keywords = "Big data, Cluster computer, Deep learning, HPCC systems, Neural network, Parallel and distributed processing, Parallel stochastic gradient descent",
author = "Robert Kennedy",
year = "2019",
month = "12",
day = "1",
doi = "10.1186/s40537-019-0179-2",
language = "American English",
volume = "6",
journal = "Journal of Big Data",
number = "1",
}

A parallel and distributed stochastic gradient descent implementation using commodity clusters. / Kennedy, Robert.

In: Journal of Big Data, Vol. 6, No. 1, 16, 01.12.2019.

Research output: Contribution to journal › Article

TY - JOUR

T1 - A parallel and distributed stochastic gradient descent implementation using commodity clusters

AU - Kennedy, Robert

PY - 2019/12/1

Y1 - 2019/12/1

N2 - Deep Learning is an increasingly important subdomain of artificial intelligence, which benefits from training on Big Data. The size and complexity of the model combined with the size of the training dataset makes the training process very computationally and temporally expensive. Accelerating the training process of Deep Learning using cluster computers faces many challenges ranging from distributed optimizers to the large communication overhead specific to systems with off-the-shelf networking components. In this paper, we present a novel distributed and parallel implementation of stochastic gradient descent (SGD) on a distributed cluster of commodity computers. We use high-performance computing cluster (HPCC) systems as the underlying cluster environment for the implementation. We overview how the HPCC systems platform provides the environment for distributed and parallel Deep Learning, how it provides a facility to work with third-party open source libraries such as TensorFlow, and detail our use of third-party libraries and HPCC functionality for implementation. We provide experimental results that validate our work and show that our implementation can scale with respect to both dataset size and the number of compute nodes in the cluster.

AB - Deep Learning is an increasingly important subdomain of artificial intelligence, which benefits from training on Big Data. The size and complexity of the model combined with the size of the training dataset makes the training process very computationally and temporally expensive. Accelerating the training process of Deep Learning using cluster computers faces many challenges ranging from distributed optimizers to the large communication overhead specific to systems with off-the-shelf networking components. In this paper, we present a novel distributed and parallel implementation of stochastic gradient descent (SGD) on a distributed cluster of commodity computers. We use high-performance computing cluster (HPCC) systems as the underlying cluster environment for the implementation. We overview how the HPCC systems platform provides the environment for distributed and parallel Deep Learning, how it provides a facility to work with third-party open source libraries such as TensorFlow, and detail our use of third-party libraries and HPCC functionality for implementation. We provide experimental results that validate our work and show that our implementation can scale with respect to both dataset size and the number of compute nodes in the cluster.

KW - Big data

KW - Cluster computer

KW - Deep learning

KW - HPCC systems

KW - Neural network

KW - Parallel and distributed processing

KW - Parallel stochastic gradient descent

UR - http://www.scopus.com/inward/record.url?scp=85061476911&partnerID=8YFLogxK

U2 - 10.1186/s40537-019-0179-2

DO - 10.1186/s40537-019-0179-2

M3 - Article

VL - 6

JO - Journal of Big Data

JF - Journal of Big Data

IS - 1

M1 - 16

ER -