LOCALLY ACCUMULATED ADAM FOR DISTRIBUTED TRAINING WITH SPARSE UPDATES

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › Research

Abstract

The high bandwidth required for gradient exchange is a bottleneck for the distributed training of large transformer models. Most sparsification approaches focus on gradient compression for convolutional neural networks (CNNs) optimized by SGD. In this work, we show that performing local gradient accumulation when using Adam to optimize transformers in a distributed fashion leads to a misguided optimization direction, and we address this problem by accumulating the optimization direction locally instead. We also empirically demonstrate that most sparse gradients do not overlap, and thus show that sparsification is comparable to an asynchronous update. Our experiments on classification and segmentation tasks show that our method maintains the correct optimization direction in distributed training even under highly sparse updates.
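The abstract contrasts accumulating raw gradients locally (as in SGD-oriented sparsification schemes) with accumulating the Adam update direction locally. The paper's exact algorithm is not reproduced on this page, so the NumPy sketch below only illustrates that contrast under assumed details: the class name LocallyAccumulatedAdamWorker, the sparsity fraction k, and the top-k selection of communicated entries are hypothetical choices for illustration, not necessarily those used in the paper.

import numpy as np

class LocallyAccumulatedAdamWorker:
    """Sketch of one worker: Adam moments are updated from the dense local
    gradient, but only the top-k entries of the *accumulated update direction*
    are communicated; the remainder stays in a local residual buffer.
    (Illustrative only; names and the top-k sparsifier are assumptions.)"""

    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, k=0.01):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.k = max(1, int(k * dim))   # number of entries exchanged per step
        self.m = np.zeros(dim)          # first-moment estimate
        self.v = np.zeros(dim)          # second-moment estimate
        self.residual = np.zeros(dim)   # locally accumulated update direction
        self.t = 0

    def step(self, grad):
        """Update Adam moments and return a sparse (indices, values) update."""
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)       # bias correction
        v_hat = self.v / (1 - self.b2 ** self.t)
        # Accumulate the Adam direction locally (not the raw gradient).
        self.residual += self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        # Communicate only the k largest-magnitude accumulated entries.
        idx = np.argpartition(np.abs(self.residual), -self.k)[-self.k:]
        vals = self.residual[idx].copy()
        self.residual[idx] = 0.0        # keep the rest for later steps
        return idx, vals

# Toy usage: two workers exchange sparse updates on a shared parameter vector.
dim = 1000
params = np.zeros(dim)
workers = [LocallyAccumulatedAdamWorker(dim) for _ in range(2)]
for _ in range(5):
    for w in workers:
        idx, vals = w.step(np.random.randn(dim))   # stand-in for a local gradient
        np.subtract.at(params, idx, vals)          # apply the sparse update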
Original language: English
Title of host publication: 2023 IEEE International Conference on Image Processing
Place of Publication: IEEE
Publisher: IEEE
Pages: 2395-2399
Number of pages: 5
ISBN (Electronic): 978-1-7281-9835-4
ISBN (Print): 978-1-7281-9836-1
DOIs
Publication status: Published - 2023
Event: 2023 IEEE International Conference on Image Processing - Kuala Lumpur Convention Center (KLCC), Kuala Lumpur, Malaysia
Duration: 8 Oct 2023 – 11 Oct 2023
https://2023.ieeeicip.org/

Publication series

Name: IEEE International Conference on Image Processing
Publisher: IEEE
ISSN (Print): 1053-5888
ISSN (Electronic): 1558-0792

Conference

Conference: 2023 IEEE International Conference on Image Processing
Abbreviated title: ICIP2023
Country/Territory: Malaysia
City: Kuala Lumpur
Period: 8/10/23 – 11/10/23
Internet address: https://2023.ieeeicip.org/

Keywords

  • Distributed Learning
  • Vision Transformer
  • Gradient Compression
  • Optimization
