Abstract
The high bandwidth required for gradient exchange is a bottleneck for the distributed training of large transformer models. Most sparsification approaches focus on gradient compression for convolutional neural networks (CNNs) optimized by SGD. In this work, we show that performing local gradient accumulation when using Adam to optimize transformers in a distributed fashion leads to a misled optimization direction, and we address this problem by accumulating the optimization direction locally. We also empirically demonstrate that most sparse gradients do not overlap and thus show that sparsification is comparable to an asynchronous update. Our experiments with classification and segmentation tasks show that our method can still maintain the correct optimization direction in distributed training even under highly sparse updates.
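The idea of accumulating the optimization direction locally, rather than the raw gradient, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the sparsification rule, the aggregation step, or where the Adam moments live. The sketch assumes top-k magnitude selection, per-worker Adam moments, and simple averaging of the sparse updates; the names `LocalAdamWorker`, `top_k_indices`, and `apply_sparse_updates` are hypothetical.

```python
import math


def top_k_indices(values, k):
    """Return the indices of the k largest-magnitude entries (assumed sparsifier)."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return set(order[:k])


class LocalAdamWorker:
    """One worker: computes the Adam update direction locally, accumulates it,
    and transmits only the top-k accumulated entries at each step."""

    def __init__(self, dim, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.k, self.lr = k, lr
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = [0.0] * dim    # first-moment estimate
        self.v = [0.0] * dim    # second-moment estimate
        self.acc = [0.0] * dim  # accumulated update direction (not raw gradient)
        self.t = 0

    def step(self, grad):
        """Consume one local gradient; return the sparse update as {index: value}."""
        self.t += 1
        c1 = 1.0 - self.beta1 ** self.t  # bias corrections from standard Adam
        c2 = 1.0 - self.beta2 ** self.t
        for i, g in enumerate(grad):
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * g * g
            direction = (self.m[i] / c1) / (math.sqrt(self.v[i] / c2) + self.eps)
            self.acc[i] += self.lr * direction
        sent = {}
        for i in top_k_indices(self.acc, self.k):
            sent[i] = self.acc[i]
            self.acc[i] = 0.0  # unsent entries stay in acc for later steps
        return sent


def apply_sparse_updates(params, sparse_updates):
    """Average the workers' sparse updates and subtract them from the parameters."""
    n = len(sparse_updates)
    for update in sparse_updates:
        for i, value in update.items():
            params[i] -= value / n
    return params
```

In this sketch, entries that are not transmitted remain in `acc` and are sent once their accumulated magnitude is large enough, so each worker sends at most `k` values per step.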
Original language | English |
---|---|
Title of host publication | 2023 IEEE International Conference on Image Processing |
Place of Publication | IEEE |
Publisher | IEEE |
Pages | 2395-2399 |
Number of pages | 5 |
ISBN (Electronic) | 978-1-7281-9835-4 |
ISBN (Print) | 978-1-7281-9836-1 |
DOIs | |
Publication status | Published - 2023 |
Event | 2023 IEEE International Conference on Image Processing, Kuala Lumpur Convention Center (KLCC), Kuala Lumpur, Malaysia. Duration: 8 Oct 2023 → 11 Oct 2023. https://2023.ieeeicip.org/ |
Publication series
Name | IEEE International Conference on Image Processing |
---|---|
Publisher | IEEE |
ISSN (Print) | 1053-5888 |
ISSN (Electronic) | 1558-0792 |
Conference
Conference | 2023 IEEE International Conference on Image Processing |
---|---|
Abbreviated title | ICIP2023 |
Country/Territory | Malaysia |
City | Kuala Lumpur |
Period | 8/10/23 → 11/10/23 |
Internet address | https://2023.ieeeicip.org/ |
Keywords
- Distributed Learning
- Vision Transformer
- Gradient Compression
- Optimization
Projects
- 1 Finished
- FWOAL883: Video Processing for Multiview Multimodal Camera Systems
  Deligiannis, N. & Philips, W.
  1/01/18 → 31/12/21
  Project: Fundamental
Activities
- 1 Talk or presentation at a conference
- LOCALLY ACCUMULATED ADAM FOR DISTRIBUTED TRAINING WITH SPARSE UPDATES
  Yiming Chen (Speaker) & Nikolaos Deligiannis (Contributor)
  8 Oct 2023 → 11 Oct 2023
  Activity: Talk or presentation › Talk or presentation at a conference