Status | Published |
Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters | |
Mo, Zizhao; Xu, Huanle | |
2024-04-27 | |
Conference Name | 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024 |
Source Publication | International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS |
Volume | 2 |
Pages | 499-513 |
Conference Date | 27 April 2024 through 1 May 2024 |
Conference Place | San Diego |
Publisher | Association for Computing Machinery |
Abstract | Modern GPU clusters inherently exhibit heterogeneity, encompassing various aspects such as computation and communication. This heterogeneity poses a significant challenge for the elastic scheduling of deep learning workloads. Unfortunately, existing elastic schedulers often overlook the impact of heterogeneity on scaling efficiency, resulting in considerably prolonged job completion times. In this paper, we present Heet, a new Heterogeneity-aware system explicitly developed for elastic training in DL clusters. Heet addresses two critical issues. First, it utilizes a 3-D collaborative filtering method to accurately measure the scaling efficiency of all elastic configurations on heterogeneous hosts, substantially reducing profiling overhead. Second, Heet introduces a unique price function to effectively balance scaling efficiency and scheduling efficiency. Building upon this function, Heet incorporates a scalable mechanism that employs minimum-weight full bipartite matching and opportunistic resource trading to generate dynamic scheduling decisions. Evaluations conducted on cloud clusters and large-scale simulations demonstrate that Heet can reduce job completion time by up to 2.46× compared to existing solutions. |
DOI | 10.1145/3620665.3640375 |
Language | English |
Scopus ID | 2-s2.0-85192182384 |
Document Type | Conference paper |
Collection | Faculty of Science and Technology DEPARTMENT OF COMPUTER AND INFORMATION SCIENCE |
Affiliation | University of Macau, Macau, SAR, Macao |
First Author Affiliation | University of Macau |
Recommended Citation GB/T 7714 | Mo, Zizhao, Xu, Huanle, Xu, Chengzhong. Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters[C]: Association for Computing Machinery, 2024, 499-513. |
APA | Mo, Zizhao, Xu, Huanle, & Xu, Chengzhong (2024). Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters. International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS, 2, 499-513. |