Residential College: false
Status: Published
Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters
Mo, Zizhao; Xu, Huanle; Xu, Chengzhong
2024-04-27
Conference Name: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024
Source Publication: International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
Volume: 2
Pages: 499-513
Conference Date: 27 April 2024 through 1 May 2024
Conference Place: San Diego
Publisher: Association for Computing Machinery
Abstract

Modern GPU clusters inherently exhibit heterogeneity, encompassing various aspects such as computation and communication. This heterogeneity poses a significant challenge for the elastic scheduling of deep learning workloads. Unfortunately, existing elastic schedulers often overlook the impact of heterogeneity on scaling efficiency, resulting in considerably prolonged job completion times. In this paper, we present Heet, a new Heterogeneity-aware system explicitly developed for elastic training in DL clusters. Heet addresses two critical issues. First, it utilizes a 3-D collaborative filtering method to accurately measure the scaling efficiency of all elastic configurations on heterogeneous hosts, substantially reducing profiling overhead. Second, Heet introduces a unique price function to effectively balance scaling efficiency and scheduling efficiency. Building upon this function, Heet incorporates a scalable mechanism that employs minimum-weight full bipartite matching and opportunistic resource trading to generate dynamic scheduling decisions. Evaluations conducted on cloud clusters and large-scale simulations demonstrate that Heet can reduce job completion time by up to 2.46× compared to existing solutions.
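
The abstract names minimum-weight full bipartite matching as one building block of the scheduler. As a rough illustration of that general technique only, and not of Heet's actual price function or mechanism, the sketch below assigns each job to exactly one host so that the summed price is minimal. The function name `min_weight_full_matching`, the `prices` matrix, and its values are all hypothetical, and exhaustive enumeration stands in for the scalable solver a real system would need.

```python
from itertools import permutations


def min_weight_full_matching(cost):
    """Exact minimum-weight full matching by enumeration (tiny inputs only).

    cost[j][h] is a hypothetical price of placing job j on host h.
    The matrix must be square so that every job and every host is matched.
    """
    n = len(cost)
    if any(len(row) != n for row in cost):
        raise ValueError("cost matrix must be square")
    best_total, best_assign = float("inf"), None
    # Each permutation maps job j to host perm[j]; keep the cheapest one.
    for perm in permutations(range(n)):
        total = sum(cost[j][perm[j]] for j in range(n))
        if total < best_total:
            best_total, best_assign = total, perm
    return best_total, list(best_assign)


# Hypothetical prices: rows are jobs, columns are heterogeneous hosts.
prices = [
    [4.0, 1.0, 3.0],
    [2.0, 0.5, 5.0],
    [3.0, 2.0, 2.0],
]
total, assignment = min_weight_full_matching(prices)
print(total, assignment)
```

Enumeration costs O(n!) time, so it is only meant to make the objective concrete; polynomial-time assignment algorithms solve the same problem at cluster scale.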

DOI: 10.1145/3620665.3640375
Language: English
Scopus ID: 2-s2.0-85192182384
Document Type: Conference paper
Collection: Faculty of Science and Technology; Department of Computer and Information Science
Affiliation: University of Macau, Macau, SAR, Macao
First Author Affiliation: University of Macau
Recommended Citation
GB/T 7714: Mo, Zizhao, Xu, Huanle, Xu, Chengzhong. Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters[C]: Association for Computing Machinery, 2024, 499-513.
APA: Mo, Zizhao, Xu, Huanle, & Xu, Chengzhong (2024). Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters. International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS, 2, 499-513.