Residential College: false
Status: Published
Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences
Wang, Hulin1; Yang, Donglin2; Xia, Yaqi1; Zhang, Zheng1; Wang, Qigang3; Fan, Jianping3; Zhou, Xiaobo4; Cheng, Dazhao1
2024-07
Source Publication: IEEE TRANSACTIONS ON COMPUTERS
ISSN: 0018-9340
Volume: 73, Issue: 7, Pages: 1852-1865
Abstract

Transformer-based models have made significant advancements across various domains, largely due to the self-attention mechanism’s ability to capture contextual relationships in input sequences. However, processing long sequences remains computationally expensive for Transformer models, primarily due to the O(n²) complexity associated with self-attention. To address this, sparse attention has been proposed to reduce the quadratic dependency to linear. Nevertheless, deploying the sparse transformer efficiently encounters two major obstacles: 1) Existing system optimizations are less effective for the sparse transformer due to the algorithm’s approximation properties leading to fragmented attention, and 2) the variability of input sequences results in computation and memory access inefficiencies. We present Raptor-T, a cutting-edge transformer framework designed for handling long and variable-length sequences. Raptor-T harnesses the power of the sparse transformer to reduce resource requirements for processing long sequences while also implementing system-level optimizations to accelerate inference performance. To address the fragmented attention issue, Raptor-T employs fused and memory-efficient Multi-Head Attention. Additionally, we introduce an asynchronous data processing method to mitigate GPU-blocking operations caused by sparse attention. Furthermore, Raptor-T minimizes padding for variable-length inputs, effectively reducing the overhead associated with padding and achieving balanced computation on GPUs. In evaluation, we compare Raptor-T’s performance against state-of-the-art frameworks on an NVIDIA A100 GPU. The experimental results demonstrate that Raptor-T outperforms FlashAttention-2 and FasterTransformer, achieving an impressive average end-to-end performance improvement of 3.41X and 3.71X, respectively.
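The sketch below makes the abstract's cost argument concrete; it is illustrative only and not the Raptor-T implementation. The window size w, the pack_sequences helper, and the NumPy-based cost estimates are assumptions introduced here: it contrasts the roughly 2·n²·d multiply-adds of dense self-attention with the 2·n·w·d cost of a local-window sparse pattern, and shows how packing variable-length sequences with cumulative lengths avoids padding every sequence in a batch to the longest one.

# Illustrative sketch only -- not the Raptor-T implementation.
import numpy as np

def dense_attention_ops(n, d):
    # QK^T and (softmax(QK^T))V each cost about n*n*d multiply-adds.
    return 2 * n * n * d

def banded_attention_ops(n, d, w):
    # With a local window, each query attends to at most w keys,
    # so the cost grows linearly in the sequence length n.
    return 2 * n * w * d

def pack_sequences(seqs):
    # Concatenate variable-length sequences and record their boundaries
    # (cumulative lengths) instead of padding every one to the maximum.
    packed = np.concatenate(seqs, axis=0)
    cu_seqlens = np.cumsum([0] + [len(s) for s in seqs])
    return packed, cu_seqlens

lengths, d, w = [512, 4096, 1024], 64, 256
seqs = [np.random.randn(n, d).astype(np.float32) for n in lengths]
packed, cu_seqlens = pack_sequences(seqs)

padded_tokens = len(lengths) * max(lengths)  # what padding the batch would cost
print("packed tokens:", packed.shape[0], "vs padded tokens:", padded_tokens)
for n in lengths:
    print(n, "dense ops:", dense_attention_ops(n, d),
          "banded ops:", banded_attention_ops(n, d, w))

Under these assumptions, the packed batch holds 5,632 tokens versus 12,288 with padding, and the banded cost scales with n·w rather than n².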

Keywords: Sparse Transformer; Inference Acceleration; GPU; Deep Learning; Memory Optimization; Resource Management
DOI: 10.1109/TC.2024.3389507
Indexed By: SCIE
Language: English
WOS Research Area: Computer Science; Engineering
WOS Subject: Computer Science, Hardware & Architecture; Engineering, Electrical & Electronic
WOS ID: WOS:001246169700017
Publisher: IEEE COMPUTER SOC, 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1314
Scopus ID: 2-s2.0-85190736521
Document Type: Journal article
Collection: Faculty of Science and Technology; The State Key Laboratory of Internet of Things for Smart City (University of Macau); Department of Computer and Information Science
Corresponding Author: Cheng, Dazhao
Affiliations:
1. School of Computer Science, Wuhan University, Hubei 430072, China
2. Nvidia Corp., Santa Clara, CA 95051, USA
3. AI Lab, Lenovo Research, Beijing 100094, China
4. IOTSC & Department of Computer and Information Sciences, University of Macau, Macau S.A.R. 999078, China
Recommended Citation
GB/T 7714: Wang, Hulin, Yang, Donglin, Xia, Yaqi, et al. Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences[J]. IEEE TRANSACTIONS ON COMPUTERS, 2024, 73(7): 1852-1865.
APA: Wang, Hulin, Yang, Donglin, Xia, Yaqi, Zhang, Zheng, Wang, Qigang, Fan, Jianping, Zhou, Xiaobo, & Cheng, Dazhao. (2024). Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences. IEEE TRANSACTIONS ON COMPUTERS, 73(7), 1852-1865.
MLA: Wang, Hulin, et al. "Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences." IEEE TRANSACTIONS ON COMPUTERS 73.7 (2024): 1852-1865.
 

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.