Residential College | false
Status | Published
Title | Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences
Authors | Wang, Hulin1; Yang, Donglin2; Xia, Yaqi1; Zhang, Zheng1; Wang, Qigang3; Fan, Jianping3; Zhou, Xiaobo4; Cheng, Dazhao1
Date Issued | 2024-07
Source Publication | IEEE TRANSACTIONS ON COMPUTERS |
ISSN | 0018-9340 |
Volume | 73
Issue | 7
Pages | 1852-1865
Abstract | Transformer-based models have made significant advancements across various domains, largely due to the self-attention mechanism's ability to capture contextual relationships in input sequences. However, processing long sequences remains computationally expensive for Transformer models, primarily due to the O(n²) complexity associated with self-attention. To address this, sparse attention has been proposed to reduce the quadratic dependency to linear. Nevertheless, deploying the sparse transformer efficiently encounters two major obstacles: 1) Existing system optimizations are less effective for the sparse transformer due to the algorithm's approximation properties leading to fragmented attention, and 2) the variability of input sequences results in computation and memory access inefficiencies. We present Raptor-T, a cutting-edge transformer framework designed for handling long and variable-length sequences. Raptor-T harnesses the power of the sparse transformer to reduce resource requirements for processing long sequences while also implementing system-level optimizations to accelerate inference performance. To address the fragmented attention issue, Raptor-T employs fused and memory-efficient Multi-Head Attention. Additionally, we introduce an asynchronous data processing method to mitigate GPU-blocking operations caused by sparse attention. Furthermore, Raptor-T minimizes padding for variable-length inputs, effectively reducing the overhead associated with padding and achieving balanced computation on GPUs. In evaluation, we compare Raptor-T's performance against state-of-the-art frameworks on an NVIDIA A100 GPU. The experimental results demonstrate that Raptor-T outperforms FlashAttention-2 and FasterTransformer, achieving an impressive average end-to-end performance improvement of 3.41X and 3.71X, respectively.
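The sparse-attention idea the abstract refers to (reducing self-attention's quadratic dependency to linear) can be illustrated with a toy sliding-window variant, in which each query attends only to keys within a fixed-size neighborhood. This is a minimal sketch for illustration under that assumption; it is not Raptor-T's fused, memory-efficient implementation, and the `window` parameter and helper names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sparse_attention(Q, K, V, window=1):
    """Toy sliding-window attention: query i attends only to keys j
    with |i - j| <= window, so the cost is O(n * window) rather than
    the O(n^2) of dense self-attention."""
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Scaled dot-product scores over the local neighborhood only.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
                  for j in range(lo, hi)]
        w = softmax(scores)
        # Weighted sum of the attended value rows.
        out.append([sum(w[j - lo] * V[j][c] for j in range(lo, hi))
                    for c in range(len(V[0]))])
    return out
```

With `window=0` each position attends only to itself and simply returns its own value row; widening the window trades cost for context, while the dense baseline corresponds to a window covering the whole sequence.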
Keyword | Sparse Transformer; Inference Acceleration; GPU; Deep Learning; Memory Optimization; Resource Management
DOI | 10.1109/TC.2024.3389507 |
URL | View the original |
Indexed By | SCIE |
Language | English
WOS Research Area | Computer Science ; Engineering |
WOS Subject | Computer Science, Hardware & Architecture ; Engineering, Electrical & Electronic |
WOS ID | WOS:001246169700017 |
Publisher | IEEE COMPUTER SOC, 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1314 |
Scopus ID | 2-s2.0-85190736521 |
Document Type | Journal article |
Collection | Faculty of Science and Technology; The State Key Laboratory of Internet of Things for Smart City (University of Macau); Department of Computer and Information Science
Corresponding Author | Cheng, Dazhao |
Affiliation | 1. School of Computer Science, Wuhan University, Hubei 430072, China
2. Nvidia Corp., Santa Clara, CA 95051, USA
3. AI Lab, Lenovo Research, Beijing 100094, China
4. IOTSC & Department of Computer and Information Sciences, University of Macau, Macau S.A.R. 999078, China
Recommended Citation GB/T 7714 | Wang, Hulin, Yang, Donglin, Xia, Yaqi, et al. Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences[J]. IEEE TRANSACTIONS ON COMPUTERS, 2024, 73(7): 1852-1865.
APA | Wang, Hulin, Yang, Donglin, Xia, Yaqi, Zhang, Zheng, Wang, Qigang, Fan, Jianping, Zhou, Xiaobo, & Cheng, Dazhao. (2024). Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences. IEEE TRANSACTIONS ON COMPUTERS, 73(7), 1852-1865.
MLA | Wang, Hulin, et al. "Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences." IEEE TRANSACTIONS ON COMPUTERS 73.7 (2024): 1852-1865.
Files in This Item: | There are no files associated with this item. |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.