Residential College: false
Status: Published
GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models
Liao, Haicheng1; Shen, Huanming2; Li, Zhenning3; Wang, Chengyue4; Li, Guofa5; Bie, Yiming6; Xu, Chengzhong1
Date Issued: 2024-12
Source Publication: Communications in Transportation Research
ISSN: 2772-4247
Volume: 4; Pages: 100116
Abstract

In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework, developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders—Text, Emotion, Image, Context, and Cross-Modal—with a multimodal decoder. This integration enables the CAVG model to adeptly capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the implementation of multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This architectural design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG establishes new standards in prediction accuracy and operational efficiency. Notably, the model exhibits exceptional performance even with limited training data, ranging from 50% to 75% of the full dataset. This feature highlights its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments.
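Illustrative note: the multi-head cross-modal attention described in the abstract, in which encoded command text attends over candidate image-region features, can be sketched in a few lines of PyTorch. This is a minimal sketch under assumed tensor shapes and hypothetical module names, not the authors' CAVG implementation (which additionally comprises Text, Emotion, Image, Context, and Cross-Modal encoders, a Region-Specific Dynamic layer, and a multimodal decoder).

import torch
import torch.nn as nn

class CrossModalAttentionSketch(nn.Module):
    """Hypothetical text-to-region cross-attention block (not the paper's code)."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Standard multi-head attention; queries come from the text,
        # keys/values come from the image regions.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, region_feats):
        # text_tokens:  (batch, n_tokens, dim)  encoded natural-language command
        # region_feats: (batch, n_regions, dim) features of candidate image regions
        fused, weights = self.attn(query=text_tokens, key=region_feats, value=region_feats)
        # Residual connection and normalization; `weights` shows which regions
        # each command token attends to, which is the grounding signal of interest.
        return self.norm(fused + text_tokens), weights

# Example usage with random tensors standing in for encoder outputs:
layer = CrossModalAttentionSketch(dim=256, num_heads=8)
text = torch.randn(2, 12, 256)      # 2 commands, 12 tokens each
regions = torch.randn(2, 32, 256)   # 32 candidate regions per image
fused, weights = layer(text, regions)
print(fused.shape, weights.shape)   # (2, 12, 256) and (2, 12, 32)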

Keywords: Autonomous Driving; Cross-modal Attention; Human-machine Interaction; Large Language Models; Visual Grounding
DOI: 10.1016/j.commtr.2023.100116
Indexed By: ESCI
Language: English
WOS Research Area: Transportation
WOS Subject: Transportation; Transportation Science & Technology
WOS ID: WOS:001202487400001
Publisher: Elsevier, Radarweg 29, 1043 NX Amsterdam, Netherlands
Scopus ID: 2-s2.0-85185594715
Document Type: Journal article
Collection: Department of Civil and Environmental Engineering; Faculty of Science and Technology; The State Key Laboratory of Internet of Things for Smart City (University of Macau); Department of Computer and Information Science
Corresponding Author: Li, Zhenning; Xu, Chengzhong
Affiliation:
1. State Key Laboratory of Internet of Things for Smart City and Department of Computer and Information Science, University of Macau, Macau SAR, 999078, China
2. Department of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, 610000, China
3. State Key Laboratory of Internet of Things for Smart City and Departments of Civil and Environmental Engineering and Computer and Information Science, University of Macau, Macau SAR, 999078, China
4. State Key Laboratory of Internet of Things for Smart City and Department of Civil and Environmental Engineering, University of Macau, Macau SAR, 999078, China
5. College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, 400030, China
6. School of Transportation, Jilin University, Changchun, 130000, China
First Author Affiliation: University of Macau
Corresponding Author Affiliation: University of Macau
Recommended Citation
GB/T 7714: Liao, Haicheng, Shen, Huanming, Li, Zhenning, et al. GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models[J]. Communications in Transportation Research, 2024, 4: 100116.
APA: Liao, Haicheng, Shen, Huanming, Li, Zhenning, Wang, Chengyue, Li, Guofa, Bie, Yiming, & Xu, Chengzhong (2024). GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models. Communications in Transportation Research, 4, 100116.
MLA: Liao, Haicheng, et al. "GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models." Communications in Transportation Research 4 (2024): 100116.
Files in This Item:
There are no files associated with this item.