Status | Published |
MIVCN: Multimodal interaction video captioning network based on semantic association graph | |
Wang, Ying1; Huang, Guoheng1 | |
2021-08-07 | |
Source Publication | APPLIED INTELLIGENCE |
ISSN | 0924-669X |
Volume | 52, Issue: 5, Pages: 5241-5260 |
Abstract | Generating natural language captions from video input is a challenging task in computer vision. Videos are usually treated as feature sequences and fed into a Long Short-Term Memory (LSTM) network to generate natural language. To obtain a richer and more detailed representation of video content, a Multimodal Interaction Video Captioning Network based on Semantic Association Graph (MIVCN) is developed for this task. The network consists of two modules: a Semantic Association Graph Module (SAGM) and a Multimodal Attention Constraint Module (MACM). First, because they lack semantic interdependence, existing methods often produce illogical sentence structures. We therefore propose an SAGM based on information association, which enables the network to strengthen the connections between logically related language elements and weaken those between logically unrelated ones. Second, the features of each modality need to attend to different information, and the captured multimodal features are highly informative but also redundant. Based on this observation, we propose an LSTM-based MACM, which captures complementary visual features and filters out redundant ones. The MACM integrates multimodal features into the LSTM and makes the network screen for and focus on informative features. Through the association of semantic attributes and the interaction of multimodal features, the network captures semantically interdependent contextual information and visually complementary information, so that the informative representations in videos can be better used for caption generation. The proposed MIVCN achieves the best caption generation performance on MSVD: 56.8%, 36.4%, and 79.1% on the BLEU@4, METEOR, and ROUGE-L evaluation metrics, respectively. Results superior to state-of-the-art methods are also reported on MSR-VTT for BLEU@4, METEOR, and ROUGE-L. |
Keyword | Attention Mechanism; Gated Recurrent Unit; Graph Convolutional Network; Long-short Term Memory; Multimodal Fusion; Video Captioning |
DOI | 10.1007/s10489-021-02612-y |
Indexed By | SCIE |
Language | English |
WOS Research Area | Computer Science |
WOS Subject | Computer Science, Artificial Intelligence |
WOS ID | WOS:000682627000001 |
Scopus ID | 2-s2.0-85112645418 |
Document Type | Journal article |
Collection | Faculty of Science and Technology; DEPARTMENT OF COMPUTER AND INFORMATION SCIENCE |
Corresponding Author | Huang, Guoheng; Yuan, Haoliang; Pun, Chi Man; Ling, Wing Kuen |
Affiliation | 1. School of Computer, Guangdong University of Technology, Guangzhou, 510006, China; 2. School of Automation, Guangdong University of Technology, Guangzhou, 510006, China; 3. Department of Computer and Information Science, University of Macau, 999078, Macao; 4. School of Information Engineering, Guangdong University of Technology, Guangzhou, 510006, China |
Corresponding Author Affiliation | University of Macau |
Recommended Citation GB/T 7714 | Wang, Ying, Huang, Guoheng, Yuming, Lin, et al. MIVCN: Multimodal interaction video captioning network based on semantic association graph[J]. APPLIED INTELLIGENCE, 2021, 52(5), 5241-5260. |
APA | Wang, Ying., Huang, Guoheng., Yuming, Lin., Yuan, Haoliang., Pun, Chi Man., Ling, Wing Kuen., & Cheng, Lianglun (2021). MIVCN: Multimodal interaction video captioning network based on semantic association graph. APPLIED INTELLIGENCE, 52(5), 5241-5260. |
MLA | Wang, Ying, et al. "MIVCN: Multimodal interaction video captioning network based on semantic association graph". APPLIED INTELLIGENCE 52.5 (2021): 5241-5260. |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.