Residential College: false
Status: Published
The Neglected Tails in Vision-Language Models
Parashar, Shubham (3); Lin, Zhiqiu (1); Liu, Tian (3); Dong, Xiangjue (3); Li, Yanan (2); Ramanan, Deva (1); Caverlee, James (3); Kong, Shu (3,4)
2024
Conference Name: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)
Source Publication: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Pages: 12988-12997
Conference Date: 16-22 June 2024
Conference Place: Seattle
Publisher: IEEE Computer Society
Abstract

Vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 10% for more than ten concepts like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in VLMs' large-scale datasets is challenging. We address this by using large language models (LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms that popular datasets, such as LAION, exhibit a long-tailed concept distribution, yielding biased performance in VLMs. We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models (e.g., Stable Diffusion), often fail to recognize or generate images of rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented Learning (REAL). First, instead of prompting VLMs using the original class names, REAL uses their most frequent synonyms found in pretraining texts. This simple change already outperforms costly human-engineered and LLM-enriched prompts over nine benchmark datasets. Second, REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA, using 400× less storage and 10,000× less training time!
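The prompting idea described in the abstract (using a concept's most frequent synonym from the pretraining texts instead of its original class name) can be sketched roughly as follows. This is a minimal illustration only: the synonym table, frequency counts, class names, and the open_clip model/checkpoint chosen here are assumptions for demonstration, not the authors' released code or data.

# Hedged sketch of synonym-based zero-shot prompting, loosely following the
# abstract: pick each concept's most frequent synonym (counts here are
# hypothetical placeholders; the paper derives them from pretraining captions
# with LLM assistance), then run ordinary zero-shot CLIP classification.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Hypothetical concept -> {synonym: caption count} table (illustrative only).
synonyms = {
    "night snake": {"night snake": 1200, "hypsiglena": 300},
    "tiger": {"tiger": 2500000, "panthera tigris": 40000},
}
# For each concept, prompt with its most frequently occurring synonym.
chosen_names = [max(freqs, key=freqs.get) for freqs in synonyms.values()]
prompts = [f"a photo of a {name}" for name in chosen_names]

with torch.no_grad():
    text_feats = model.encode_text(tokenizer(prompts))
    text_feats /= text_feats.norm(dim=-1, keepdim=True)

def classify(image):
    """Return the predicted concept index for a PIL image."""
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(image).unsqueeze(0))
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ text_feats.T).argmax(dim=-1).item()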

Keywords: Long-Tailed Recognition; Vision-Language Models; Zero-Shot Recognition
DOI: 10.1109/CVPR52733.2024.01234
Language: English
Scopus ID: 2-s2.0-85206616723
Document Type: Conference paper
Collection: University of Macau
Affiliation:
1. Carnegie Mellon University, United States
2. Zhejiang Lab, China
3. Texas A&M University, United States
4. University of Macau, Macao
Recommended Citation
GB/T 7714: Parashar, Shubham, Lin, Zhiqiu, Liu, Tian, et al. The Neglected Tails in Vision-Language Models[C]. IEEE Computer Society, 2024: 12988-12997.
APA: Parashar, S., Lin, Z., Liu, T., Dong, X., Li, Y., Ramanan, D., Caverlee, J., & Kong, S. (2024). The Neglected Tails in Vision-Language Models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 12988-12997.
Files in This Item:
There are no files associated with this item.
 

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.