Residential College | false
Status | Published
Title | The Neglected Tails in Vision-Language Models
Authors | Parashar, Shubham (3); Lin, Zhiqiu (1); Liu, Tian (3); Dong, Xiangjue (3); Li, Yanan (2); Ramanan, Deva (1); Caverlee, James (3); Kong, Shu (3,4)
Year | 2024
Conference Name | 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 |
Source Publication | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Pages | 12988-12997 |
Conference Date | 16 June 2024 through 22 June 2024
Conference Place | Seattle |
Publisher | IEEE Computer Society |
Abstract | Vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 10% for more than ten concepts like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in VLMs' large-scale datasets is challenging. We address this by using large language models (LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms that popular datasets, such as LAION, exhibit a long-tailed concept distribution, yielding biased performance in VLMs. We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models (e.g., Stable Diffusion), often fail to recognize or generate images of rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented Learning (REAL). First, instead of prompting VLMs using the original class names, REAL uses their most frequent synonyms found in pretraining texts. This simple change already outperforms costly human-engineered and LLM-enriched prompts over nine benchmark datasets. Second, REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA, using 400× less storage and 10,000× less training time!
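The sketch below illustrates the first stage described in the abstract: zero-shot classification where each class is prompted by its most frequent pretraining synonym rather than its original name. It assumes the open_clip library with a LAION-trained ViT-B/32 checkpoint; the synonym map, prompt template, and image path `example.jpg` are hypothetical placeholders for illustration, not the paper's actual artifacts.

```python
# Minimal sketch: zero-shot CLIP classification using per-class synonyms
# (illustrative only; the real synonym list comes from counting pretraining texts).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Original class name -> most frequent synonym found in pretraining captions
# (hypothetical values chosen for illustration).
frequent_synonyms = {
    "night snake": "nocturnal snake",
    "tiger shark": "tiger shark",  # common classes often keep their own name
}
class_names = list(frequent_synonyms.keys())
prompts = [f"a photo of a {frequent_synonyms[c]}" for c in class_names]

with torch.no_grad():
    # Encode the synonym-based prompts once and L2-normalize them.
    text_features = model.encode_text(tokenizer(prompts))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Encode a query image and score it against every class prompt.
    image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    scores = (image_features @ text_features.T).squeeze(0)
    print("predicted class:", class_names[scores.argmax().item()])
```

Per the abstract, the second stage (a linear classifier fit on a small, class-balanced set of pretraining images retrieved with the same synonyms, over features from this frozen encoder) would follow; that retrieval step is not shown here.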
Keyword | Long Tailed Recognition; Vision-language Models; Zero-shot Recognition
DOI | 10.1109/CVPR52733.2024.01234 |
Language | English
Scopus ID | 2-s2.0-85206616723 |
Document Type | Conference paper |
Collection | University of Macau |
Affiliation | 1. Carnegie Mellon University, United States; 2. Zhejiang Lab, China; 3. Texas A&M University, United States; 4. University of Macau, Macao
Recommended Citation GB/T 7714 | Parashar, Shubham, Lin, Zhiqiu, Liu, Tian, et al. The Neglected Tails in Vision-Language Models[C]. IEEE Computer Society, 2024: 12988-12997.
APA | Parashar, S., Lin, Z., Liu, T., Dong, X., Li, Y., Ramanan, D., Caverlee, J., & Kong, S. (2024). The Neglected Tails in Vision-Language Models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 12988-12997.