arxiv: https://arxiv.org/pdf/2504.01017
blog: https://davidfan.io/webssl/
TITLE
“Scaling Language-Free Visual Representation Learning”
AUTHORS & AFFILIATIONS
• David Fan¹,⁎, Shengbang Tong¹,²,⁎, Jiachen Zhu¹,², Koustuv Sinha¹, Zhuang Liu¹,³, Xinlei Chen¹,
Michael Rabbat¹, Nicolas Ballas¹, Yann LeCun¹,², Amir Bar¹,†, Saining Xie²,†
• ¹ FAIR, Meta; ² New York University; ³ Princeton University
• ⁎ equal contribution, † equal advising
PUBLICATION DATE
April 1, 2025
MAIN THEMES
• Comparing purely visual Self-Supervised Learning (SSL) to language-supervised CLIP (Contrastive Language-Image Pretraining) in multimodal tasks.
• Exploring whether “language-free” models truly lag behind language-based methods or if data differences drive most performance gaps.
• Demonstrating that visual SSL at large scale (data and model size) can match or exceed text-supervised approaches on a range of tasks, particularly visual question answering (VQA).
KEY FINDINGS & CAPABILITIES
1. Visual SSL Matches CLIP-Level Performance at Scale
• “Pure visual SSL can match language-supervised visual pretraining at scale.”
• When both are trained on identical web-scale datasets (“MC-2B,” 2B images), the gap in VQA performance diminishes or even reverses at high model capacities (up to 7B parameters).
2. Strong Performance Across Diverse VQA Tasks
• Achieves “CLIP-level performance on a wide range of VQA and classic vision benchmarks.”
• This includes challenging OCR & Chart tasks—traditionally considered highly text-dependent—where SSL-based models now perform comparably, and even better when the training data is filtered appropriately.
3. Data Composition is Crucial
• Training on data subsets containing a higher ratio of text (charts, documents, etc.) substantially boosts OCR & Chart performance.
• “Simple data filtering can outperform language supervision on the full data.”
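The paper's exact filtering pipeline is not reproduced here, but the idea—keep only images likely to contain text (charts, documents) before pretraining—can be sketched as below. The `text_score` field is a hypothetical precomputed signal (e.g., fraction of image area covered by OCR boxes); the authors' actual criterion may differ.

```python
def filter_text_heavy(records, threshold=0.5):
    """Keep only images whose text-content score meets a threshold.

    `records` is a list of dicts with a hypothetical precomputed
    'text_score' field; higher means more visible text in the image.
    """
    return [r for r in records if r["text_score"] >= threshold]

# Toy corpus: a chart image, a document scan, and a nature photo.
corpus = [
    {"id": "chart_01", "text_score": 0.80},
    {"id": "doc_02",   "text_score": 0.60},
    {"id": "photo_03", "text_score": 0.05},
]

subset = filter_text_heavy(corpus, threshold=0.5)
print([r["id"] for r in subset])  # → ['chart_01', 'doc_02']
```

The point of the finding is that pretraining on such a text-heavy subset alone can beat language supervision on the full, unfiltered data for OCR & Chart tasks.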
4. Larger Models Continue to Improve
• SSL performance scales log-linearly with higher parameter counts, especially on tasks involving complex visuals and embedded text.
• In contrast, the authors find that CLIP’s performance tends to plateau beyond moderate scales.
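"Scales log-linearly" means benchmark score grows roughly as a straight line in the logarithm of parameter count. A minimal sketch of checking this with a least-squares fit follows; the (size, score) pairs are illustrative numbers, not values from the paper.

```python
import numpy as np

# Hypothetical (model size in billions of parameters, avg VQA score)
# pairs -- illustrative only, not taken from the paper.
params_b = np.array([1.0, 2.0, 3.0, 5.0, 7.0])
scores   = np.array([52.0, 54.1, 55.3, 56.8, 57.8])

# Fit score = a * log(params) + b. Log-linear scaling means the points
# fall on a straight line when the x-axis is logarithmic.
a, b = np.polyfit(np.log(params_b), scores, deg=1)
print(f"slope per log-unit of params: {a:.2f}, intercept: {b:.2f}")
```

A positive slope with small residuals indicates the no-saturation behavior the authors report for visual SSL up to 7B parameters; a plateauing model (as they observe for CLIP) would show the curve bending below the fitted line at larger sizes.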
5. Competitive on Classic Vision Benchmarks
• Although the paper’s evaluation centers on VQA, SSL models still “maintain competitive traditional vision performance” on classification (ImageNet), segmentation (ADE20k), and depth estimation (NYUv2).
QUOTES FROM THE TEXT (EXCERPTS)
• “Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?”
• “Visual SSL… performance does not saturate even after scaling up to 7B parameters.”
• “Pure visual SSL models… can match language-supervised visual pretraining at scale.”
• “Simple data filtering can outperform language supervision on the full data.”
APPLICATIONS IN BUSINESS
• OCR & Data Digitization: Businesses requiring OCR capabilities (e.g., document processing, invoice scanning, chart/diagram reading) can leverage scaled SSL models without needing text-labeled data.
• Large-Scale Visual Search: Enterprises with massive unstructured image collections can reduce annotation costs by turning to purely visual SSL for feature extraction and search indexing.
• Multimedia Analytics & Insights: Companies analyzing user-generated content, social media images, or e-commerce product catalogs can use SSL-based encoders to perform advanced visual question answering, categorization, and insight extraction.
• Cost-Effective Vision AI Deployment: Eliminating the reliance on language-text pairs can reduce overhead for data collection and annotation, making powerful vision AI more accessible.
OTHER RELEVANT INFORMATION
• Evaluation Protocol:
– Uses Cambrian-1 VQA suite (16 tasks: General, Knowledge, OCR & Chart, Vision-Centric).
– Complementary tests on classic benchmarks (ImageNet, ADE20k, NYUv2) show SSL’s robustness.
• Future Outlook:
– Results suggest further potential for “vision-centric” training.
– Emphasizes data diversity and targeted filtering for specialized tasks.
– Encourages pushing model sizes beyond 7B and exploring even more diverse datasets.
• Open-Sourcing & Community Impact:
– The authors plan to open-source these “Web-SSL” vision models (1B to 7B parameters).
– Aims to cultivate broader adoption of purely visual SSL in an era that has largely favored CLIP-based approaches.