arxiv: https://arxiv.org/pdf/2504.01017
blog: https://davidfan.io/webssl/
TITLE
“Scaling Language-Free Visual Representation Learning”
AUTHORS & AFFILIATIONS
• David Fan¹,⁎, Shengbang Tong¹,²,⁎, Jiachen Zhu¹,², Koustuv Sinha¹, Zhuang Liu¹,³, Xinlei Chen¹,
Michael Rabbat¹, Nicolas Ballas¹, Yann LeCun¹,², Amir Bar¹,†, Saining Xie²,†
• ¹ FAIR, Meta; ² New York University; ³ Princeton University
• ⁎ equal contribution, † equal advising
PUBLICATION DATE
April 1, 2025
MAIN THEMES
• Comparing purely visual Self-Supervised Learning (SSL) to language-supervised CLIP (Contrastive Language-Image Pretraining) in multimodal tasks.
• Exploring whether “language-free” models truly lag behind language-based methods or if data differences drive most performance gaps.
• Demonstrating that visual SSL at large scale (data and model size) can match or exceed text-supervised approaches on a range of tasks, particularly visual question answering (VQA).
KEY FINDINGS & CAPABILITIES
1. Visual SSL Matches CLIP-Level Performance at Scale
• “Pure visual SSL can match language-supervised visual pretraining at scale.”
• When both are trained on identical web-scale datasets (“MC-2B,” 2B images), the gap in VQA performance diminishes or even reverses at high model capacities (up to 7B parameters).
2. Strong Performance Across Diverse VQA Tasks
• Achieves “CLIP-level performance on a wide range of VQA and classic vision benchmarks.”
• This includes challenging OCR & Chart tasks—traditionally considered highly text-dependent—where SSL-based models now perform comparably, and even better when the training data is filtered appropriately.
3. Data Composition is Crucial
• Training on data subsets containing a higher ratio of text (charts, documents, etc.) substantially boosts OCR & Chart performance.
• “Simple data filtering can outperform language supervision on the full data.”
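The paper's exact filtering pipeline is not reproduced here, but the idea—keep only images likely to contain text (charts, documents) before pretraining—can be sketched as below. The `text_score` field is a hypothetical precomputed signal (e.g., fraction of image area covered by OCR boxes); the authors' actual criterion may differ.

```python
def filter_text_heavy(records, threshold=0.5):
    """Keep only images whose text-content score meets a threshold.

    `records` is a list of dicts with a hypothetical precomputed
    'text_score' field; higher means more visible text in the image.
    """
    return [r for r in records if r["text_score"] >= threshold]

# Toy corpus: a chart image, a document scan, and a nature photo.
corpus = [
    {"id": "chart_01", "text_score": 0.80},
    {"id": "doc_02",   "text_score": 0.60},
    {"id": "photo_03", "text_score": 0.05},
]

subset = filter_text_heavy(corpus, threshold=0.5)
print([r["id"] for r in subset])  # → ['chart_01', 'doc_02']
```

The point of the finding is that pretraining on such a text-heavy subset alone can beat language supervision on the full, unfiltered data for OCR & Chart tasks.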
4. Larger Models Continue to Improve
• SSL performance scales log-linearly with higher parameter counts, especially on tasks involving complex visuals and embedded text.
• In contrast, the authors find that CLIP’s performance tends to plateau beyond moderate scales.
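"Scales log-linearly" means benchmark score grows roughly as a straight line in the logarithm of parameter count. A minimal sketch of checking this with a least-squares fit follows; the (size, score) pairs are illustrative numbers, not values from the paper.

```python
import numpy as np

# Hypothetical (model size in billions of parameters, avg VQA score)
# pairs -- illustrative only, not taken from the paper.
params_b = np.array([1.0, 2.0, 3.0, 5.0, 7.0])
scores   = np.array([52.0, 54.1, 55.3, 56.8, 57.8])

# Fit score = a * log(params) + b. Log-linear scaling means the points
# fall on a straight line when the x-axis is logarithmic.
a, b = np.polyfit(np.log(params_b), scores, deg=1)
print(f"slope per log-unit of params: {a:.2f}, intercept: {b:.2f}")
```

A positive slope with small residuals indicates the no-saturation behavior the authors report for visual SSL up to 7B parameters; a plateauing model (as they observe for CLIP) would show the curve bending below the fitted line at larger sizes.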
5. Competitive on Classic Vision Benchmarks
• Although the paper’s evaluation centers on VQA, SSL models still “maintain competitive traditional vision performance” on classification (ImageNet), segmentation (ADE20k), and depth estimation (NYUv2).
QUOTES FROM THE TEXT (EXCERPTS)
• “Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?”
• “Visual SSL… performance does not saturate even after scaling up to 7B parameters.”
• “Pure visual SSL models… can match language-supervised visual pretraining at scale.”
• “Simple data filtering can outperform language supervision on the full data.”
APPLICATIONS IN BUSINESS
• OCR & Data Digitization: Businesses requiring OCR capabilities (e.g., document processing, invoice scanning, chart/diagram reading) can leverage scaled SSL models without needing text-labeled data.
• Large-Scale Visual Search: Enterprises with massive unstructured image collections can reduce annotation costs by turning to purely visual SSL for feature extraction and search indexing.
• Multimedia Analytics & Insights: Companies analyzing user-generated content, social media images, or e-commerce product catalogs can use SSL-based encoders to perform advanced visual question answering, categorization, and insight extraction.
• Cost-Effective Vision AI Deployment: Eliminating the reliance on language-text pairs can reduce overhead for data collection and annotation, making powerful vision AI more accessible.
OTHER RELEVANT INFORMATION
• Evaluation Protocol:
– Uses Cambrian-1 VQA suite (16 tasks: General, Knowledge, OCR & Chart, Vision-Centric).
– Complementary tests on classic benchmarks (ImageNet, ADE20k, NYUv2) show SSL’s robustness.
• Future Outlook:
– Results suggest further potential for “vision-centric” training.
– Emphasizes data diversity and targeted filtering for specialized tasks.
– Encourages pushing model sizes beyond 7B and exploring even more diverse datasets.
• Open-Sourcing & Community Impact:
– The authors plan to open-source these “Web-SSL” vision models (1B to 7B parameters).
– Aims to cultivate broader adoption of purely visual SSL in an era that has largely favored CLIP-based approaches.