Massimiliano Viola's Recent LinkedIn Posts

Massimiliano Viola

@massimiliano-viola

Research @Stanford | ML @Cerrion | Computer Vision • 3D • Generative Models

Tech & AI

2mo

This DEFINITELY flew under the radar: just a few days ago, AI at Meta released V-JEPA 2.1, taking a massive step toward closing the gap between the image and video domains. For a long time, image backbones were the only option for solving dense vision tasks. This model shows otherwise: universal spatial understanding also emerges from large-scale video models! 🎥

Quick recap on V-JEPA: it is a joint embedding predictive architecture built on a classic teacher-student setup. The teacher sees the full video, and its weights slowly update as an exponential moving average of the student's. The student sees a masked input and predicts the latent features of the missing regions rather than reconstructing them in pixel space.

What changed between V1 and V2 was largely a matter of scale. The encoder grew to a 1B-parameter ViT-g, the dataset went from 2M to 22M videos, training got longer and became progressive, and clips were pushed to higher temporal and spatial resolution. V2 also introduced images into the mix via temporal duplication, training on 1M ImageNet samples.

But the difference between V2 and V2.1 is conceptual, not just a matter of scale. Sure, they pushed the model to 2B parameters and expanded the image dataset from 1M to 142M, but the real breakthrough lies in the training loss.

In V-JEPA 2, supervision was applied only to the masked regions, even though the predictor outputs a token for every input, masked or not. As a result, the visible tokens were free to ignore local structure and aggregate global information whenever that minimized the loss, much like register tokens. V-JEPA 2.1 fixes this by extending supervision to the visible tokens too. Every patch, masked or visible, now has a training signal forcing it to encode where things actually are in space and time. This results in feature maps that look nothing like before: spatially structured, semantically coherent, and temporally consistent.
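To make the loss change concrete, here is a minimal pure-Python sketch of the two ideas above: the teacher's exponential moving average update, and a latent loss that can be restricted to masked tokens or extended to every token. This is an illustration, not Meta's implementation. The function names (`ema_update`, `latent_loss`), the L1 distance, the momentum value, and the equal weighting of masked and visible tokens are all assumptions made for this example.

```python
from typing import List, Sequence


def ema_update(teacher: List[float], student: Sequence[float],
               momentum: float = 0.99) -> None:
    """Update teacher weights in place: teacher <- m * teacher + (1 - m) * student.

    The momentum default is an illustrative assumption, not a published setting.
    """
    for i, s in enumerate(student):
        teacher[i] = momentum * teacher[i] + (1.0 - momentum) * s


def token_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between one predicted token and its teacher target.

    L1 is an assumed choice of distance for this sketch.
    """
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def latent_loss(preds: Sequence[Sequence[float]],
                targets: Sequence[Sequence[float]],
                masked: Sequence[bool],
                supervise_visible: bool) -> float:
    """Average per-token loss in latent space.

    supervise_visible=False: only masked tokens are scored, so visible tokens
    receive no direct training signal.
    supervise_visible=True: every token is scored. Equal weighting of masked
    and visible tokens is an assumption of this sketch.
    """
    losses = [
        token_l1(p, t)
        for p, t, is_masked in zip(preds, targets, masked)
        if is_masked or supervise_visible
    ]
    if not losses:
        raise ValueError("no supervised tokens: nothing is masked")
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Three tokens with 2-D features; the middle one is visible (not masked).
    preds = [[1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
    targets = [[1.0, 1.0], [1.0, 1.0], [2.0, 4.0]]
    masked = [True, False, True]

    # The visible token's prediction is wrong, but only the second call sees it.
    print("masked only:", latent_loss(preds, targets, masked, False))
    print("all tokens: ", latent_loss(preds, targets, masked, True))
```

In the toy example, the visible token's prediction is off, yet the masked-only loss never penalizes it. Scoring all tokens does, which is the intuition behind why the visible tokens are pushed to stay spatially grounded.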
Looking at the features below, you would almost think this is some small variant of DINOv3 (with due respect), except these results came from video pretraining! 🤯

This feature quality obviously translates to downstream tasks. Motion benchmarks got only a small boost, but spatial tasks are where the gains are staggering, with improvements ranging anywhere from 30 to 95%.

The idea that we now basically have a SOTA image encoder baked into video features is crazy to me. As someone working with video models on a daily basis, I could not be happier to put this to the test and distill it down into variants even smaller and faster than the smallest 80M one.

Resources are down in the comments. Try it out if you were using the previous version, and let me know how it goes! ⏬