What happens when you treat vision as a first-class citizen during multimodal pretraining? To find out, we studied the design space of training Transfusion-style models from scratch, with every modality as both input and output. Here is what we learned about visual representations, data, world modeling, architecture, and scaling behavior!
Paper: https://lnkd.in/eD3UtiFs
Website: https://lnkd.in/eukQzCiZ
Tweet: https://lnkd.in/e2RZQ4UM
1. One encoder (e.g. SigLIP or WebSSL) suffices for both understanding and generation, mirroring recent findings in RAE, and simplifying design. Using separate encoders just adds unnecessary complexity.
2. Vision is compatible with language modeling. Multimodal pretraining also benefits tasks such as visual understanding and generation, and even world modeling.
3. World models and multimodal models are converging! By simply passing in actions as text tokens, multimodal models can naturally learn tasks such as navigation, and be used for MPC planning. No latent actions, no specialized architecture. We even show qualitative out-of-distribution controllability with free-form language like “get out of the shadow!”, unlike works such as Genie. I'm personally very excited about this :)
4. Multimodal MoE works! MoE is well established for LLMs. But what about multimodal models? We find that when paired with RAE, multimodal MoE benefits greatly from higher granularity and sparsity.
5. Vision and language scale asymmetrically. Our Chinchilla-style IsoFLOP analysis shows that vision is more "data hungry" than language. However, MoE bridges this asymmetry, and is thus a key architectural ingredient for native multimodal pretraining.
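For anyone new to the term in point 5, here is a minimal sketch of what a Chinchilla-style IsoFLOP fit looks like. It is not the paper's code or data: the loss values are synthetic, the function names are made up for illustration, and it assumes the common approximation C ≈ 6·N·D (compute ≈ 6 × parameters × tokens) for dense transformers. The idea: at each fixed compute budget, fit a parabola to loss versus log(model size), take its minimum as the compute-optimal size, then fit power laws across budgets. One way to read "data hungry" is a larger data exponent, though that reading is an assumption here.

```python
import math

def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    A = [[s4, s3, s2], [s3, s2, s1], [s2, s1, n]]
    rhs = [t2, t1, t0]
    d = det3(A)
    coeffs = []
    for col in range(3):  # Cramer's rule, one coefficient per column
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = rhs[r]
        coeffs.append(det3(M) / d)
    return tuple(coeffs)

def isoflop_optimum(sizes, losses):
    """Model size minimizing the fitted parabola of loss vs log10(size)."""
    xs = [math.log10(n) for n in sizes]
    mean = sum(xs) / len(xs)  # center x for better numerical conditioning
    a, b, _ = fit_parabola([x - mean for x in xs], losses)
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    return 10 ** (mean - b / (2 * a))

def power_law_exponent(budgets, values):
    """Slope of log10(values) vs log10(budgets), i.e. the scaling exponent."""
    xs = [math.log10(c) for c in budgets]
    ys = [math.log10(v) for v in values]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

# Synthetic IsoFLOP curves (made-up numbers, NOT results from the paper).
budgets = [1e18, 1e19, 1e20]  # FLOPs per curve
optima = []
for c in budgets:
    x_opt = 0.5 * math.log10(c) - 1.0  # planted optimum, so N_opt ~ C^0.5
    sizes = [10 ** (x_opt + d) for d in (-1.0, -0.5, 0.0, 0.5, 1.0)]
    losses = [2.0 + 0.1 * (math.log10(n) - x_opt) ** 2 for n in sizes]
    optima.append(isoflop_optimum(sizes, losses))

param_exp = power_law_exponent(budgets, optima)
tokens = [c / (6 * n) for c, n in zip(budgets, optima)]  # D = C / (6 N)
data_exp = power_law_exponent(budgets, tokens)
print(f"param exponent: {param_exp:.3f}, data exponent: {data_exp:.3f}")
```

Because the synthetic optimum is planted at N_opt ∝ C^0.5, both exponents come out near 0.5 by construction. In a real analysis you would run this per modality on measured losses and compare the resulting exponents.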
On a personal note, this project was a long journey and I grew a lot. Even more challenging than the technical hurdles were navigating organizational dynamics, advocating for the research vision, and securing compute resources. A huge thanks to the incredible team that made this a reality, to FAIR for the support, and especially to Peter Tong and John Nguyen for sticking together as a team through all the highs and lows!
This work builds upon all of our experiences in and beliefs about visual representations + multimodal modeling. I see this as just the first step towards building intelligent systems that can understand, reason, and plan in the physical world.
Many thanks to my collaborators: Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, PhD, Yann LeCun, and Saining Xie