Abstract
Multimodal understanding is a central capability on the path toward AGI, enabling models to integrate perception and reasoning across diverse modalities. Within this space, vision–language understanding plays a foundational role and supports a wide range of real-world applications, such as AI-assisted creative tools and multimodal personal assistants. Foundation Vision–Language Models (VLMs) and Multimodal Large Language Models (MLLMs) form the backbone of these systems. VLMs aim to learn multimodal representations for downstream tasks such as retrieval and segmentation, as well as for building multimodal models such as LLaVA. MLLMs, in turn, aim to perform complex reasoning over visual representations. At a high level, robust multimodal reasoning can be decomposed into three key components: (1) learning better visual representations from backbone vision encoders, (2) aligning them effectively with LLM representations, and (3) enabling stronger reasoning over the aligned features. My research addresses each of these components. (1) To enhance visual representations, I propose TEAM, which uses language-driven augmentations to improve CLIP’s cross-domain generalization, and E2, which integrates token-fusor layers into the CLIP vision encoder to capture finer-grained visual distinctions. (2) To improve multimodal alignment, I develop TUNA, which retrieves images similar to a given input and uses their tag information to enhance an MLLM’s understanding of the visual representations. I also propose a multi-view image fusor that combines multiple vision encoders to give MLLMs stronger visual comprehension. (3) To advance complex visual reasoning in more practical scenarios, I introduce Video Contrastive Decoding, which perturbs visual features to expose MLLMs to erroneous temporal reasoning that serves as a negative target during the generation of correct responses. I further propose AVIS for general multi-image understanding, which uncovers positional sensitivity and improves cross-image reasoning. Together, these works contribute to the visual alignment of multimodal foundation models.