On the Alignment of Multimodal Foundation Models

Qi, Daiqing

Author

Qi, Daiqing, School of Data Science, University of Virginia

Advisors

Li, Sheng, School of Data Science, University of Virginia

Abstract

Multimodal understanding is a central capability on the path toward AGI, enabling models to integrate perception and reasoning across diverse modalities. Within this space, vision–language understanding plays a foundational role and supports a wide range of real-world applications, such as AI-assisted creative tools and multimodal personal assistants. Foundation Vision–Language Models (VLMs) and Multimodal Large Language Models (MLLMs) form the backbone of these systems. VLMs aim to learn multimodal representations for downstream tasks such as retrieval and segmentation, as well as for building multimodal models such as LLaVA. Meanwhile, MLLMs aim to perform complex reasoning over visual representations. At a high level, robust multimodal reasoning can be decomposed into three key components: (1) learning better visual representations from backbone vision encoders, (2) aligning them effectively with LLM representations, and (3) enabling stronger reasoning over the aligned features. My research addresses each of these components. (1) To enhance visual representations, I propose TEAM, which uses language-driven augmentations to improve CLIP’s cross-domain generalization, and E2, which integrates token-fusor layers into the CLIP vision encoder to capture finer-grained visual distinctions. (2) To improve multimodal alignment, I develop TUNA, which retrieves images similar to a given input and uses their tag information to enhance an MLLM’s understanding of the visual representations. I also propose a multi-view image fusor that combines multiple vision encoders to give MLLMs stronger visual comprehension. (3) To advance complex visual reasoning in more practical scenarios, I introduce Video Contrastive Decoding, which perturbs visual features to expose MLLMs to erroneous temporal reasoning that serves as a negative target during the generation of correct responses. I further propose AVIS for general multi-image understanding, which uncovers positional sensitivity and improves cross-image reasoning. Together, these methods contribute to the visual alignment of multimodal foundation models.

Degree

PHD (Doctor of Philosophy)

Keywords

Multimodal Learning; Machine Learning; Deep Learning; Large Language Models; Vision Language Models; Computer Vision; Natural Language Processing

Language

English

Rights

Issued Date

2026-05-01

Persistent Link

https://doi.org/10.18130/240v-5713

Suggested Citation

Qi, Daiqing. On the Alignment of Multimodal Foundation Models. University of Virginia, School of Data Science, PHD (Doctor of Philosophy), 2026-05-01, https://doi.org/10.18130/240v-5713.

Files

1_Qi_Daiqing_2026_PHD.pdf
