Abstract
Multi-modal learning has become a central paradigm in modern artificial intelligence, enabling models to integrate and reason across heterogeneous modalities such as vision and language. Large-scale foundation models—including CLIP, BLIP, and GPT-4V—have demonstrated impressive capabilities in cross-modal understanding and generation. However, their deployment in real-world applications raises significant concerns regarding trustworthiness: these models may fail under distribution shifts, exhibit unintended behavioral changes during adaptation, provide limited transparency for decision-making, and offer insufficient control over generated outputs.
This dissertation targets the overarching goal of enabling trustworthy deployment of large multi-modal foundation models. We pursue this goal through four connected objectives. First, models must stay aligned with up-to-date information as knowledge evolves. We propose a dynamic model editing framework that balances generality and locality, enabling precise knowledge updates in multi-modal models while preserving broader capabilities. Second, because adaptation and editing create new attack surfaces, trustworthy deployment requires the ability to detect safety risks introduced by model modification. We analyze the security implications of model editing, demonstrating that large pre-trained models can be vulnerable to rapid backdoor injection and highlighting the risks of uncontrolled adaptation. Third, trustworthy systems must support human oversight by enabling users to interpret model outputs. We improve interpretability in vision–language reasoning through prompt-based approaches that align visual and textual distributions, yielding more transparent and explainable fine-grained image classification. Finally, even when a model’s knowledge is correct, a deployed system must be able to control the model’s reasoning and outputs to match user intent and task-specific requirements. Building upon these foundations, we introduce Guided Decoding, an inference-time controllability mechanism that mitigates the “expert-default bias” of multi-modal language models by modulating reasoning proficiency during generation, enabling fine-grained behavioral control without modifying the underlying parameters.
In summary, these contributions provide algorithmic tools and conceptual insights that collectively advance the alignment, safety-risk detection, interpretability, and controllability of large multi-modal foundation models, supporting their more reliable deployment in real-world applications.