Abstract
Large language models (LLMs) are increasingly deployed in high-stakes settings such as finance, healthcare, and education, where failures in safety and fairness can cause real harm. In this dissertation, we advance LLM trustworthiness by (i) benchmarking the fairness risks that arise within the retrieval-augmented generation pipeline, (ii) developing red-teaming attacks that expose latent vulnerabilities even under seemingly “safe” conditions, and (iii) proposing robust defenses that strengthen alignment in practice.
First, we study Retrieval-Augmented Generation (RAG), a widely used strategy to reduce hallucinations and improve domain grounding, and show that its benefits are not a free lunch. We propose a practical three-level threat model based on user awareness of fairness, which induces different degrees of “fairness censorship” in the external dataset. Across uncensored, partially censored, and fully censored datasets, we find that fairness alignment can be undermined through RAG alone, without any fine-tuning, and that biased outputs can persist even when the external data is ostensibly unbiased. These results reveal fundamental limitations of current alignment when models are coupled with retrieval.
Second, we uncover a complementary vulnerability in the fine-tuning stage: even benign training data can degrade safety. Building on this phenomenon, we develop a stronger red-teaming attack by identifying the most safety-degrading outliers within benign datasets. We introduce Self-Inf-N, an outlier-detection method that selects a small set of influential samples. Fine-tuning on as few as 100 selected outliers can severely compromise safety alignment across seven mainstream LLMs; the attack is highly transferable and is poorly covered by existing mitigations.
Third, motivated by these failures, we develop reasoning-aware alignment defenses. Using causal intervention, we show that many jailbreak vulnerabilities stem from shallow alignment behaviors that refuse without deeply understanding harmful intent. We address this by constructing a new Chain-of-Thought fine-tuning dataset spanning utility and safety-critical prompts to encourage principled, reasoned refusals. Finally, we propose Alignment-Weighted DPO, which assigns different preference weights to reasoning versus response segments, enabling targeted updates that improve jailbreak robustness while preserving utility.
Together, these studies provide a unified framework for diagnosing and mitigating trustworthiness failures across RAG and fine-tuning, offering actionable guidance for deploying safer and fairer LLM systems.