Abstract
Recent advances in computer vision and image analysis have demonstrated the power of deep networks that explicitly model the geometric shape of deformable objects, driving progress across diverse tasks including classification, generation, reconstruction, and segmentation. However, effective shape-based representation learning faces several critical challenges. Many existing methods struggle to represent the underlying training data distribution accurately: they sample geometric transformations from random distributions without incorporating group-specific prior information, and the resulting transformations can violate topological integrity or fail to capture the multimodal nature of real-world object deformations. Furthermore, even when such representations are learned, integrating them with texture-based features remains an open challenge, as existing models rely on simplistic fusion strategies that fail to exploit cross-modal dependencies between shape and texture. The learned representations also remain prone to conflating invariant correlations with spurious ones, which limits generalization across diverse environments. In addition, models struggle to capture fine-grained differences between similar objects and often depend on reference images at inference time, undermining both diagnostic accuracy and practical feasibility. These limitations become more pronounced in the presence of pathological appearance changes, such as tumors or lesions, that disrupt anatomical correspondences. Addressing these challenges requires frameworks that learn robust geometric representations, leverage complementary visual cues, and generalize across diverse clinical scenarios.
My research focuses on developing advanced deep learning frameworks that learn robust and complex shape representations from medical image data and integrate them into the current paradigm of image appearance and texture learning, thereby maximizing their impact on a broad range of medical image analysis tasks. To this end, the dissertation presents (i) a multimodal geometric augmentation framework that learns expressive latent deformation distributions from groupwise images within a variational autoencoder, generalizing beyond predefined random transformations; (ii) a cross-modal fusion architecture that jointly learns the interaction between deformable shape and texture representations in a shared latent space, explicitly modeling their interdependencies across temporal phases; (iii) a representation learning model that jointly learns invariant shape and texture features in an integrated latent space, robust to spurious correlations across diverse training environments; (iv) a discriminative shape network that captures fine-grained differences between similar objects without relying on reference images at inference time; and (v) an uncertainty-guided disentangled deformation framework that preserves anatomical topology under significant pathological appearance changes by jointly disentangling geometric and appearance representations. Together, these contributions establish a paradigm of shape-aware learning that enhances robustness, generalization, and interpretability across a wide range of imaging applications.