Integrating Geometric Understanding in Generative Diffusion Models With Text Instructions

Author:
Gupta, Ishita, Computer Science - School of Engineering and Applied Science, University of Virginia
Advisor:
Zhang, Miaomiao, EN-Elec & Comp Engr Dept, University of Virginia
Abstract:

Generative models have revolutionized image synthesis and editing, achieving unparalleled success in semantic and stylistic transformations. Despite this progress, their ability to learn and apply precise geometric transformations remains limited by explicit conditioning methods and a dependence on large amounts of labeled data, restricting their use in real-world applications that demand spatial accuracy. This thesis presents a novel approach that enables generative models to implicitly learn and apply geometric transformations, particularly rotation, through latent space manipulation. An integrated system is proposed that combines text-guided generation with geometric reasoning, embedding geometric features directly into the learning pipeline. By combining latent space learning with text-based guidance and diffusion-based denoising, the proposed framework achieves precise and interpretable geometric transformations, focusing on rotations as a proof of concept. The model learns a latent representation that captures the geometric difference between the source and target images, and the transformation parameter (the rotation angle) is read directly from this latent space. The latent representation, combined with the source image and text embeddings, conditions a diffusion model that refines the output, enhancing the coherence of the transformed images while maintaining consistency with the geometric constraints specified in the text prompts. Experimental results demonstrate that the proposed framework effectively captures and applies rotation transformations, outperforming the baseline methods (FastEdit, SDEdit, and InstructPix2Pix) with an FID of 4.85, an IS of 3.53, and an SSIM of 0.88, and aligning closely with ground-truth transformations. The modularity of the architecture suggests its potential to generalize beyond rotations and paves the way for more robust generative modeling techniques, opening avenues for applications in medical imaging, robotics, and augmented reality, where precise and efficient image manipulation is critical.
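To make the described pipeline concrete, below is a minimal PyTorch sketch of the idea: a pair encoder that maps source and target images to a latent capturing their geometric difference, an angle head that reads the rotation parameter directly out of that latent, and a toy denoiser conditioned on the source image, the latent, and a text embedding. All module names, dimensions, and the simplified noise schedule are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLatentEncoder(nn.Module):
    """Encodes a (source, target) pair into a latent meant to capture
    their geometric difference (here: a rotation). Hypothetical sketch."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),  # 6 = stacked RGB pair
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # The transformation parameter (rotation angle) is read
        # directly from the latent space.
        self.angle_head = nn.Linear(latent_dim, 1)

    def forward(self, source, target):
        z = self.backbone(torch.cat([source, target], dim=1))
        return z, self.angle_head(z)  # latent, predicted angle

class ConditionalDenoiser(nn.Module):
    """Toy diffusion denoiser conditioned on the source image, the
    geometric latent, and a text embedding."""
    def __init__(self, latent_dim=128, text_dim=64):
        super().__init__()
        self.cond_proj = nn.Linear(latent_dim + text_dim + 1, 32)  # +1 for timestep
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3 + 32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),  # predicts the added noise
        )

    def forward(self, noisy_target, source, z, text_emb, t):
        # Fuse latent, text embedding, and timestep, then broadcast spatially.
        cond = self.cond_proj(torch.cat([z, text_emb, t], dim=1))
        cond = cond[:, :, None, None].expand(-1, -1, *noisy_target.shape[2:])
        return self.net(torch.cat([noisy_target, source, cond], dim=1))

# One illustrative training step: a denoising loss plus direct
# supervision of the angle read from the latent (all data is dummy).
enc, den = GeometricLatentEncoder(), ConditionalDenoiser()
source, target = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
text_emb = torch.randn(4, 64)        # e.g. embedding of "rotate by 30 degrees"
gt_angle = torch.rand(4, 1)          # ground-truth rotation (radians)

z, pred_angle = enc(source, target)
t = torch.rand(4, 1)                 # diffusion timestep in (0, 1)
noise = torch.randn_like(target)
a = (1 - t).sqrt()[:, :, None, None] # crude linear-variance schedule
noisy = a * target + (1 - a**2).sqrt() * noise

loss = F.mse_loss(den(noisy, source, z, text_emb, t), noise) \
     + F.mse_loss(pred_angle, gt_angle)
loss.backward()
```

In the full framework the denoiser would presumably be a U-Net and the text embedding would come from a pretrained text encoder; this sketch only illustrates how the three conditioning signals (source image, geometric latent, text) can be fused in a denoising step.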

Degree:
MS (Master of Science)
Keywords:
Generative Diffusion Models, Geometric Transformations, Text-Guided Image Editing, Latent Space Representations, Implicit Transformation Learning
Language:
English
Rights:
All rights reserved (no additional license for public reuse)
Issued Date:
2024/12/09