Abstract:
In recent years, text-to-image generation has attracted considerable attention, and diffusion models have proven to be the state of the art in this domain. However, owing to their high computational demands, most current research focuses on improving their efficiency and image quality. Existing text-to-image solutions have also been found to offer very limited usability and applicability because they provide little control over the generation process. This project addresses the lack of control and customization in text-to-image diffusion models by developing a solution that enhances their controllability and customizability.
This project proposes a unified architecture and pipeline that combines multiple fine-tuning techniques to enable both subject personalization and conditional control. Subject personalization allows customized image generation of specific subjects, while conditional control enables the diffusion model to utilise conditioning images during image generation. The diffusion model must be fine-tuned on multiple datasets to enable these techniques.
The prototype implementation successfully demonstrates the core functionalities of the proposed solution. Based on a qualitative self-evaluation, the implemented architecture and pipeline demonstrate the primary fine-tuning techniques with satisfactory results. The fine-tuned latent diffusion model utilised in the prototype achieved a quantitative CLIP score of 71.15.