Abstract
This paper reveals the untapped potential of the U-ViT architecture in
diffusion models. The study first examines the contribution of the
U-ViT architecture to the visual generation task of multimodal diffusion
models and proposes an improvement scheme, "FreeV",
specifically designed for the U-ViT architecture, marking the
first application of the U-Net-based FreeU enhancement framework within
the Transformer architecture. The FreeV framework significantly enhances
generation quality without requiring additional training or fine-tuning.
The key insight of this study lies in balancing the contributions from
the backbone network, skip connections, and fused feature maps within
the U-ViT to fully leverage the advantages of the backbone and skip connections while
circumventing the limitations of feature fusion in U-ViT. Project page:
https://github.com/GoldenFishes/FreeV
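
The balancing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the skip connection is fused by concatenating backbone and skip features and applying a linear projection, and the function name fuse_tokens, the scaling factors b, s, and f, and the toy weight matrix are all hypothetical names introduced here for illustration only.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def scale(v: Vector, factor: float) -> Vector:
    """Multiply every element of a feature vector by a scalar."""
    return [factor * x for x in v]


def fuse_tokens(backbone: Vector, skip: Vector, weight: Matrix,
                b: float = 1.0, s: float = 1.0, f: float = 1.0) -> Vector:
    """Re-weight backbone and skip features, fuse them, then re-weight the result.

    b, s, and f are hypothetical scaling factors for the backbone features,
    the skip features, and the fused feature map. With b = s = f = 1.0 this
    reduces to plain concatenate-and-project fusion (an assumed fusion rule).
    """
    # Concatenate the scaled backbone and skip features along the channel axis.
    combined = scale(backbone, b) + scale(skip, s)
    # Linear projection back to the model width (bias omitted for brevity).
    fused = [sum(w * x for w, x in zip(row, combined)) for row in weight]
    # Scale the fused feature map.
    return scale(fused, f)


# Toy example: two backbone channels, two skip channels, projection to width 2.
backbone = [1.0, 2.0]
skip = [3.0, 4.0]
weight = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0]]

baseline = fuse_tokens(backbone, skip, weight)
rebalanced = fuse_tokens(backbone, skip, weight, b=2.0, s=0.5)
```

Because the factors are applied only at inference time, a scheme of this shape needs no additional training or fine-tuning, which is consistent with the training-free claim above; how the actual factors are chosen and where they are applied is not specified in this abstract.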