FreeV: Free Lunch in MultiModal Diffusion U-ViT

Zhou Qiangong; Youyu Zhou; Yahong Wang

doi:10.36227/techrxiv.24633840.v1

Abstract

This paper reveals the untapped potential of the U-ViT architecture in diffusion models. The study first examines the contribution of the U-ViT architecture to visual generation in multimodal diffusion models and then proposes "FreeV", an improvement scheme designed specifically for U-ViT. This marks the first application of the U-Net-based FreeU enhancement framework within a Transformer architecture. FreeV significantly enhances generation quality without requiring additional training or fine-tuning. The key insight is to balance the contributions of the backbone network, the skip connections, and the fused feature maps within U-ViT, leveraging the advantages of each component while circumventing the limitations of feature fusion in U-ViT. Project page: https://github.com/GoldenFishes/FreeV
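To make the idea of balancing the three contributions concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a U-ViT long skip connection fuses backbone tokens and skip tokens by concatenating them and applying a linear projection, and that a training-free adjustment amounts to multiplying each stream by a scalar at inference time. The function names and the factors `b` (backbone), `s` (skip), and `f` (fused output) are hypothetical and chosen only for illustration; the actual FreeV scaling rules may differ.

```python
# Illustrative sketch only: names and scaling factors are assumptions,
# not taken from the FreeV paper or its repository.

def scale(tokens, factor):
    """Multiply every feature of every token by a scalar."""
    return [[factor * x for x in row] for row in tokens]

def linear(tokens, weight, bias):
    """Apply y = x @ weight + bias; weight has shape (d_in, d_out)."""
    columns = list(zip(*weight))  # one tuple of d_in weights per output feature
    return [
        [sum(x * w for x, w in zip(row, col)) + b for col, b in zip(columns, bias)]
        for row in tokens
    ]

def fuse(backbone, skip, weight, bias, b=1.0, s=1.0, f=1.0):
    """Scale both streams, concatenate per token, project, then scale the result.

    With b = s = f = 1.0 this reduces to the unmodified concat + linear fusion.
    """
    concat = [hb + hs for hb, hs in zip(scale(backbone, b), scale(skip, s))]
    return scale(linear(concat, weight, bias), f)

# Toy example: one token with two features per stream.
backbone = [[1.0, 2.0]]
skip = [[3.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # sums the two streams
bias = [0.0, 0.0]

baseline = fuse(backbone, skip, weight, bias)                    # [[4.0, 6.0]]
rebalanced = fuse(backbone, skip, weight, bias, b=2.0, s=0.5)    # [[3.5, 6.0]]
```

Because the factors are applied only at inference time and the projection weights are left untouched, a scheme of this kind needs no additional training or fine-tuning, which is the property the abstract emphasizes.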