Controllable Diffusion Models for Fine-Grained Image Editing via Prompt-Guided Semantic Inpainting
Keywords:
Diffusion models, image editing, semantic inpainting, prompt-guided editing, cross-attention, mask encoder, dynamic prompt tokens

Abstract
Diffusion models have transformed image synthesis, yet most fail to support fine-grained editing guided by user intent. We present PromptEditDiff, a new, efficient prompt-based method for fine-grained image editing via semantic inpainting. Our model combines a cross-attention mechanism, a purpose-built mask encoder, and dynamic prompt tokens to enable precise, region-specific modifications in response to text prompts. Extensive experiments on the CelebA-HQ and COCO datasets demonstrate that PromptEditDiff substantially outperforms state-of-the-art baselines in both photorealism and prompt alignment, reducing FID to 6.2 and achieving a human preference rate of 84.7% over the baselines. Quantitative metrics and user studies alike show that PromptEditDiff enables more accurate, intuitive, and controllable image editing, bringing precise text-driven manipulation of visual content within easy reach of users.
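The abstract names three conditioning components: a cross-attention mechanism, a mask encoder, and dynamic prompt tokens. The sketch below shows one plausible way these could fit together in a denoising block; it is an illustration under our own assumptions, not the paper's implementation. All class names, layer shapes, and token counts (MaskEncoder, PromptEditCrossAttention, num_tokens, num_dynamic) are hypothetical.

```python
import torch
import torch.nn as nn

class MaskEncoder(nn.Module):
    """Encode a binary edit mask into conditioning tokens.
    Hypothetical design; the paper does not specify this architecture."""
    def __init__(self, token_dim=768, num_tokens=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, 128, 4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d((num_tokens, 1)),
        )
        self.proj = nn.Linear(128, token_dim)

    def forward(self, mask):                       # mask: (B, 1, H, W)
        feats = self.conv(mask)                    # (B, 128, num_tokens, 1)
        feats = feats.squeeze(-1).transpose(1, 2)  # (B, num_tokens, 128)
        return self.proj(feats)                    # (B, num_tokens, token_dim)

class PromptEditCrossAttention(nn.Module):
    """Cross-attend U-Net features over the concatenation
    [text tokens | mask tokens | learned dynamic prompt tokens]."""
    def __init__(self, query_dim=320, token_dim=768, num_dynamic=4, heads=8):
        super().__init__()
        # Learned tokens shared across samples, refined during training.
        self.dynamic_tokens = nn.Parameter(
            torch.randn(1, num_dynamic, token_dim) * 0.02)
        self.attn = nn.MultiheadAttention(query_dim, heads,
                                          kdim=token_dim, vdim=token_dim,
                                          batch_first=True)

    def forward(self, x, text_tokens, mask_tokens):
        # x: (B, N, query_dim) U-Net feature tokens
        # text_tokens: (B, T, token_dim), e.g. frozen text-encoder outputs
        dyn = self.dynamic_tokens.expand(x.shape[0], -1, -1)
        context = torch.cat([text_tokens, mask_tokens, dyn], dim=1)
        out, _ = self.attn(x, context, context)
        return x + out                             # residual update

# Toy usage: one conditioning pass with stand-in tensors.
enc, attn = MaskEncoder(), PromptEditCrossAttention()
mask = torch.zeros(1, 1, 256, 256)
mask[:, :, 64:128, 64:128] = 1.0                   # region selected for editing
text = torch.randn(1, 77, 768)                     # stand-in text embeddings
feats = torch.randn(1, 1024, 320)                  # stand-in U-Net features
out = attn(feats, text, enc(mask))
print(out.shape)                                   # torch.Size([1, 1024, 320])
```

Under this reading, the mask tokens localize the edit while the dynamic prompt tokens give the model trainable slack to reconcile the text prompt with the masked region; other factorizations (e.g. applying the mask as an attention bias) would also be consistent with the abstract.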