CompleteMe: Reference-based Human Image Completion

UC Merced, Adobe Research

Motivation

Given an occluded human image, non-reference methods such as LOHC and BrushNet can generate plausible results, but they lack person-specific information such as distinctive clothing or tattoo patterns (highlighted in the red box). Such information can only be acquired from additional reference images. Even when given a reference image, MimicBrush fails to find the corresponding parts between the input and the reference. Our CompleteMe preserves identity and fine details from the reference image and generates a consistent result.

Abstract

Recent methods for human image completion can reconstruct plausible body shapes but often fail to preserve unique details, such as specific clothing patterns or distinctive accessories, without explicit reference images. Even state-of-the-art reference-based inpainting approaches struggle to accurately capture and integrate fine-grained details from reference images. To address this limitation, we propose CompleteMe, a novel reference-based human image completion framework. CompleteMe employs a dual U-Net architecture combined with a Region-focused Attention (RFA) Block, which explicitly guides the model's attention toward relevant regions in reference images. This approach effectively captures fine details and ensures accurate semantic correspondence, significantly improving the fidelity and consistency of completed images. Additionally, we introduce a challenging benchmark specifically designed for evaluating reference-based human image completion tasks. Extensive experiments demonstrate that our proposed method achieves superior visual quality and semantic consistency compared to existing techniques.

Method


Our proposed CompleteMe utilizes a dual U-Net framework composed of a Reference U-Net and a Complete U-Net.
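To make the dual U-Net wiring concrete, the PyTorch sketch below pairs a stand-in Reference U-Net with a stand-in Complete U-Net. All module names, channel sizes, and the placeholder fusion step are our own assumptions for exposition; the actual networks are diffusion U-Nets, and the fusion is performed by the RFA Block described next.

# Minimal sketch of the dual U-Net wiring (illustrative names, not the
# authors' code): a Reference U-Net extracts features from reference
# images, which the Complete U-Net consumes while completing the input.
import torch
import torch.nn as nn

class ReferenceUNet(nn.Module):
    """Stand-in encoder: returns per-layer reference features."""
    def __init__(self, dim=320):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Conv2d(4 if i == 0 else dim, dim, 3, padding=1) for i in range(3)]
        )

    def forward(self, ref_latents):
        feats, h = [], ref_latents
        for block in self.blocks:
            h = torch.relu(block(h))
            feats.append(h)
        return feats  # multi-scale reference features

class CompleteUNet(nn.Module):
    """Stand-in completion network that consumes reference features."""
    def __init__(self, dim=320):
        super().__init__()
        self.inp = nn.Conv2d(4, dim, 3, padding=1)
        self.fuse = nn.ModuleList([nn.Conv2d(2 * dim, dim, 1) for _ in range(3)])
        self.out = nn.Conv2d(dim, 4, 3, padding=1)

    def forward(self, masked_latents, ref_feats):
        h = torch.relu(self.inp(masked_latents))
        for fuse, r in zip(self.fuse, ref_feats):
            # placeholder fusion; in the paper this is the RFA Block
            h = torch.relu(fuse(torch.cat([h, r], dim=1)))
        return self.out(h)

ref_unet, comp_unet = ReferenceUNet(), CompleteUNet()
ref = torch.randn(1, 4, 64, 64)     # encoded reference image (latent)
masked = torch.randn(1, 4, 64, 64)  # encoded masked input (latent)
completed = comp_unet(masked, ref_unet(ref))
print(completed.shape)  # torch.Size([1, 4, 64, 64])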

Given an input image with masked regions, we first encode the input image into latent features. The Reference U-Net then extracts detailed visual features from multiple reference images, which depict different human body parts. Along with global semantic features extracted by CLIP, the reference features are processed within our novel Region-focused Attention (RFA) Block embedded in the Complete U-Net. These reference features are explicitly masked according to the reference masks, producing masked reference features. This explicit masking and concatenation strategy enables the model to zoom in precisely on relevant human regions, establishing accurate and fine-grained correspondences through the Region-focused Attention mechanism. Finally, decoupled cross-attention integrates these refined local features with the global semantic CLIP features, yielding a detailed and semantically coherent completion (see the sketch below).
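The sketch below illustrates the two key ideas in this paragraph: explicit masking of reference features by their reference masks before attention, and decoupled cross-attention that combines local reference detail with global CLIP semantics. The class name, tensor shapes, and layer choices are assumptions for illustration, and summing the two attention outputs is one common way to realize decoupled cross-attention; the released RFA implementation may differ in details.

# Hedged sketch of the Region-focused Attention idea (assumed shapes/layers).
import torch
import torch.nn as nn

class RegionFocusedAttention(nn.Module):
    def __init__(self, dim=320, clip_dim=768, heads=8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.clip_proj = nn.Linear(clip_dim, dim)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hidden, ref_feats, ref_mask, clip_feats):
        # hidden:     (B, N, C)        Complete U-Net tokens
        # ref_feats:  (B, M, C)        Reference U-Net tokens
        # ref_mask:   (B, M)           1 for valid reference regions, 0 elsewhere
        # clip_feats: (B, K, clip_dim) global semantic CLIP tokens
        masked_ref = ref_feats * ref_mask.unsqueeze(-1)  # explicit masking
        local, _ = self.local_attn(hidden, masked_ref, masked_ref,
                                   key_padding_mask=(ref_mask == 0))
        g = self.clip_proj(clip_feats)
        global_, _ = self.global_attn(hidden, g, g)
        # decoupled cross-attention: add local detail and global semantics
        return hidden + local + global_

rfa = RegionFocusedAttention()
out = rfa(torch.randn(2, 64, 320),             # U-Net tokens
          torch.randn(2, 128, 320),            # reference tokens
          (torch.rand(2, 128) > 0.5).float(),  # reference masks
          torch.randn(2, 77, 768))             # CLIP features
print(out.shape)  # torch.Size([2, 64, 320])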

Visual Comparison


Visual Comparison with Non-reference Methods. We compare CompleteMe with the non-reference methods LOHC and BrushNet. Given masked inputs, these methods generate plausible content for the masked regions using image priors or text prompts. However, as indicated in the red box, they cannot reproduce person-specific details such as tattoos or unique clothing patterns, since they have no reference image to guide the reconstruction.


Visual Comparison with Reference-based Methods. Our CompleteMe generates more realistic results and preserves identity information from the reference image. Please refer to the red box regions for a detailed comparison. For more results, please refer to the supplementary material.

BibTeX

@article{tsai2025completeme,
  title={CompleteMe: Reference-based Human Image Completion},
  author={Tsai, Yu-Ju and Price, Brian and Liu, Qing and Figueroa, Luis and Pakhomov, Daniil and Ding, Zhihong and Cohen, Scott and Yang, Ming-Hsuan},
  journal={arXiv preprint arXiv:2504.20042},
  year={2025}
}