Restoring facial details from low-quality (LQ) images remains a challenging problem due to the ill-posedness induced by diverse degradations in the wild. The codebook prior mitigates this ill-posedness by leveraging an autoencoder and a learned codebook of high-quality (HQ) features, achieving remarkable quality. However, existing approaches in this paradigm rely on a single encoder pre-trained on HQ data to encode the LQ inputs during restoration, disregarding the domain gap between LQ and HQ images. As a result, the LQ inputs may be encoded inadequately, leading to suboptimal performance.
To tackle this problem, we propose a novel dual-branch framework named DAEFR. Our method introduces an auxiliary LQ branch that extracts crucial information from the LQ inputs. Additionally, we incorporate association training to promote effective synergy between the two branches, enhancing code prediction and output quality. We evaluate the effectiveness of DAEFR on both synthetic and real-world datasets, demonstrating its superior performance in restoring facial details.
(a) Existing codebook prior approaches learn an encoder on HQ data in the first stage. During the restoration stage, they fine-tune this encoder on LQ images, initializing it with the HQ pre-trained weights. However, this introduces a domain bias, since a domain gap remains between the HQ-trained encoder and the LQ input images.
(b) We propose an auxiliary branch specifically designed to encode LQ information. This branch is trained exclusively on LQ data to address the domain bias and obtain accurate feature representations. Furthermore, we introduce an association stage and a feature fusion module to integrate the information from both encoders and assist our restoration pipeline.
(a) Initially, we train an autoencoder and a discrete codebook for both the HQ and LQ image domains through self-reconstruction (a sketch of the vector-quantization step appears after this caption).
(b) Once both encoders are obtained, we divide the features into patches and construct a similarity matrix M_assoc that associates HQ and LQ features while incorporating spatial information. To maximize the similarity between corresponding patch features, we apply a cross-entropy loss that maximizes the diagonal of the matrix (see the association-loss sketch below).
(c) After obtaining the associated encoders, we use a multi-head cross-attention (MHCA) module to merge their features into a fused feature. We then feed the fused feature to a transformer, which predicts the corresponding code index s for the HQ codebook. Finally, we use the predicted code index to retrieve the code features and feed them to the HQ decoder to restore the image (see the fusion and code-prediction sketch below).
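The codebook learning in stage (a) amounts to vector quantization of encoder features. The following is a minimal PyTorch sketch of that step under standard VQ training assumptions; the class name VectorQuantizer and the hyperparameters (n_embed, embed_dim, beta) are illustrative, not the released DAEFR code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator,
    as used when training an autoencoder + discrete codebook by self-reconstruction.
    Hyperparameters are placeholders, not the paper's exact settings."""
    def __init__(self, n_embed=1024, embed_dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_embed, embed_dim)
        self.codebook.weight.data.uniform_(-1.0 / n_embed, 1.0 / n_embed)
        self.beta = beta

    def forward(self, z):
        # z: (B, C, H, W) encoder features
        B, C, H, W = z.shape
        z_flat = z.permute(0, 2, 3, 1).reshape(-1, C)        # (B*H*W, C)
        dist = torch.cdist(z_flat, self.codebook.weight)      # distance to every code
        idx = dist.argmin(dim=1)                               # nearest code per spatial location
        z_q = self.codebook(idx).view(B, H, W, C).permute(0, 3, 1, 2)
        # codebook + commitment losses; straight-through gradient to the encoder
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z_q.detach(), z)
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss
```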
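For the association stage in (b), one way to realize the diagonal-maximizing cross-entropy objective on M_assoc is a contrastive loss over paired patch features. The sketch below assumes HQ and LQ feature maps of identical shape from the same images; the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def association_loss(feat_hq, feat_lq, temperature=0.07):
    # feat_hq, feat_lq: (B, C, H, W) features from the HQ and LQ encoders
    # for the same images; each spatial location is treated as one patch.
    B, C, H, W = feat_hq.shape
    hq = F.normalize(feat_hq.flatten(2).transpose(1, 2).reshape(-1, C), dim=-1)  # (B*H*W, C)
    lq = F.normalize(feat_lq.flatten(2).transpose(1, 2).reshape(-1, C), dim=-1)
    m_assoc = hq @ lq.t() / temperature            # similarity matrix M_assoc
    target = torch.arange(m_assoc.size(0), device=m_assoc.device)
    # cross-entropy in both directions pushes the diagonal (matching HQ/LQ patches)
    # to dominate each row and column of M_assoc
    return 0.5 * (F.cross_entropy(m_assoc, target) + F.cross_entropy(m_assoc.t(), target))
```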
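For the restoration stage in (c), a rough PyTorch sketch of the fusion and code-prediction flow is given below. All module names, layer counts, and sizes are placeholders; the HQ codebook and decoder are assumed to be pre-trained components passed in by the caller.

```python
import torch
import torch.nn as nn

class FusionAndPrediction(nn.Module):
    """Sketch: fuse the two associated encoders' features with multi-head
    cross-attention (MHCA), predict HQ codebook indices with a transformer,
    then decode the retrieved code features. Sizes are illustrative."""
    def __init__(self, embed_dim=256, n_embed=1024, n_heads=8, n_layers=9):
        super().__init__()
        self.mhca = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True),
            num_layers=n_layers)
        self.to_logits = nn.Linear(embed_dim, n_embed)   # per-patch code-index logits

    def forward(self, feat_lq, feat_hq_assoc, hq_codebook, hq_decoder):
        # feat_lq, feat_hq_assoc: (B, N, C) patch tokens from the two encoders
        fused, _ = self.mhca(query=feat_lq, key=feat_hq_assoc, value=feat_hq_assoc)
        logits = self.to_logits(self.transformer(fused))  # (B, N, n_embed)
        s = logits.argmax(dim=-1)                          # predicted code index s
        z_q = hq_codebook(s)                               # retrieve HQ code features
        return hq_decoder(z_q)                             # restored image
```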
We compare with state-of-the-art methods on real-world datasets. Our DAEFR remains robust, restoring high-quality faces even under heavy degradation.
@inproceedings{tsai2024dual,
  title     = {Dual Associated Encoder for Face Restoration},
  author    = {Tsai, Yu-Ju and Liu, Yu-Lun and Qi, Lu and Chan, Kelvin CK and Yang, Ming-Hsuan},
  booktitle = {The Twelfth International Conference on Learning Representations},
  year      = {2024}
}