Teaser figure: Input Panorama (left) and Our Predictions (right). Blue and Green represent ground truth labels and our predictions, respectively.
Inherent ambiguity in layout annotations poses significant challenges to developing accurate 360° room layout estimation models. To address this issue, we propose a novel Bi-Layout model capable of predicting two distinct layout types. One stops at ambiguous regions, while the other extends to encompass all visible areas. Our model employs two global context embeddings, where each embedding is designed to capture specific contextual information for each layout type. With our novel feature guidance module, the image feature retrieves relevant context from these embeddings, generating layout-aware features for precise bi-layout predictions.
A unique property of our Bi-Layout model is its ability to inherently detect ambiguous regions by comparing its two predictions. To circumvent the need for manual correction of ambiguous annotations during testing, we also introduce a new metric for disambiguating ground truth layouts. Our method demonstrates superior performance on benchmark datasets, notably outperforming leading approaches. On the MatterportLayout dataset, it improves 3DIoU from 81.70% to 82.57% on the full test set, and from 54.80% to 59.97% on subsets with significant ambiguity.
Prior works generate a single layout prediction from the panorama image. Given a single panorama as input, the state-of-the-art method produces only one prediction, which contains erroneous regions (highlighted by the white box). This failure stems from inherently ambiguous regions in the dataset labels, caused by inconsistent annotation strategies, and is difficult to resolve with a single-prediction method.
The white box indicates the ambiguous regions, where predictions from state-of-the-art methods struggle. We further define two types of ground truth annotations: enclosed and extended. The enclosed annotation encloses the nearest room, while the extended annotation extends to all visible areas. With these two label definitions, our model has a clear target to learn and predict.
Inherent ambiguity in the MatterportLayout dataset. Blue and Green represent ground truth annotations and predictions from the SoTA models, respectively. Layout boundaries are shown on the left, and their bird's-eye-view projections on the right. We define two types of layout annotation: (a) the enclosed type encloses the room; (b) the extended type extends to all visible areas. The dashed lines underscore the ambiguity in the dataset labels.
We propose our Bi-Layout model, which generates two distinct layouts from a single panorama. Between these two predictions, we can select the one that best fits the dataset label, achieving better performance and resolving the ambiguity inherent in the labels.
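To make the selection step concrete, below is a minimal sketch of per-sample disambiguation, assuming each branch's prediction and the ground truth are available as binary bird's-eye-view occupancy masks; the function names and the mask-IoU criterion are illustrative stand-ins, not the paper's exact procedure:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary floor-plan occupancy masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def disambiguate(pred_enclosed: np.ndarray,
                 pred_extended: np.ndarray,
                 gt: np.ndarray):
    """Per sample, keep whichever branch better fits the (possibly
    inconsistently annotated) ground-truth label."""
    iou_enc = mask_iou(pred_enclosed, gt)
    iou_ext = mask_iou(pred_extended, gt)
    if iou_enc >= iou_ext:
        return pred_enclosed, iou_enc
    return pred_extended, iou_ext
```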
Our Bi-Layout model consists of three main components.
(a) The first component is the Feature Extractor, which encodes the panorama. We use a ResNet to produce features at multiple levels and a simplified height compression module to compress and concatenate them into an efficient 1D feature representation (see the first sketch after this list).
(b) The second component is our Global Context Embeddings, which learn global contextual information from the dataset labels during training, one embedding per layout type (see the second sketch below).
(c) The third component is our Shared Feature Guidance Module, which guides the shared panorama feature with the corresponding global context embedding and generates the final feature for each prediction. This is the key part of our design for producing the two distinct layout predictions (see the third sketch below).
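As a rough PyTorch illustration of component (a), the sketch below collapses each ResNet stage's 2D feature map into a per-column 1D feature and concatenates the stages at a common width. The channel counts, the plain height averaging, and the target width are our assumptions, not the paper's exact module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeightCompression(nn.Module):
    """Collapse the height axis of a 2D feature map into a 1D
    per-column feature. Plain height averaging is an assumption;
    the paper's module may use a learned compression instead."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)      # (B, out_ch, H, W)
        return x.mean(dim=2)  # average over height -> (B, out_ch, W)

class MultiScaleCompressor(nn.Module):
    """Compress each ResNet stage and concatenate the results at a
    common width, yielding one 1D sequence for the whole panorama."""
    def __init__(self, in_chs=(256, 512, 1024, 2048), out_ch=64, width=256):
        super().__init__()
        self.width = width
        self.blocks = nn.ModuleList([HeightCompression(c, out_ch) for c in in_chs])

    def forward(self, feats):
        cols = []
        for block, f in zip(self.blocks, feats):
            c = block(f)  # (B, out_ch, W_i)
            # align all stages to the same width before concatenation
            c = F.interpolate(c, size=self.width, mode="linear", align_corners=False)
            cols.append(c)
        return torch.cat(cols, dim=1)  # (B, len(in_chs) * out_ch, width)
```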
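For component (b), the two global context embeddings can be sketched as two independent sets of learnable tokens, one per layout type; the token count and dimension are placeholder hyperparameters:

```python
import torch
import torch.nn as nn

class GlobalContextEmbeddings(nn.Module):
    """Two independent sets of learnable tokens, one per layout
    type. They are optimized jointly with the network, so each set
    absorbs the contextual statistics of its label type."""
    def __init__(self, num_tokens: int = 64, dim: int = 256):
        super().__init__()
        self.enclosed = nn.Parameter(0.02 * torch.randn(num_tokens, dim))
        self.extended = nn.Parameter(0.02 * torch.randn(num_tokens, dim))

    def forward(self, layout_type: str) -> torch.Tensor:
        return self.enclosed if layout_type == "enclosed" else self.extended
```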
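For component (c), one plausible reading of the guidance step is cross-attention with shared weights, where the shared panorama feature queries a layout-specific embedding; the residual-plus-norm structure and head count are assumptions rather than the paper's verified design:

```python
import torch
import torch.nn as nn

class SharedFeatureGuidance(nn.Module):
    """Shared-weight cross-attention: the shared 1D panorama
    feature queries one layout-specific context embedding, so the
    same module produces two layout-aware features, one per branch."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # feat:    (B, W, dim)  shared per-column panorama features
        # context: (T, dim)     one global context embedding
        ctx = context.unsqueeze(0).expand(feat.size(0), -1, -1)
        guided, _ = self.attn(query=feat, key=ctx, value=ctx)
        return self.norm(feat + guided)  # residual + norm (our assumption)

# Usage (names hypothetical): run the SAME module once per embedding.
# feat_enc = guidance(feat, embeds("enclosed"))
# feat_ext = guidance(feat, embeds("extended"))
```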
Full-set and subset evaluation. "Equivalent branch" denotes the output branch trained with the same labels as the baseline methods; "Disambiguate" is our proposed metric.
Qualitative comparison on the MatterportLayout (top) and ZInD (bottom) datasets. Blue and Green represent ground truth labels and predictions, respectively. Room-layout boundaries are shown on the left, and their bird's-eye-view projections on the right. Our disambiguated results effectively address the ambiguity issue, while the SoTA methods struggle with it, as highlighted by the dashed lines.
@inproceedings{tsai2024no,
  title     = {No More Ambiguity in 360{\textdegree} Room Layout via Bi-Layout Estimation},
  author    = {Tsai, Yu-Ju and Jhang, Jin-Cheng and Zheng, Jingjing and Wang, Wei and Chen, Albert and Sun, Min and Kuo, Cheng-Hao and Yang, Ming-Hsuan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2024}
}