Holistic indoor scene understanding, modelling and reconstruction from single images.

Nie, Yinyu

3D indoor scene understanding in computer vision refers to perceiving the semantic and geometric information in a 3D indoor environment from partial observations (e.g. images or depth scans). Scene semantics generally involve conceptual knowledge such as the room layout, object categories, and their interrelationships (e.g. support relationships). These semantics are usually coupled with object and room geometry for 3D scene understanding, for example the layout plan (i.e. the location of walls, ceiling and floor), the shapes of in-room objects, and the camera pose of the observer. This thesis focuses on the problem of holistic 3D scene understanding from single images, in order to model or reconstruct indoor geometry with enriched scene semantics. This challenging task requires computers to perform comparably to the human vision system in perceiving and understanding indoor contents from colour intensities. Existing works either focus on a sub-problem (e.g. layout estimation, 3D detection or object reconstruction) or address the entire problem with independent subtasks, whereas this thesis aims at an integrated and unified solution for semantic scene understanding and reconstruction. In this thesis, scene semantics and geometry are regarded as intertwined and complementary. Understanding each part (semantics or geometry) helps to perceive the other, which enables joint scene understanding, modelling and reconstruction. We start with the problem of semantic scene modelling. To estimate object semantics and shapes from a single image, a feasible scene modelling pipeline is proposed. It is built on a backbone of fully convolutional networks that learn 2D semantics and geometry, and is powered by top-down shape retrieval for object modelling. After this, we build a unified and more efficient visual system for semantic scene modelling. Scene semantics are divided into relational (i.e. support relationship) and non-relational (i.e. object segmentation and geometry, room layout) knowledge. A Relation Network is proposed to estimate the support relations between objects, which guide the object modelling process. Afterwards, we focus on the problem of holistic and end-to-end scene understanding and reconstruction. Instead of modelling scenes by top-down shape retrieval, this method bridges the gap between scene understanding and object mesh reconstruction, and it does not rely on any external CAD repositories. Camera poses, room layout, object bounding boxes and meshes are predicted end-to-end from an RGB image with a single network architecture. Finally, we extend our work to a different input modality, a single-view depth scan, to explore object reconstruction performance. A skeleton-bridged approach is proposed that predicts the meso-skeleton of shapes as an intermediate representation to guide surface reconstruction, and it outperforms prior art in shape completion. Overall, this thesis provides a series of novel approaches towards holistic 3D indoor scene understanding, modelling and reconstruction. It aims at automatic 3D scene perception that enables machines to understand and predict 3D contents as human vision does, which we hope will advance the boundaries of 3D vision in machine perception, robotics and artificial intelligence.

Item Type:	Thesis (Doctoral)
Additional Information:	If you feel that this work infringes your copyright please contact the BURO Manager.
Data available from BORDaR:	https://doi.org/10.18746/bmth.data.00000155
Group:	Faculty of Media & Communication (Until 31/07/2025)
ID Code:	35404
Deposited By:	Symplectic RT2
Deposited On:	16 Apr 2021 15:19
Last Modified:	14 Mar 2022 14:27
