From multi-modal capture to photo-realistic view synthesis: A high-quality and real-time multiview approach

Daniele Bonatto

Research output: Thesis › PhD Thesis


Abstract

Interactive view synthesis is a fundamental task in computer vision, aiming to recreate a
natural scene from any viewpoint using a sparse set of images. The primary focus of this Ph.D.
thesis is the acquisition process and the ability to render high-quality novel views
dynamically for a user. Real-time rendering is the second objective of this research.
In this thesis, we explore two different ways a scene can be reconstructed. The first option
is to estimate the three-dimensional (3D) structure of the scene by means of a set of points
in space called a point cloud (PC). Such a PC can be captured by a variety of devices and
algorithms, such as Time-of-Flight (ToF) cameras, stereo matching, or structure-from-motion
(SfM). Alternatively, the scene can be represented by a set of input views with their associated
depth maps, which can be used in depth image-based rendering (DIBR) to synthesize new
images.
We explore depth image-based rendering algorithms, which use pictures of a scene and their
associated depth maps. These algorithms use the depth information to project the color values
to the novel view position. However, the quality of the depth maps strongly affects the accuracy
of the synthesized views. Therefore, we started by improving the Depth Estimation Reference
Software (DERS) of the Moving Picture Experts Group (MPEG), a worldwide standardization
committee for video compression. Unlike DERS, our Reference Depth Estimation (RDE)
software can take any number of input views, leading to more robust results. It is currently
used to generate novel depth maps for standardized datasets.
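The projection step described above can be sketched as a forward warp: each source pixel is back-projected to 3D using its depth, transformed into the target camera, and re-projected. The following is a minimal NumPy sketch under a pinhole camera model, with a simple z-buffer so nearer surfaces win; the function name and signature are illustrative, not the actual RDE/RVS implementation.

```python
import numpy as np

def dibr_forward_warp(color, depth, K_src, K_dst, R, t):
    """Warp a source view into a target view via its depth map.

    color : (H, W, 3) source image
    depth : (H, W) per-pixel depth in the source camera frame
    K_src, K_dst : 3x3 intrinsics of the source and target cameras
    R, t  : rotation (3x3) and translation (3,) from source to target frame
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates, flattened row-major to 3 x HW.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T

    # Back-project pixels to 3D points in the source camera frame.
    pts = np.linalg.inv(K_src) @ pix * depth.reshape(1, -1)

    # Transform into the target camera frame and project.
    proj = K_dst @ (R @ pts + t.reshape(3, 1))
    z = proj[2]
    uv = (proj[:2] / np.where(z > 0, z, 1.0)).round().astype(int)

    # Splat colors with a z-buffer so nearer points overwrite farther ones.
    out = np.zeros_like(color)
    zbuf = np.full((H, W), np.inf)
    src_rgb = color.reshape(-1, 3)
    for i in range(uv.shape[1]):
        x, y = uv[0, i], uv[1, i]
        if z[i] > 0 and 0 <= x < W and 0 <= y < H and z[i] < zbuf[y, x]:
            zbuf[y, x] = z[i]
            out[y, x] = src_rgb[i]
    return out
```

Pixels that no source point lands on stay black; these are the disocclusion holes that motivate blending several input views, as discussed next.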
Depth estimation does not reach real-time performance: creating a depth map takes minutes
to hours, depending on the input views. We therefore explored active depth-sensing devices,
such as the Microsoft Kinect, to acquire color data and depth maps simultaneously. With these
depth maps available, we address the DIBR problem with a novel algorithm
that seamlessly blends several views together. We focus on obtaining a real-time rendering
method; in particular, we exploited the Open Graphics Library (OpenGL) pipeline to rasterize
novel views and customized dynamic video-loading algorithms to feed frames from video data
into the software pipeline. The resulting Reference View Synthesizer (RVS) software achieves
2x90 frames per second in a head-mounted display while rendering natural scenes. RVS was
initially the default rendering tool in the MPEG Immersive (MPEG-I) community. Over time, it
has evolved to function as the encoding tool and continues to play a crucial role as the reference
verification tool during experiments.
We tested our methods on conventional, head-mounted, and holographic displays. Finally,
we explored advanced acquisition devices and display media, such as (1) plenoptic cameras,
for which we propose a novel calibration method and an improved conversion to sub-aperture
views, and (2) a three-layer holographic tensor display, able to render multiple views without
glasses.
Each piece of this work contributed to the development of photo-realistic methods; we
captured and released several high-quality, high-precision public datasets to the research
community. They are also used by MPEG to develop novel algorithms for the future of
immersive television.
Original language: English
Awarding Institution
  • Vrije Universiteit Brussel
Supervisors/Advisors
  • Munteanu, Adrian, Supervisor
  • Lafruit, Gauthier, Supervisor
Award date: 7 Mar 2024
Publication status: Published - 2024
