Master's Thesis - ReadMe

Source of the project

The basic idea of this project is to use an approach proposed by Chen-Hsuan Lin et al. to learn 3D point clouds from single-image RGB data. Their approach is split into two steps:

  1. Learning depth maps of the projected 3D object from multiple viewpoints
  2. Refining the results by reconstructing the 3D object from the predicted depth information and rendering it from different viewpoints

They essentially learned 3D representations without the need to store or load the ground-truth representations of the objects themselves, and implicitly added constraints based directly on the geometry of the object to a purely two-dimensional training process. Since 2D convolution has seen a lot of development in the past, the first step can be swapped for any network architecture that has proven itself over the years.

1.

All images used are rendered from models in the ShapeNet dataset, using the code provided in this repository.


For the current use of the concept in this project, depth synthesis is performed instead of estimating depth from RGB images. By removing one of the implicit steps between a single image and multi-view depth, the task of generating 3D objects from projections becomes one of shape completion. The network architecture currently implemented for this step is a basic encoder/decoder with learned up- and down-sampling and a low-dimensional latent layer, which also makes this project well suited for use in GANs.

The exact specifications of the convnet for the shown examples are:

| Layer | # filters / units | Filter size | Activation | Output shape |
| --- | --- | --- | --- | --- |
| conv_2d | 256 | (3, 3) | leaky_relu | (32, 32) |
| conv_2d | 512 | (3, 3) | leaky_relu | (16, 16) |
| conv_2d | 1024 | (3, 3) | leaky_relu | (8, 8) |
| fully_connected | 4096 | - | leaky_relu | (4096) |
| fully_connected | 2048 | - | leaky_relu | (2048) |
| fully_connected | 1024 | - | leaky_relu | (1024) |
| fully_connected | 2048 | - | leaky_relu | (2048) |
| fully_connected | 4096 | - | leaky_relu | (4096) |
| conv_2d_transpose | 1024 | (3, 3) | leaky_relu | (8, 8) |
| conv_2d_transpose | 512 | (3, 3) | leaky_relu | (16, 16) |
| conv_2d_transpose | 256 | (3, 3) | leaky_relu | (32, 32) |
| conv_2d | 128 | (3, 3) | leaky_relu | (32, 32) |
| conv_2d | 64 | (3, 3) | leaky_relu | (32, 32) |
| conv_2d | 32 | (3, 3) | linear | (32, 32) |
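
For orientation, a minimal Keras sketch of this encoder/decoder could look like the following. The strides, the padding and the reshape between the dense block and the transposed convolutions are assumptions and are not taken from the actual thesis code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_depth_synthesis_net(input_shape=(32, 32, 1), n_out=32):
    """Hypothetical reconstruction of the table above."""
    def lrelu(x):
        return layers.LeakyReLU(alpha=0.2)(x)

    inp = layers.Input(shape=input_shape)
    x = lrelu(layers.Conv2D(256, 3, padding="same")(inp))                     # (32, 32)
    x = lrelu(layers.Conv2D(512, 3, strides=2, padding="same")(x))            # (16, 16)
    x = lrelu(layers.Conv2D(1024, 3, strides=2, padding="same")(x))           # (8, 8)

    x = layers.Flatten()(x)
    for units in (4096, 2048, 1024, 2048, 4096):                              # 1024 = latent layer
        x = lrelu(layers.Dense(units)(x))
    x = layers.Reshape((8, 8, 64))(x)                                         # 8 * 8 * 64 = 4096

    x = lrelu(layers.Conv2DTranspose(1024, 3, padding="same")(x))             # (8, 8)
    x = lrelu(layers.Conv2DTranspose(512, 3, strides=2, padding="same")(x))   # (16, 16)
    x = lrelu(layers.Conv2DTranspose(256, 3, strides=2, padding="same")(x))   # (32, 32)

    x = lrelu(layers.Conv2D(128, 3, padding="same")(x))
    x = lrelu(layers.Conv2D(64, 3, padding="same")(x))
    out = layers.Conv2D(n_out, 3, padding="same")(x)                          # linear: depth + mask channels
    return tf.keras.Model(inp, out)
```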

The number of output channels (32) is double the chosen number of viewpoints, since masks are learned alongside the depth maps to refine the rendering process in the second step. Using only this NN, the results on 80% of all chairs in the ShapeNet dataset look like this:

  • Examples randomly picked, all available for review
  • All predictions are from an evaluation set never seen by the network during training

As can be seen, the results of the net itself are close to the ground truth at first sight. Yet, when projecting the predicted depth maps into 3D space, the results are subject to noise. The following point cloud is the result of projecting the leftmost chair's predicted depth maps:

2.

Since the aim of the original work and this thesis is to predict 3D point clouds, and not to transform viewpoints of depth maps, these results are not what is needed. To improve upon the results of step 1, the authors introduced two integral components called the "structure generator" and the "pseudo-renderer". These are a pointwise orthogonal projection from 2D into 3D space and a pointwise orthogonal depth rendering, respectively.
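
A rough sketch of such a pointwise back-projection into a common 3D frame is given below. The tensor layout, the fixed per-view rotations and translations, and the function name are assumptions made for illustration; they are not the thesis implementation.

```python
import tensorflow as tf

def backproject_orthogonal(depth, mask, cam_rot, cam_trans):
    """Sketch of a structure-generator-style back-projection.

    depth:     [batch, n_views, H, W]  predicted depth values
    mask:      [batch, n_views, H, W]  boolean hit masks
    cam_rot:   [n_views, 3, 3]         per-view rotations (assumed known)
    cam_trans: [n_views, 3]            per-view translations (assumed known)
    """
    b, v, h, w = depth.shape
    # Under an orthogonal projection, x and y are simply the pixel coordinates.
    ys, xs = tf.meshgrid(tf.range(h, dtype=tf.float32),
                         tf.range(w, dtype=tf.float32), indexing="ij")
    xs = tf.broadcast_to(xs, depth.shape)
    ys = tf.broadcast_to(ys, depth.shape)
    # Per-view camera-space points, shape [batch, n_views, H*W, 3].
    pts_cam = tf.reshape(tf.stack([xs, ys, depth], axis=-1), (b, v, h * w, 3))
    # Rotate/translate every view into the common canonical frame
    # (the matrix multiplication mentioned further below).
    pts_world = tf.einsum("vij,bvnj->bvni", cam_rot, pts_cam) \
                + cam_trans[None, :, None, :]
    # Keep only points where the mask predicted a hit.
    return tf.boolean_mask(pts_world, tf.reshape(mask, (b, v, h * w)))
```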


The structure generator and pseudo-rendering steps are completely static and can be added directly after the output layer of the pretrained network. Additionally, the loss is modified so that depth is only penalized where the mask predicted a hit, while the mask itself is trained with a cross-entropy loss. The results after extending the net look as follows.

While the results look worse at first, back-projecting the predicted depth reveals a point cloud that captures the essential details of the model.
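
As a rough sketch (not the actual thesis code), the modified loss described above could look like this; gating the depth term with the thresholded predicted mask and the equal weighting of both terms are assumptions.

```python
import tensorflow as tf

def depth_and_mask_loss(depth_pred, mask_logits, depth_gt, mask_gt):
    """Depth is only penalized where the mask predicted a hit; the mask
    itself is trained with cross-entropy. Layout [batch, H, W, n_views]."""
    # Binarize the predicted mask; stop_gradient keeps the gating from
    # interfering with the depth gradients.
    hit = tf.stop_gradient(tf.cast(tf.sigmoid(mask_logits) > 0.5, depth_pred.dtype))

    # L1 depth error, averaged over predicted hits only.
    depth_loss = tf.reduce_sum(tf.abs(depth_pred - depth_gt) * hit) \
                 / tf.maximum(tf.reduce_sum(hit), 1.0)

    # Cross-entropy on the mask channels.
    mask_loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=mask_gt, logits=mask_logits))

    return depth_loss + mask_loss
```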

 

Bottlenecks and the main improvement

While projecting the depth maps predicted in the first step can easily be implemented as a matrix-matrix multiplication, the first problems arise when rendering the generated point cloud. The max reduction over every pixel of the output image could be computed rather quickly, but TensorFlow provides no scatter operation with a max reduction that propagates a gradient (colliding values are added by default). Since there is no built-in function or quick workaround for this problem, the authors came up with a combination of scattering the data to a tensor much larger than the output image and pooling the scattered values.

points_in_grid_3: Original Resolution (l.), Double Resolution (m.), Quadruple Resolution (r.)

As long as every pixel is only hit once during the scattering process, max-pooling down to the wanted output resolution does not produce any errors. Unfortunately, to distribute the indexation more evenly, the x and y indices are learned in the same net as depth and mask instead of only learning single-channel depth values, which doubles the memory requirement of the NN itself. Additionally, the upscale factor described above (U) has to be chosen rather large, with a value of at least 30 to 35, to avoid collisions. The rendering process therefore takes up most of the available memory, while the network has to be kept comparably small.
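
The scatter-then-pool idea can be sketched roughly as follows for a single output map. The index layout and the use of tf.scatter_nd plus max pooling are assumptions about the implementation; the original code may differ in detail.

```python
import tensorflow as tf

def pseudo_render(points_xy, depth, U, out_size=32):
    """points_xy: [N, 2] int32 pixel coordinates in the U-times upsampled grid
    depth:        [N]    corresponding depth values
    U:            upscale factor (values of 30-35 are quoted above)"""
    big = out_size * U
    # tf.scatter_nd adds colliding values -- exactly the limitation discussed
    # above; a large U makes collisions rare.
    grid = tf.scatter_nd(points_xy, depth, shape=(big, big))
    # Max-pool the sparse grid back down to the wanted output resolution.
    grid = tf.reshape(grid, (1, big, big, 1))
    rendered = tf.nn.max_pool2d(grid, ksize=U, strides=U, padding="VALID")
    return tf.reshape(rendered, (out_size, out_size))
```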

I developed a strategy that halves the NN's memory consumption and cuts the memory usage of the pseudo-rendering stage to at most a third.

1. Reducing channels in the NN

Instead of learning mask values, depth values, x indices and y indices, the last two are dropped to free up memory. This directly affects the scattering process, since the number of collisions increases significantly. Because collisions are handled by addition by default, and this cannot be substituted by a more useful operation, another way of handling multiple data points for the same pixel had to be found.

2. A resourceful scattering scheme for max reduction

Since the index values are now computed deterministically, we can no longer rely on the structure of the data. To solve this issue, I decided to arrange all data points as 6-tuples [batch index, viewpoint index, x index, y index, z value, mask value]. By sorting these points by multiple columns, a structure can be achieved that clusters x indices belonging to the same depth map and the same viewpoint while separating them in the tensor itself. Since sorting data by more than one column is not directly supported by TensorFlow, a workaround has been implemented that fully supports gradient flow and is therefore applicable for training purposes; a sketch of such a multi-column sort follows after the figure below.

Sorting

Red marks collisions of indices after sorting by columns [3, 2, 1, 0] in the given order.
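
A minimal sketch of such a multi-column (lexicographic) sort, assuming the 6-tuple layout above and built only from tf.argsort and tf.gather, could look like this; the thesis implementation may differ.

```python
import tensorflow as tf

def lexsort_rows(data, columns):
    """data:    [N, 6] rows of [batch, view, x, y, depth, mask]
    columns: column indices to sort by, most significant first, e.g. [3, 2, 1, 0]

    TensorFlow has no lexicographic sort, but repeatedly applying a stable
    argsort column by column (least significant first) yields the same
    ordering, and tf.gather keeps the gradient flowing into the value columns."""
    order = tf.range(tf.shape(data)[0])
    for col in reversed(columns):                     # least significant column first
        keys = tf.gather(data[:, col], order)
        order = tf.gather(order, tf.argsort(keys, stable=True))
    return tf.gather(data, order)
```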

To be able to scatter the sorted data to a dense grid, all row indices should be unique. To achieve this, the x indices are padded to support up to U' different values per individual output map, which is done by multiplying the x index tensor by U' and padding between the values, as sketched below.
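
A possible sketch of this index construction, assuming sorted int32 x indices belonging to a single output map; the helper name and the run-length bookkeeping are illustrative assumptions.

```python
import tensorflow as tf

def unique_scatter_indices(x_idx, U_prime):
    """x_idx:   [N] sorted int32 x indices of one output map
    U_prime: padding factor (values well below 100 are quoted below for 32x32 maps)

    Each x index is multiplied by U' and colliding points (same x index)
    receive consecutive offsets, giving unique rows in a [dim_x * U', dim_y]
    scatter grid as long as no pixel column collects more than U' points."""
    n = tf.shape(x_idx)[0]
    pos = tf.range(n)
    # Mark the first element of every run of equal x indices.
    new_run = tf.concat([[True], tf.not_equal(x_idx[1:], x_idx[:-1])], axis=0)
    run_id = tf.cumsum(tf.cast(new_run, tf.int32)) - 1
    run_start = tf.cast(tf.squeeze(tf.where(new_run), axis=1), tf.int32)
    # Offset of each point within its run: 0, 1, 2, ...
    offsets = pos - tf.gather(run_start, run_id)
    # Unique row index in the padded grid.
    return x_idx * U_prime + offsets
```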


If U' is chosen big enough, all indices are now unique and the data can be scattered to a tensor of shape [dim_x * U', dim_y], with the advantage that more than U' collisions would have to occur in the original data before any harmful writes can happen. The original method was based on probability and needed far more space to perform reliably: for image dimensions of 32x32, the needed value of U' lies well below 100 to perform as well as the old method with U in a range between 30 and 40, which means U² elements per pixel.
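
Putting the quoted numbers side by side makes the difference in scatter-grid size explicit; the concrete values below are taken from the comparison above.

```python
# Scatter-grid sizes per 32x32 output map.
H = W = 32

for U in (30, 40):                       # old scheme: both axes upscaled by U
    cells = (H * U) * (W * U)
    print(f"old, U={U}:  {H*U} x {W*U} = {cells:,} cells ({U*U} per output pixel)")

U_prime = 100                            # new scheme: only the x axis padded by U'
cells = (H * U_prime) * W
print(f"new, U'={U_prime}: {H*U_prime} x {W} = {cells:,} cells ({U_prime} per output pixel)")
```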

The additional space gained by this approach can be used to work at higher image resolutions, which has yet to be tested thoroughly. The preliminary results on 32x32 pixels, with under 12 hours of training on an NVIDIA GTX 1070 (8 GB RAM), are more than promising.
