This is a paper about a new way of storing/merging 3D data.
The actual 3D reconstruction is so-so, I agree. And they kinda cheat by using ARKit (which uses LIDAR internally) to get good camera poses even if there is little texture.
So the novel part here is that they can immediately merge all the images into a coherent representation of the 3D space, as opposed to first doing bundle adjustment, then doing pairwise depth matching, then doing streak-based depth matching, and then merging the resulting point clouds.
Also, they can use learned 3D shape priors to improve their results. Basically that means "if there is no visible gap, assume the surface is flat". But AFAIK, that's not new.
EDIT: My main criticism of this paper, after looking at the source code a bit, is that due to the TSDF, which is essentially a dense 3D voxel grid, they need insane amounts of GPU memory, or else the scenes have to be either very small or low resolution. That is most likely also why the reconstruction looks so cartoon-like and is smooth on all corners: they lack the memory to store high-frequency detail.
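To make the memory argument concrete, here is a back-of-the-envelope sketch. The room dimensions and voxel size below are illustrative assumptions, not numbers from the paper; the point is just that a dense grid scales with the cube of resolution:

```python
# Cost of a *dense* TSDF grid (illustrative numbers, not from the paper):
# a 10 m x 10 m x 3 m room at 1 cm voxels, storing a float32 TSDF value
# plus a float32 integration weight per voxel.
room = (10.0, 10.0, 3.0)   # meters (assumed)
voxel = 0.01               # meters (assumed)

voxels = 1
for dim in room:
    voxels *= dim / voxel  # 1000 * 1000 * 300 = 3e8 voxels

bytes_per_voxel = 4 + 4    # TSDF value + weight
gib = voxels * bytes_per_voxel / 2**30
print(f"{gib:.1f} GiB")    # ~2.2 GiB at 1 cm voxels
```

Halving the voxel size multiplies that by 8, so 5 mm voxels for the same room already need ~18 GiB, which is why dense-grid methods either crop the scene or drop the resolution.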
EDIT2: Mainly, it looks like they managed to reduce the GPU memory consumption of Atlas [1] which is why they can reconstruct larger areas and/or higher resolution. But it's still far less detail than Colmap [2].
> TSDF memory isn’t an issue since Niessner et al. (2013).
I would strongly disagree. This paper uses a TSDF and runs into memory issues, and Atlas uses a TSDF and runs into memory issues. So for practical applications, TSDF is still too memory-hungry.
Try out our app, Metascan, to see an example of using TSDF with a multi-resolution GPU hashtable that only stores voxel data near surfaces. Or just skim the original voxel hashing paper from 2013 to understand the technique.
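Roughly, the voxel-hashing idea from that 2013 paper is: instead of one dense grid, keep small fixed-size voxel blocks in a hash map, and only allocate a block when an observation falls near a surface. A minimal sketch (block size, voxel size, and the helper names are my assumptions, not the paper's API):

```python
import numpy as np

# Sketch of sparse TSDF storage via block hashing: allocate small dense
# blocks in a hash map keyed by integer block coordinates, only where
# surface observations land. Empty space costs nothing.
BLOCK = 8     # voxels per block edge (assumed)
VOXEL = 0.01  # voxel size in meters (assumed)

blocks = {}   # (bx, by, bz) -> (tsdf array, weight array)

def block_key(p):
    """Integer block coordinates of the block containing world point p."""
    return tuple(np.floor(p / (BLOCK * VOXEL)).astype(int))

def allocate(p):
    """Allocate the block containing world point p, if not yet present."""
    key = block_key(p)
    if key not in blocks:
        blocks[key] = (np.ones((BLOCK,) * 3, np.float32),   # TSDF init
                       np.zeros((BLOCK,) * 3, np.float32))  # weights
    return blocks[key]

# Nearby surface points share a block, so only one block gets allocated:
for pt in np.array([[0.05, 0.02, 0.11], [0.06, 0.02, 0.11]]):
    allocate(pt)
print(len(blocks), "block(s) allocated")  # 1
```

Memory then scales with the surface area actually observed rather than with the bounding volume of the scene.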
Storing voxel data in an array is a lot simpler. So if it’s not the focus of the research, then why would academics engineer something more complex?
[1] https://github.com/magicleap/Atlas
[2] https://colmap.github.io/