Honestly this doesn't look any better than what we were doing back in 2016-2017. I'm not sure what's novel here.
This is the only video I could find, but we were doing monocular reconstruction from a limited number of RGB (not depth) images AND doing voxel segmentation on the processing side. https://www.youtube.com/watch?v=nqy44VSWh3g
Even as far back as 2010 people were doing reasonable monocular reconstruction including software like meshroom etc...the whole of TU Munich also under Matthias Niessner has been doing this for a while.
This is a paper about a new way of storing/merging 3D data.
The actual 3D reconstruction is so-so, I agree. And they kinda cheat by using ARKit (which uses LIDAR internally) to get good camera poses even if there is little texture.
So the novel part here is that they can immediately merge all the images into a coherent representation of the 3D space, as opposed to first doing bundle adjustment, then doing pairwise depth matching, then doing streak-based depth matching, and then merging the resulting point clouds.
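The incremental merging they describe boils down to the classic TSDF fusion update: each voxel keeps a running weighted average of the signed distances observed so far, so every new frame folds directly into one coherent volume instead of waiting for a batch pipeline. A minimal single-voxel sketch (my own illustrative code, not theirs):

```python
def fuse_voxel(d, w, d_new, w_new=1.0, w_max=64.0):
    """One TSDF fusion step for a single voxel: merge a new truncated
    signed-distance observation into the running weighted average."""
    d_out = (d * w + d_new * w_new) / (w + w_new)
    w_out = min(w + w_new, w_max)
    return d_out, w_out

# Two depth observations of the same voxel merge into one estimate:
d, w = fuse_voxel(0.0, 0.0, 0.5)   # first observation
d, w = fuse_voxel(d, w, 0.3)       # second observation averages in
print(d)  # 0.4
```

The update is order-independent up to the weight cap, which is exactly what makes "merge every frame immediately" possible.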
Also, they can use learned 3D shape priors to improve their results. Basically that means "if there is no visible gap, assume the surface is flat". But AFAIK, that's not new.
EDIT: My main criticism of this paper, after looking at the source code a bit, is that due to the TSDF, which is essentially a dense 3D voxel grid, they need insane amounts of GPU memory; otherwise the scenes either have to be very small or low resolution. That is most likely also the reason the reconstructions look so cartoon-like, with smoothed-over corners: they lack the memory to store high-frequency detail.
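The memory problem is easy to see with back-of-envelope arithmetic (numbers below are my own illustration, not from the paper): a dense grid storing a float32 distance plus a float32 weight per voxel blows up fast.

```python
def dense_tsdf_bytes(dims_m, voxel_m=0.01, bytes_per_voxel=8):
    """Memory for a dense TSDF grid: float32 distance + float32 weight
    per voxel, over a box of the given dimensions in meters."""
    nx, ny, nz = (round(d / voxel_m) for d in dims_m)
    return nx * ny * nz * bytes_per_voxel

# A single 10 m x 10 m x 3 m room at 1 cm resolution:
gb = dense_tsdf_bytes((10, 10, 3)) / 1e9
print(f"{gb:.1f} GB")  # 2.4 GB -- before any network activations
```

Halving the voxel size multiplies that by eight, which is why detail gets sacrificed first.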
EDIT2: Mainly, it looks like they managed to reduce the GPU memory consumption of Atlas [1], which is why they can reconstruct larger areas and/or at higher resolution. But it's still far less detail than Colmap [2].
> TSDF memory isn’t an issue since Niessner et al. (2013).
I would strongly disagree. This paper uses TSDF and runs into memory issues. And Atlas uses TSDF and runs into memory issues. So for practical applications, TSDF is still too memory hungry.
Try out our app, Metascan, to see an example of using TSDF with a multi-resolution GPU hashtable that only stores voxel data near surfaces. Or just skim the original voxel hashing paper from 2013 to understand the technique.
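The core idea of that voxel hashing paper is simple: instead of one dense array, keep a hash table of small voxel blocks and only allocate a block when an observation lands near a surface. A toy CPU sketch of the data structure (names and sizes are illustrative, not the actual GPU implementation):

```python
BLOCK = 8  # 8x8x8 voxels per allocated block

class HashedTSDF:
    """Sparse TSDF: a dict of voxel blocks, allocated on demand."""
    def __init__(self, voxel_size=0.01):
        self.voxel_size = voxel_size
        self.blocks = {}  # (bx, by, bz) -> flat list of (sdf, weight)

    def _key(self, x, y, z):
        b = self.voxel_size * BLOCK  # block edge length in meters
        return (int(x // b), int(y // b), int(z // b))

    def touch(self, x, y, z):
        """Allocate the block containing point (x, y, z) if needed."""
        key = self._key(x, y, z)
        if key not in self.blocks:
            self.blocks[key] = [(1.0, 0.0)] * BLOCK**3
        return self.blocks[key]

grid = HashedTSDF()
# Only points actually observed near surfaces allocate storage:
for x in (0.0, 0.05, 3.0):
    grid.touch(x, 0.0, 0.0)
print(len(grid.blocks))  # 2 -- not a dense room-sized array
```

Since most of a room's volume is empty air, the hash table stores a tiny fraction of what a dense grid would.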
Storing voxel data in an array is a lot simpler. So if it’s not the focus of the research, then why would academics engineer something more complex?
I haven't read this paper in detail. At a glance, it seems interesting, but not necessarily a huge breakthrough compared to the SOTA in monocular 3D reconstruction (which is not a criticism; most papers are incremental and that's fine). That said, what I find with a lot of work on neural techniques applied to 3D reconstruction/VSLAM (a pretty active area in the past few years) is that they're often not significantly more accurate than prior, non-ML techniques, and early on were often less accurate. On the other hand, they're often more robust. Classical geometry- and optimization-based approaches are fairly brittle, and tend to fail catastrophically or converge to totally implausible solutions in tricky conditions. Learning-based approaches tend to degrade more gracefully.
It's not a fair fight, but your example is much, much worse. The mesh reconstructions of the chairs in your example are not suitable for occlusion or physics.
6D.ai wasn’t released until 2018. You may be thinking of Abound, which you wanted to license in March 2017. I don’t know of anyone else that had real-time meshing in 2017.
That's impressive, in that it handles white walls well. Most of the algorithms for that sort of thing have trouble registering big uniform surfaces. That's why many tracking algorithms want objects with lots of texture detail. Some add texture by projecting a noise pattern on the surface being tracked. That's sometimes called "unstructured light", the opposite being projecting a structured pattern such as a grid. The first generation Kinect did that.
Of course, this network may do well here because it's trained on indoor scenes with walls of uniform height and width, so "uniform implies flat" is likely to emerge as a learned assumption. You also get to see the wall/floor joints and the wall/wall joints, so there are some references.
Remember the Tesla that hit the big white semitrailer because the algorithm couldn't measure depth to a uniform surface? This is a hard problem in unstructured situations.
Very impressive work. I have not seen any use cases for online 3D reconstruction unfortunately. 6D.ai made terrific progress in this tech but also could not find great use cases for online reconstruction and ended up having to sell to Niantic.
Seems like what people want, if they want 3D reconstruction, is extremely high-fidelity scans (a la Matterport), and they are willing to wait for the model. Unfortunately, TSDF approaches produce a "slimy" end look, which usually isn't what people are after if they want an accurate 3D reconstruction.
It SEEMS like online 3D reconstruction would be helpful, but I have yet to see a use case for "online"...
I'm very curious to see how well this would work for online terrain reconstruction. I've got a drone with a pretty powerful onboard computer and it's always nice to be able to solve and tune problems with software instead of additional (e.g. LIDAR) hardware.
Does anyone know what the state of the art is for doing this type of reconstruction as a streaming input to detection and recognition algorithms? For instance, this could be used for object detection and identification on a recycling conveyor line.
I don't believe that either does reconstruction... but for the recycling application, there are a handful of companies tackling this problem -- e.g. Everest Labs & Amp Robotics.
The big name in this is Bulk Handling Systems.[1] They build and sell large complete recycling plants. They do about 90% of the sorting with a series of mechanical processes. The robots do only the "quality control" step, pulling out items that got past the choppers, shakers, screens, air blasts, and magnets. That gets the outputs up to 99%+ of the desired product, so it can go into recycling processes that take in cardboard, plastics sorted by type, etc. It makes the robotics cost effective.
San Francisco's recycling operation, at Pier 95, uses a system built by them.
It's not a glamorous technology, but it gets the job done.
I don't think that's true. The paper says that a camera pose estimated by a SLAM system is required. ARKit implements SLAM and can easily provide camera pose for each frame through the ARFrame class. But there are countless other implementations of SLAM, including Android ARCore, Oculus Quest, Roomba, self-driving cars, and a number of GitHub repos (https://github.com/tzutalin/awesome-visual-slam).
I haven’t looked at any metrics, but based on using ARKit applications (and various VIO SLAM implementations), it can, though it depends heavily on the scene, the camera motion, and whether LIDAR/stereo depth is available.
Companies like Waymo and Cruise use this kind of technology too. Unfortunately there are tons of corner cases of weird things you haven't seen before -- for example, some special vehicles self-occlude and you never get enough coverage to observe them correctly until you're too close. In general, radars and lidars used in _conjunction_ with cameras can handle occluded objects much better.
Also, to measure the performance / evaluate observations generated from this tech, you would want to compare it to a pretty sizable 3D ground truth set which Tesla does not currently have. There are pretty big advantages to starting with a maximal set of sensors even if (eventually) breakthroughs turn them into unnecessary crutches.
I agree that there are pretty big advantages to starting with a maximal set of sensors, but that's only true when you're a company whose sole goal is building an autonomy stack, with a shit ton of money and a product you can modify because you own it.
That's not the case for Tesla. They started as an EV company that offered autonomy later, and since they were selling the product, they had to decide on a minimal set of sensors that would, in theory, still work, to keep the product cost reasonable.
Tesla does have sizeable 3D ground truth; they collected tons of data by mounting LiDAR on their test vehicles.
> Tesla does have sizeable 3D ground truth; they collected tons of data by mounting LiDAR on their test vehicles.
"Sizable" for evaluating safety has to be big enough to give small confidence intervals on your error -- and with self-driving, you need to cover a robust set of rare scenarios upsampled from general driving as well. I really doubt Tesla has what they need to convince themselves of higher levels of safety.
The use-case of 3D reconstruction vs real-time inference can be very different (but certainly related).
Monocular 3D reconstruction can require many frames at many angles (or possibly just a few frames at sparse angles).
The inference case with a self-driving vehicle may allow for this in some scenes, but certainly not all. Trying to infer the relative motion of, say, two moving vehicles and getting enough frames for monocular reconstruction may take longer than an emergency brake allows, not to mention the robustness issues of those pose estimates without additional sensors. Solving any one of these issues can be done, but I think it's pretty clear we're a bit away from solving all of them.