Real-Time Coherent 3D Reconstruction from Monocular Video (zju3dv.github.io)
178 points by samber on March 14, 2022 | hide | past | favorite | 42 comments


Honestly this doesn't look any better than what we were doing back in 2016-2017. I'm not sure what's novel here.

This is the only video I could find, but we were doing monocular reconstruction from a limited number of RGB (not depth) images AND doing voxel segmentation on the processing side. https://www.youtube.com/watch?v=nqy44VSWh3g

Even as far back as 2010 people were doing reasonable monocular reconstruction, including software like Meshroom etc., and the whole group at TU Munich under Matthias Niessner has been doing this for a while.

What's novel here?


Their research doesn't just integrate depth maps into a TSDF - it uses neural networks to incorporate surface priors.

I don't recall you having similar real-time meshing functionality in 2016-2017, Andrew. Can you show what you had?

As far as I'm aware, Abound was the first to demo real-time monocular mobile meshing: on Android in early 2017 (e.g. https://www.youtube.com/watch?v=K9CpT-sy7HE), and iOS in early 2018 (e.g. https://twitter.com/nobbis/status/972298968574013440).


This is a paper about a new way of storing/merging 3D data.

The actual 3D reconstruction is so-so, I agree. And they kinda cheat by using ARKit (which uses LIDAR internally) to get good camera poses even if there is little texture.

So the novel part here is that they can immediately merge all the images into a coherent representation of the 3D space, as opposed to first doing bundle adjustment, then doing pairwise depth matching, then doing streak-based depth matching, and then merging the resulting point clouds.

Also, they can use learned 3D shape priors to improve their results. Basically that means "if there is no visible gap, assume the surface is flat". But AFAIK, that's not new.

EDIT: My main criticism of this paper, after looking at the source code a bit, would be that due to the TSDF, which is like a dense 3D voxel grid, they need insane amounts of GPU memory, or else the scenes need to be either very small or low-resolution. That is most likely also the reason why the reconstruction looks so cartoon-like and is smooth on all corners: they lack the memory to store more high-frequency detail.
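To put rough numbers on the memory problem, here's a toy back-of-the-envelope sketch. `dense_tsdf_bytes` and `integrate` are made-up illustrations, not the paper's code: a dense TSDF stores a truncated distance plus a fusion weight for every voxel in the bounding volume, so memory grows with scene volume, not surface area.

```python
# Toy illustration of why dense TSDF grids eat GPU memory (hypothetical code,
# not from the paper): every voxel in the bounding box stores a truncated
# signed distance plus a fusion weight, even in empty space.

def dense_tsdf_bytes(extent_m, voxel_m, bytes_per_voxel=8):
    """Memory of a dense grid: float32 distance + float32 weight per voxel."""
    dims = [round(e / voxel_m) for e in extent_m]   # voxels per axis
    return dims[0] * dims[1] * dims[2] * bytes_per_voxel

def integrate(tsdf, weight, voxel_depth, observed_depth, trunc=0.05):
    """Fuse one depth observation into one voxel (weighted running average)."""
    sdf = observed_depth - voxel_depth        # signed distance along the ray
    if sdf < -trunc:
        return tsdf, weight                   # far behind the surface: skip
    d = min(1.0, sdf / trunc)                 # truncate to [-1, 1]
    return (tsdf * weight + d) / (weight + 1.0), weight + 1.0

# A 10 m x 10 m x 3 m room at 1 cm resolution is 300M voxels, ~2.4 GB --
# before storing any color, and mostly spent on empty air.
print(dense_tsdf_bytes((10.0, 10.0, 3.0), 0.01) / 1e9, "GB")
```

The standard fixes are coarser voxels (hence the cartoon-like smoothness) or sparse storage like the voxel hashing other commenters mention.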

EDIT2: Mainly, it looks like they managed to reduce the GPU memory consumption of Atlas [1], which is why they can reconstruct larger areas and/or higher resolution. But there's still far less detail than Colmap [2].

[1] https://github.com/magicleap/Atlas

[2] https://colmap.github.io/


ARKit doesn’t use LiDAR for camera pose tracking.

TSDF memory isn’t an issue since Niessner et al. (2013).


> ARKit doesn’t use LiDAR for camera pose tracking.

It doesn't by default, for power reasons, but it will in a pinch.


Interesting - I’ve never observed it kicking in.

Would love to know in which circumstances it’s used. I assume you work for Apple to know this, so I understand if you can’t share more.


> TSDF memory isn’t an issue since Niessner et al. (2013).

I would strongly disagree. This paper uses a TSDF and runs into memory issues. And Atlas uses a TSDF and runs into memory issues. So for practical applications, TSDF is still too memory-hungry.


Correlation != causation.

Try out our app, Metascan, to see an example of using TSDF with a multi-resolution GPU hashtable that only stores voxel data near surfaces. Or just skim the original voxel hashing paper from 2013 to understand the technique.

Storing voxel data in an array is a lot simpler. So if it’s not the focus of the research, then why would academics engineer something more complex?
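A minimal sketch of that idea (hypothetical toy code, not Metascan's implementation): key small voxel blocks by their integer block coordinates in a hash map and allocate them lazily, so memory scales with the observed surface rather than the bounding volume.

```python
# Toy sketch of voxel hashing (in the style of Niessner et al. 2013): voxel
# blocks are allocated on demand in a hash map instead of a dense array.

BLOCK = 8      # 8x8x8 voxels per block
VOXEL = 0.01   # 1 cm voxels

def block_key(p):
    """Integer coordinates of the block containing world-space point p."""
    return tuple(int(c // (BLOCK * VOXEL)) for c in p)

class HashedTSDF:
    def __init__(self):
        # block key -> flat (tsdf, weight) storage, created on first touch
        self.blocks = {}

    def touch(self, p):
        key = block_key(p)
        if key not in self.blocks:
            self.blocks[key] = [(1.0, 0.0)] * BLOCK ** 3
        return key

vol = HashedTSDF()
for x in range(100):                    # a 1 m line of surface samples
    vol.touch((x * VOXEL, 0.0, 1.0))
# Only the blocks near the observed surface exist, instead of a dense grid
# over the whole bounding box.
print(len(vol.blocks))
```

Real implementations keep the hash table and voxel blocks on the GPU and deal with collisions and streaming, but the memory-scaling argument is the same.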


Looks like a much better response to white walls/textureless surfaces.


I haven't read this paper in detail. At a glance, it seems interesting, but not necessarily a huge breakthrough compared to the SOTA in monocular 3D reconstruction (which is not a criticism; most papers are incremental and that's fine). That said, what I find with a lot of work applying neural techniques to 3D reconstruction/VSLAM (a pretty active area in the past few years) is that they're often not significantly more accurate than prior, non-ML techniques, and early on were often less accurate. On the other hand, they're often more robust. Classical geometry- and optimization-based approaches are fairly brittle, and tend to fail catastrophically or converge to totally implausible solutions in tricky conditions. Learning-based approaches tend to degrade more gracefully.


It's not a fair fight, but your example is much, much worse. The mesh reconstructions of the chairs in your example are not suitable for occlusion or physics.


Says it’s real-time


I always found ORB-SLAM2 pretty impressive; it can map 3D neighborhoods in real time while you drive around in a car:

https://www.youtube.com/watch?v=ufvPS5wJAx0

https://www.youtube.com/watch?v=3BrXWH6zRHg


ORB-SLAM2 is SLAM only (similar to ARKit w/o VIO) so only reconstructs feature points - no surfaces. This research is about surface reconstruction.



That requires depth input


Good point, I don't recall offhand the paper that was the mono-RT one.

At a minimum, though, 6D.ai and a few other companies were selling this as a service at least as far back as 2017.


6D.ai wasn’t released until 2018. You may be thinking of Abound, which you wanted to license in March 2017. I don’t know of anyone else that had real-time meshing in 2017.


Fast enough to be used for mobile robots?


That's impressive, in that it handles white walls well. Most of the algorithms for that sort of thing have trouble registering big uniform surfaces. That's why many tracking algorithms want objects with lots of texture detail. Some add texture by projecting a noise pattern on the surface being tracked. That's sometimes called "unstructured light", the opposite being projecting a structured pattern such as a grid. The first generation Kinect did that.

Of course, this network may do well on this because it's trained on indoor scenes with walls of uniform height and width. Uniform implies flat is likely to emerge as an assumption. You get to see the wall/floor joints and the wall/wall joints, so you have some references.

Remember the Tesla that hit the big white semitrailer because the algorithm couldn't measure depth to a uniform surface? This is a hard problem in unstructured situations.


It's not exactly the same, but Neural Radiance Fields are getting more impressive:

The first one was this, but it was slow: https://www.matthewtancik.com/nerf

Then it got faster: https://www.youtube.com/watch?v=fvXOjV7EHbk

Lots of interesting papers:

https://github.com/yenchenlin/awesome-NeRF


I am surprised that the team didn't choose to append "Fusion" to the name.

This seems to fit into the genealogy of KinectFusion, ElasticFusion, BundleFusion, etc.

https://www.microsoft.com/en-us/research/wp-content/uploads/... https://www.imperial.ac.uk/dyson-robotics-lab/downloads/elas... https://graphics.stanford.edu/projects/bundlefusion/

Very impressive work. I have not seen any use cases for online 3D reconstruction unfortunately. 6D.ai made terrific progress in this tech but also could not find great use cases for online reconstruction and ended up having to sell to Niantic.

Seems like what people want, if they want 3D reconstruction, is extremely high-fidelity scans (a la Matterport), and they're willing to wait for the model. Unfortunately, TSDF approaches create a "slimy" look, which isn't usually what people are after if they want an accurate 3D reconstruction.

It SEEMS like online 3D reconstruction would be helpful, but I have yet to see a use case for "online"...


Use case: Mobile robotics, lidar replacement in self-driving vehicles


I'm very curious to see how well this would work for online terrain reconstruction. I've got a drone with a pretty powerful onboard computer and it's always nice to be able to solve and tune problems with software instead of additional (e.g. LIDAR) hardware.


Cave exploration. Put the helmet on, never get lost.

Also YouTube videos and such, with more complex animated characters jumping around.

If they could convert that to a floor-plan diagram, it would be wonderful for UI design; there are lots of use cases for a map.


Does anyone know what the state of the art is for doing this type of reconstruction as a streaming input to detection and recognition algorithms? For instance, this could be used for object detection and identification on a recycling conveyor line.


I don't believe either of those does reconstruction... but for the recycling application, there are a handful of companies tackling this problem -- e.g. Everest Labs & Amp Robotics.


The big name in this is Bulk Handling Systems.[1] They build and sell large complete recycling plants. They do about 90% of the sorting with a series of mechanical processes. The robots do only the "quality control" step, pulling out items that got past the choppers, shakers, screens, air blasts, and magnets. That gets the outputs up to 99%+ of the desired product, so it can go into recycling processes that take in cardboard, plastics sorted by type, etc. It makes the robotics cost effective.

San Francisco's recycling operation, at Pier 95, uses a system built by them.

It's not a glamorous technology, but it gets the job done.

[1] https://www.max-ai.com/video-max-ai-autonomous-qc/


Looked cool, until I read that there is some Apple ARKit magic black box in the middle of it all.


I don't think that's true. The paper says that a camera pose estimated by a SLAM system is required. ARKit implements SLAM and can easily provide camera pose for each frame through the ARFrame class. But there are countless other implementations of SLAM, including Android ARCore, Oculus Quest, Roomba, self-driving cars, and a number of GitHub repos (https://github.com/tzutalin/awesome-visual-slam).


Yeah, I also find it odd to use LiDAR-based poses and then call it "monocular".


Agreed, I’m interested to see how sensitive it is to camera pose error, and especially long-term drift.


Shame there's no Android or iPhone app available


Can ARKit return accurate camera positions?


I haven’t looked at any metrics, but based on using ARKit applications (and various VIO SLAM implementations), it can. It depends heavily on the scene, camera motion, and whether there is LiDAR/stereo depth.


So much for the folks who think Tesla is on a fool’s errand when they’re using cameras instead of LIDAR.


Companies like Waymo and Cruise use this kind of technology too. Unfortunately there are tons of corner cases of weird things you haven't seen before -- for example, some special vehicles self-occlude and you never get enough coverage to observe them correctly until you're too close. In general, radars and lidars used in _conjunction_ with cameras can handle occluded objects much better.

Also, to measure the performance / evaluate observations generated from this tech, you would want to compare it to a pretty sizable 3D ground truth set which Tesla does not currently have. There are pretty big advantages to starting with a maximal set of sensors even if (eventually) breakthroughs turn them into unnecessary crutches.


I agree that there are pretty big advantages to starting with a maximal set of sensors, but that is only true when you are a company whose sole goal is building an autonomy stack, with a shit ton of money and a product that you can modify because you own it.

That's not true in the case of Tesla. They started as an EV company that offered autonomy later, and since they were selling the product, they had to decide on a minimal set of sensors that would, in theory, still work, to keep the product cost reasonable.

Tesla does have a sizeable 3D ground truth; they collected tons of data mounting LiDAR on their test vehicles.


> Tesla does have a sizeable 3D ground truth; they collected tons of data mounting LiDAR on their test vehicles.

"Sizable" for evaluating safety has to be big enough to give small confidence intervals on your error -- and with self-driving, you need to cover a robust set of rare scenarios upsampled from general driving as well. I really doubt Tesla has what they need to convince themselves of higher levels of safety.


That was very insightful. Do you work in that space? It is comments like yours that make HN a special place.


The use-case of 3D reconstruction vs real-time inference can be very different (but certainly related).

Monocular 3D reconstruction can require many frames at many angles (or possibly just a few frames at sparse angles).

The inference case with a self-driving vehicle may allow for this in some scenes, but certainly not all. Trying to infer the relative motion of, say, two moving vehicles and getting enough frames for monocular reconstruction may take longer than is available for an emergency brake, not to mention the robustness issues with those pose estimations without additional sensors. Any one of these issues can be solved, but I think it's pretty clear we're a bit away from solving all of them.


The failure mode (for example: decapitation; https://www.latimes.com/business/la-fi-tesla-florida-acciden...) is pretty significant when used in a Tesla. Less so in this tech demo.



