I haven't read this paper in detail. At a glance, it seems interesting, but not necessarily a huge breakthrough compared to SOTA in monocular 3D reconstruction (which is not a criticism; most papers are incremental and that's fine). That said, what I find with a lot of work on neural techniques applied to 3D reconstruction/VSLAM (a pretty active area in the past few years) is that they're often not significantly more accurate than prior, non-ML techniques, and early on they were often less accurate. On the other hand, they're often more robust. Classical geometry- and optimization-based approaches are fairly brittle, and tend to fail catastrophically or converge to totally implausible solutions in tricky conditions. Learning-based approaches tend to be better at degrading gracefully.