Did you consider the fts[0] family of functions for traversal? I use them along with a work queue for filtered entries to get pretty good performance with dedup[1]. For my use case I could avoid any separate stat call altogether, since the FTSENT already provided everything I needed.
Those are single-threaded, so they would have kneecapped performance pretty badly. 'du' from coreutils uses them, and you can see the drastic speed difference between that and my program in the README.
0 - https://linux.die.net/man/3/fts_read
1 - https://github.com/ttkb-oss/dedup/blob/6a906db5a940df71deb4f...
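For anyone unfamiliar with fts, here is a minimal sketch of the "no separate stat call" point: each FTSENT returned by fts_read() carries a pointer to stat data (fts_statp) unless FTS_NOSTAT is set, so file sizes can be read straight off the entry. This is not code from dedup or from the other program; total_file_bytes is a made-up helper name, and the walk is a single thread, which is exactly the limitation raised in the reply.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fts.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from dedup): sum the sizes of all regular
 * files under `root` using only the stat data fts already collected.
 * Returns -1 if the traversal cannot be opened.
 */
long long total_file_bytes(const char *root)
{
    /* fts_open() takes a NULL-terminated array of starting paths. */
    char *const paths[] = { (char *)root, NULL };

    /* FTS_PHYSICAL: do not follow symlinks.
     * FTS_NOCHDIR: do not change the working directory while walking. */
    FTS *fts = fts_open(paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
    if (fts == NULL)
        return -1;

    long long total = 0;
    FTSENT *ent;

    /* fts_read() returns NULL once the hierarchy is exhausted (or on error). */
    while ((ent = fts_read(fts)) != NULL) {
        /* FTS_F marks a regular file; fts_statp is already filled in,
         * so no extra stat()/lstat() call is needed here. */
        if (ent->fts_info == FTS_F)
            total += (long long)ent->fts_statp->st_size;
    }

    fts_close(fts);
    return total;
}
```

Calling total_file_bytes(".") walks the current directory tree. Entries that could not be stat'ed come back as FTS_NS and are simply skipped in this sketch; a real tool would probably want to report them.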