Nice work. Sometimes I wonder if there's any way to trade away accuracy for speed. Often I don't care _exactly_ how many bytes the biggest user of space takes up; I just want to see orders of magnitude.
Maybe there could be an iterative breadth-first approach, where first you quickly identify and discard the small unimportant items, passing over anything that can't be counted quickly. Then with what's left you identify the smallest of those and discard, and then with what's left the smallest of those, and repeat and repeat. Each pass through, you get a higher resolution picture of which directories and files are using the most space, and you just wait until you have the level of detail you need, but you get to see the tally as it happens across the board. Does this exist?
Something like that exists for btrfs; it's called btdu. It has the accuracy/time trade-off you're interested in, but the implementation is quite different: it samples random points on the disk and resolves which file path each one belongs to. The longer it runs, the more accurate it gets. Its README does a good job of explaining why this approach makes sense for btrfs and what its limitations are.
What you described is a neat idea, but it's not possible with any degree of accuracy, AFAIK. To give you a picture of the problem: calculating the disk usage of a directory requires calling statx(2) on every file in that directory, summing the reported sizes, and then recursing into every subdirectory and starting over. The problem with a partial search is that all the data lives at the leaves of the tree, so you'll miss some potentially very large files.
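To make that concrete, here's a minimal sketch of that walk in Python (os.scandir sits on top of getdents, and st_blocks * 512 is the actual allocation, which is what du reports; error handling is mostly elided):

```python
import os

def du(path):
    """Recursively sum on-disk usage of a tree, one stat per entry.

    st_blocks is always in 512-byte units regardless of the
    filesystem block size, so this matches du's notion of usage
    rather than the apparent st_size.
    """
    total = 0
    try:
        with os.scandir(path) as entries:  # getdents under the hood
            for entry in entries:
                st = entry.stat(follow_symlinks=False)
                total += st.st_blocks * 512
                if entry.is_dir(follow_symlinks=False):
                    total += du(entry.path)
    except PermissionError:
        pass  # unreadable directories count as zero in this sketch
    return total
```

Note that every single entry costs a stat before you know anything about the total, which is exactly why a "partial" walk can be arbitrarily wrong.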
Picture your program only traversing the first, say, three levels of subdirectories to get a rough estimate. If there were a 1TB file one level further down, your program would miss it completely and produce a wildly inaccurate estimate of the disk usage, so it wouldn't be useful at all for finding the biggest culprits. You have the same problem if you decide to stop counting after seeing N files, since file N+1 could be gigantic and you'd never know.
Yeah, maybe approximation isn't really possible. But it still seems like, if you could do, say, up to 1000 stats per directory per pass, then running totals could be accumulated incrementally and reported along the way.
So after just a second or two, you might know with certainty that a bunch of small directories are small, and that a handful of others are at least as big as whatever has been counted so far. That could be all you need, or you could wait longer to see how the bigger directories play out.
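That pass-based idea could look something like this sketch (purely hypothetical; the function name and budget parameter are mine). Each turn stats at most `stats_per_pass` entries of one directory, directories are serviced round-robin, and a running total is yielded after every slice, so it's a lower bound that only grows toward the exact answer:

```python
import os
from collections import deque

def progressive_du(root, stats_per_pass=1000):
    """Yield a growing lower bound on total disk usage of `root`.

    Directories are serviced round-robin; each turn stats at most
    `stats_per_pass` entries before moving on, so small directories
    finish early while big ones keep refining the total.
    """
    pending = deque([os.scandir(root)])
    total = 0
    while pending:
        it = pending.popleft()
        done = False
        for _ in range(stats_per_pass):
            entry = next(it, None)
            if entry is None:
                done = True
                it.close()
                break
            st = entry.stat(follow_symlinks=False)
            total += st.st_blocks * 512
            if entry.is_dir(follow_symlinks=False):
                pending.append(os.scandir(entry.path))
        if not done:
            pending.append(it)  # revisit this directory next pass
        yield total  # lower bound so far; final value is exact
```

You'd still do the same total work by the end, but you'd see per-pass totals immediately, which is the "wait only as long as you need" behavior described above.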
You would still have to getdents() everything, but this way you might indeed save on stat() operations. Those read inode information that is stored separately on disk, so eliminating them would likely help uncached runs.
You could sample files within a directory, or across directories, to get an average file size and use the total file count from getdents to estimate a total size. This does require knowing whether a directory entry is a file or a directory, which the d_type field gives you, depending on the OS, filesystem, and other factors. A filesystem-wide average file size could also be derived from statvfs().
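Here's a rough sketch of that sampling idea (names are mine): list a directory with os.scandir, which exposes d_type through entry.is_file() (usually without an extra stat, though filesystems that return DT_UNKNOWN force a fallback stat), then stat only a random sample and scale up by the count:

```python
import os
import random

def estimate_dir_size(path, sample=100):
    """Estimate total usage of the regular files directly in `path`
    by statting a random sample and scaling by the file count."""
    files = []
    with os.scandir(path) as it:
        for entry in it:
            # is_file() uses d_type where available, so listing is cheap
            if entry.is_file(follow_symlinks=False):
                files.append(entry.path)
    if not files:
        return 0
    picked = random.sample(files, min(sample, len(files)))
    avg = sum(os.lstat(p).st_blocks * 512 for p in picked) / len(picked)
    return round(avg * len(files))
```

The obvious weakness is the earlier objection in this thread: file sizes are heavy-tailed, so a sample that misses the one huge file can be off by orders of magnitude.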
Another trick is based on the fact that a directory's link count is 2 + the number of its subdirectories. Once you have seen that many subdirectories, you know there are none left to descend into. This could let you abort a getdents() scan of a very large directory early, using e.g. the directory's own size to estimate its total number of entries.
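A sketch of that trick (function name mine). One caveat worth flagging: not every filesystem follows the convention, e.g. btrfs reports st_nlink of 1 for directories, so this falls back to a full scan when the count looks unusual:

```python
import os

def find_subdirs(path):
    """List immediate subdirectories, stopping the scan early once
    st_nlink says there can't be any more.

    On traditional Unix filesystems a directory's link count is
    2 + (number of immediate subdirectories). Filesystems like btrfs
    report 1 instead, in which case we just scan everything.
    """
    nlink = os.stat(path).st_nlink
    expected = nlink - 2 if nlink >= 2 else None
    subdirs = []
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.name)
                if expected is not None and len(subdirs) == expected:
                    break  # every remaining entry must be a non-directory
    return subdirs
```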
This seems difficult, since I'm not aware of any way to get approximate file sizes, at least with the usual FS-agnostic system calls: to get any size information you're pretty much calling something in the `stat` family, and at that point you have the exact size.
I thought files can be sparse, with holes in the middle where nothing is allocated, so the file size isn't what's used to calculate usage; it's the sum of the extents or some such.
Yes, files can be sparse, but the actual disk usage (st_blocks) is also returned by these stat-family calls, so there is no special cost to handling sparse files.
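This is easy to check: make a file that is one big hole and compare st_size (the apparent size) with st_blocks * 512 (what's actually allocated, and what du reports). On any filesystem that supports holes the two differ wildly:

```python
import os
import tempfile

def sparse_demo():
    """Create a 100 MiB file that is all hole and report
    (apparent size, allocated bytes)."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, 100 * 1024 * 1024)  # extends with a hole, writes no data
        st = os.fstat(fd)
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)
        os.unlink(path)
```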
Is it? That would require every update to any file to cascade into a bunch of directory updates, amplifying the write, and for what? Do you run “du” in your shell prompt?
Not to mention it would likely be unable to handle the hardlink problem, so it would consistently be wrong.
Disks have improved so much in I/O and write speed that Windows will literally index your file system so you can search faster, and antivirus will scan files in the background before you open them. I don’t think maintaining size state on directories would be all that much of a challenge.
I expect performance would suffer quite a lot. In a system with high I/O, there would be a lot of contention on updating the size of such directories as /home or /tmp, let alone /.
Also, are you going to update a file’s size for every write (could easily be a thousand times if you’re copying over a 10MB file) or are you going to coalesce updates to file sizes? If the latter, how do you recover after a crash?
Virtual directories such as /dev and /proc would require special-casing.
Mounting and unmounting disks probably would require special-casing.
Haven’t many similar issues been solved in journaled file systems, and in things like database transaction logs and indexes? Real-time, high-precision accuracy isn’t required, and knowing how big a directory is is a frequent use case for directories. Hell, ‘df’ tracks this at the partition level, edge cases included, and ‘du’ computes the same thing on demand.
(Both databases use MVCC (https://en.wikipedia.org/wiki/Multiversion_concurrency_contr...) to ensure that concurrent queries all see the database in a consistent state. That makes it necessary to visit each row and check its timestamp when counting rows.)
I have a "du" command that has currently been running for ~50 hours. I'd much rather have the filesystem update a half-dozen directory entries on each write.