Nice work. Sometimes I wonder if there's any way to trade away accuracy for speed. Often I don't care _exactly_ how many bytes the biggest user of space takes up; I just want to see orders of magnitude.
Maybe there could be an iterative breadth-first approach, where first you quickly identify and discard the small unimportant items, passing over anything that can't be counted quickly. Then with what's left you identify the smallest of those and discard, and then with what's left the smallest of those, and repeat and repeat. Each pass through, you get a higher resolution picture of which directories and files are using the most space, and you just wait until you have the level of detail you need, but you get to see the tally as it happens across the board. Does this exist?
Something like that exists for btrfs; it's called btdu. It has the accuracy/time trade-off you're interested in, but the implementation is quite different: it samples random points on the disk and resolves which file path each one belongs to. The longer it runs, the more accurate it gets. Its README does a good job of explaining why this approach makes sense for btrfs and what its limitations are.
What you described is a neat idea, but it's not possible with any degree of accuracy, AFAIK. To give you a picture of the problem: calculating the disk usage of a directory requires calling statx(2) on every file in that directory, summing the reported sizes, and then recursing into every subdirectory and starting over. The problem with a partial search is that all the data lives at the leaves of the tree, so you'll miss some potentially very large files.
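To make that concrete, here's a minimal sketch of that walk in Python (os.scandir sits on top of getdents, and st_blocks * 512 is the actual allocation, which is what du reports; error handling is mostly elided):

```python
import os

def du(path):
    """Recursively sum on-disk usage of a tree, one stat per entry.

    st_blocks is always in 512-byte units regardless of the
    filesystem block size, so this matches du's notion of usage
    rather than the apparent st_size.
    """
    total = 0
    try:
        with os.scandir(path) as entries:  # getdents under the hood
            for entry in entries:
                st = entry.stat(follow_symlinks=False)
                total += st.st_blocks * 512
                if entry.is_dir(follow_symlinks=False):
                    total += du(entry.path)
    except PermissionError:
        pass  # unreadable directories count as zero in this sketch
    return total
```

Note that every single entry costs a stat before you know anything about the total, which is exactly why a "partial" walk can be arbitrarily wrong.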
Picture your program only traversing the first, say, three levels of subdirectories to get a rough estimate. If there were a 1TB file one level further down, your program would miss it completely and produce a wildly inaccurate estimate of the disk usage, so it wouldn't be useful at all for finding the biggest culprits. You have the same problem if you decide to stop counting after seeing N files, since file N+1 could be gigantic and you'd never know.
Yeah, maybe approximation isn't really possible. But it still seems like, if you could do, say, up to 1000 stats per directory per pass, then running totals could be accumulated incrementally and reported along the way.
So after just a second or two, you might know with certainty that a bunch of small directories are small, and that a handful of others are at least as big as whatever has been counted so far. That could be all you need, or you could wait longer to see how the bigger directories play out.
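That pass-based idea could look something like this sketch (purely hypothetical; the function name and budget parameter are mine). Each turn stats at most `stats_per_pass` entries of one directory, directories are serviced round-robin, and a running total is yielded after every slice, so it's a lower bound that only grows toward the exact answer:

```python
import os
from collections import deque

def progressive_du(root, stats_per_pass=1000):
    """Yield a growing lower bound on total disk usage of `root`.

    Directories are serviced round-robin; each turn stats at most
    `stats_per_pass` entries before moving on, so small directories
    finish early while big ones keep refining the total.
    """
    pending = deque([os.scandir(root)])
    total = 0
    while pending:
        it = pending.popleft()
        done = False
        for _ in range(stats_per_pass):
            entry = next(it, None)
            if entry is None:
                done = True
                it.close()
                break
            st = entry.stat(follow_symlinks=False)
            total += st.st_blocks * 512
            if entry.is_dir(follow_symlinks=False):
                pending.append(os.scandir(entry.path))
        if not done:
            pending.append(it)  # revisit this directory next pass
        yield total  # lower bound so far; final value is exact
```

You'd still do the same total work by the end, but you'd see per-pass totals immediately, which is the "wait only as long as you need" behavior described above.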
You would still have to getdents() everything, but this way you might indeed save on stat() operations. Those read inode information that is stored separately on disk, so eliminating them would likely help uncached runs.
You could sample files within a directory, or across directories, to get an average file size and use the total file count from getdents to estimate a total size. This does require knowing whether a directory entry is a file or a directory, which the d_type field gives you, depending on the OS, filesystem, and other factors. A filesystem-wide average file size could also be derived from statvfs().
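Here's a rough sketch of that sampling idea (names are mine): list a directory with os.scandir, which exposes d_type through entry.is_file() (usually without an extra stat, though filesystems that return DT_UNKNOWN force a fallback stat), then stat only a random sample and scale up by the count:

```python
import os
import random

def estimate_dir_size(path, sample=100):
    """Estimate total usage of the regular files directly in `path`
    by statting a random sample and scaling by the file count."""
    files = []
    with os.scandir(path) as it:
        for entry in it:
            # is_file() uses d_type where available, so listing is cheap
            if entry.is_file(follow_symlinks=False):
                files.append(entry.path)
    if not files:
        return 0
    picked = random.sample(files, min(sample, len(files)))
    avg = sum(os.lstat(p).st_blocks * 512 for p in picked) / len(picked)
    return round(avg * len(files))
```

The obvious weakness is the earlier objection in this thread: file sizes are heavy-tailed, so a sample that misses the one huge file can be off by orders of magnitude.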
Another trick is based on the fact that a directory's link count is 2 + the number of its subdirectories. Once you have seen that many subdirectories, you know there are none left to descend into. This could let you abort a getdents() scan of a very large directory early, using e.g. the directory's own size to estimate its total number of entries.
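A sketch of that trick (function name mine). One caveat worth flagging: not every filesystem follows the convention, e.g. btrfs reports st_nlink of 1 for directories, so this falls back to a full scan when the count looks unusual:

```python
import os

def find_subdirs(path):
    """List immediate subdirectories, stopping the scan early once
    st_nlink says there can't be any more.

    On traditional Unix filesystems a directory's link count is
    2 + (number of immediate subdirectories). Filesystems like btrfs
    report 1 instead, in which case we just scan everything.
    """
    nlink = os.stat(path).st_nlink
    expected = nlink - 2 if nlink >= 2 else None
    subdirs = []
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.name)
                if expected is not None and len(subdirs) == expected:
                    break  # every remaining entry must be a non-directory
    return subdirs
```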
This seems difficult, since I'm not aware of any way to get approximate file sizes, at least with the usual FS-agnostic system calls: to get any size information you're pretty much calling something in the `stat` family, and at that point you have the exact size.
I thought files can be sparse, with holes in the middle where nothing is allocated, so the file size isn't what's used to calculate usage; it's the sum of the extents or some such.
Yes, files can be sparse, but the actual disk usage (st_blocks) is also returned by these stat-family calls, so there is no special cost to handling sparse files.
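This is easy to check: make a file that is one big hole and compare st_size (the apparent size) with st_blocks * 512 (what's actually allocated, and what du reports). On any filesystem that supports holes the two differ wildly:

```python
import os
import tempfile

def sparse_demo():
    """Create a 100 MiB file that is all hole and report
    (apparent size, allocated bytes)."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, 100 * 1024 * 1024)  # extends with a hole, writes no data
        st = os.fstat(fd)
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)
        os.unlink(path)
```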
Is it? That would require every update to any file to cascade into a bunch of directory updates, amplifying the write, and for what? Do you run “du” in your shell prompt?
Not to mention it would likely be unable to handle the hardlink problem, so it would consistently be wrong.
Disks have improved so much in I/O and write speed that Windows will literally index your file system so you can search faster, and antivirus will scan files in the background before you open them. I don’t think maintaining size state on directories would be all that much of a challenge.
I expect performance would suffer quite a lot. In a system with high I/O, there would be a lot of contention on updating the size of such directories as /home or /tmp, let alone /.
Also, are you going to update a file’s size for every write (could easily be a thousand times if you’re copying over a 10MB file) or are you going to coalesce updates to file sizes? If the latter, how do you recover after a crash?
Virtual directories such as /dev and /proc would require special-casing.
Mounting and unmounting disks probably would require special-casing.
Haven’t many similar issues been solved in journaled file systems, and in things like database transaction logs and indexes? Real-time, high-precision accuracy isn’t required, and knowing how big a directory is is a frequent use case for directories. Hell, ‘df’ tracks this at the partition level, edge cases included, and ‘du’ computes the same thing on demand.
(Both databases use MVCC (https://en.wikipedia.org/wiki/Multiversion_concurrency_contr...) to ensure that concurrent queries all see the database in a consistent state. That makes it necessary to visit each row and check its timestamp when counting rows.)
I have a "du" command that has currently been running for ~50 hours. I'd much rather have the filesystem update a half-dozen directory entries on each write.