Some of your repliers are beating around the bush a little, so I'm going to come right out and say it: source control systems are for controlling source code. They are built from top to bottom around the idea that they are storing text files. They are built around the idea that a text-based patch is a meaningful thing to use. They are built around the idea that there is a reasonable merge algorithm for combining two people's changes.
They can be used on things that physically resemble source code but aren't, like text-based document formats (raw HTML, for instance), but you probably won't need their full power and, by design, they lack features that would be useful in that case. For your convenience they are capable of storing small quantities of binary data since most projects have a little here and there. But in both cases, you're a bit out-of-spec.
When you try to stuff tons of binary data into them, they break. They do not just break at the practical level, in the sense that operations take a long time but maybe someday somebody could fix them with enough performance work. They break at the conceptual level. Their whole worldview is no longer valid. The invariants they are built on are gone. It's not just a little problem, it's fundamental, and here's the really important thing: It can't be fixed without making them no longer great source control systems.
I use git on a fairly big repository and the scanning is currently at the "annoying" level for me, but the scanning is there for good reasons, reasons related to its use as a source control system. On my small personal projects it definitely helps me a lot.
SVN is the wrong tool for the job. I don't know what the right tool is. It may not even exist. But SVN is still the wrong tool. (If nothing else you could hack together one of those key/value stores that have been on HN lately and cobble together something with the resulting hash values.)
And, going back to the original link, criticizing git for not working with a repository with large numbers of binary files is not a very interesting critique. If Perforce does work under those circumstances, I would conclude they've almost certainly had to make tradeoffs that make it a less powerful source control system. Based on what I've heard from Perforce users and critics, that is an accurate conclusion. But I have no direct experience myself.
It can't be fixed without making them no longer great source control systems.
Why? Is there some deep architectural reason why Git can't perform like Perforce on large binary files? Something so deep it cannot ever be fixed? I've read through this whole thread and see no such reason yet, only hints that it exists.
Tradeoffs. silentbicycle's explanation is pretty good, but I want to call out the fact that you simply can not have an optimal source control system and an optimal binary blob management system. The two share a lot of similarities and there's a core that you could probably extract to use to build both, but when you're talking optimal systems, there are forces that are in conflict.
The problem is that if you aren't hip-deep in both systems, you often can't see the tradeoffs, or if someone explains them to you, you might say "But just do this and this and this and you're done!" Hopefully, you've had some experience of someone coming up to you and saying that about some system you've written, perhaps your boss, so you know how it just doesn't work that way, because it's never that easy. If you haven't had this experience, you probably won't understand this point until you have.
There are always tradeoffs.
Lately at my work, I've run into a series of issues as I get closer to optimal in some parts of the product I'm responsible for, where I have to make a decision that will either please one third of my customer base, or two thirds of my customer base. Neither side is wrong, doing both isn't feasible, and the losing customers call in and wonder why they can't have it their way, and there isn't an answer I can give that satisfies them... but nevertheless, I have to choose. (I don't want to give specific examples, but broad examples would include "case sensitivity" (either way you lose in some cases) or whether or not you show a particular moderately important unavoidable error message; half your customers are annoyed it shows up and the other half would call in to complain that it doesn't.) You can't have it all.
To my understanding: When git looks for changes, it scans all the tracked files (checking timestamps before hashing) and hashes those that look changed. Commits generate patches for all changed files, then generate hashes for each file, then hash the set of files in each directory for a hash of the directory, repeating until the state of the entire repository is collected into one hash. This is normally pretty fast, and has a lot of advantages as a way to represent state changes in the project, but it also means that if the project has several huge binary files sitting about (or thousands of large binaries, etc.), it will have to hash them as well. This requires a full pass through the file any time it looks like it might have changed, new files are added, etc. (Mercurial works very similarly, though the internal data structures are different.) Running sha1 on a 342M file just took about 9 seconds on my computer; this goes up linearly with file size.
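A minimal sketch of that scheme (illustrative only, not git's actual object format): hash each file, then hash each directory's sorted (name, hash) pairs, recursing until the whole tree collapses into one root hash.

```python
import hashlib
import os

def file_hash(path):
    # Full pass over the file contents -- this is the step that gets
    # slow for huge binaries (roughly linear in file size).
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_hash(directory):
    # Hash each entry, then hash the sorted (name, hash) pairs, so each
    # directory -- and transitively the whole tree -- gets one hash.
    h = hashlib.sha1()
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        digest = tree_hash(path) if os.path.isdir(path) else file_hash(path)
        h.update(name.encode() + b"\0" + digest.encode())
    return h.hexdigest()
```

Changing one byte anywhere changes the root hash, which is why whole-tree comparison is cheap once the hashes exist -- but computing them means touching every file that looks changed, huge binaries included.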
Git deals with the state of the tree as a whole, while Perforce, Subversion, and some others work at a file-by-file level, so they only need to scan the huge files when adding them, doing comparisons for updates, etc. (Updating or scanning for changes on perforce or subversion does scan the whole tree, though, which can be very slow.)
You can make git ignore the binary files via .gitignore, of course, and then they won't slow source tracking down anymore. You need to use something else to keep them in sync, though. (Rsync works well for me. Unison is supposed to be good for this, as well.) You can still track the file metadata, such as a checksum and a path to fetch it from automatically, in a text file in git. It won't be able to do merges on the binaries, but how often do you merge binaries?
Well, rsync has a mode where you say "look, if the file size and timestamp are the same, please just assume it hasn't changed - I'm happy with this and am willing to accept that if that isn't sufficient any resulting problems are mine".
I wonder if having some way to tell git to do the same for files with a particular extension/in a particular directory would get us a decent amount of the way there?
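The quick check rsync uses is simple to state: skip the expensive content comparison when size and mtime both match what was recorded last time. A stand-in sketch (not rsync's or git's actual code):

```python
import os

def looks_unchanged(path, recorded_size, recorded_mtime):
    # rsync-style quick check: if size and modification time match the
    # recorded values, assume the content is unchanged and skip hashing.
    # This trades certainty for speed -- a same-size in-place edit that
    # preserves the mtime would slip through.
    st = os.stat(path)
    return st.st_size == recorded_size and int(st.st_mtime) == int(recorded_mtime)
```

Applied per extension or per directory, a check like this would let the big binaries opt out of full rehashing while the source files keep the stricter treatment.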
While I haven't checked the source, I'm pretty sure by default git doesn't re-hash any files unless the timestamps have changed. (I know Mercurial doesn't.)
As I've said elsewhere in this thread, tracking metadata (path and sha1 hash) for large binary files in git and otherwise ignoring them via .gitignore works quite well. I'm pretty sure the "right tool" is either rsync or something similar.
You know, that got me thinking. ZFS will do versioning and, as a filesystem, you'd be keeping your binary data in it anyway. In ZFS, this versioning is implemented as a tree of data blocks, only those blocks that change between versions would be "new". If a block is unchanged, ZFS can exploit shared structure to avoid needless copying.
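A toy illustration of that copy-on-write sharing (nothing like ZFS's actual on-disk format): store fixed-size blocks under their content hash, and a new version of a file that changes one block adds only that block to the store.

```python
import hashlib

BLOCK = 4096  # fixed block size for this toy example

def store_file(data, store):
    # Split into fixed-size blocks and keep each under its content hash;
    # identical blocks across versions land in the same slot, so a new
    # version only adds the blocks that actually changed.
    refs = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        digest = hashlib.sha1(block).hexdigest()
        store[digest] = block  # no-op if the block is already stored
        refs.append(digest)
    return refs  # a "version" is just this list of block references

store = {}
v1 = store_file(b"A" * BLOCK + b"B" * BLOCK, store)
# Second version changes only the second block; the first is shared with v1.
v2 = store_file(b"A" * BLOCK + b"C" * BLOCK, store)
```

Two versions of an 8K file occupy only three blocks here, not four -- the unchanged block is referenced by both versions rather than copied.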
Right. You can do a lot of things (version control and encryption come to mind) at the filesystem level. A filesystem is a specialized kind of database, anyway, and databases are surprisingly versatile.
If memory serves, you can automatically mount daily snapshots of FreeBSD's standard filesystem. (I'm using OpenBSD, which is slightly different.)