Some of your repliers are beating around the bush a little, so I'm going to come right out and say it: source control systems are for controlling source code. They are built from top to bottom around the idea that they are storing text files. They are built around the idea that a text-based patch is a meaningful thing to use. They are built around the idea that there is a reasonable merge algorithm for combining two people's changes.
They can be used on things that physically resemble source code but aren't, like text-based document formats (raw HTML, for instance), but you probably won't need their full power and, by design, they lack features that would be useful in that case. For your convenience they are capable of storing small quantities of binary data since most projects have a little here and there. But in both cases, you're a bit out-of-spec.
When you try to stuff tons of binary data into them, they break. They do not just break at the practical level, in the sense that operations take a long time but maybe someday somebody could fix them with enough performance work. They break at the conceptual level. Their whole worldview is no longer valid. The invariants they are built on are gone. It's not just a little problem, it's fundamental, and here's the really important thing: It can't be fixed without making them no longer great source control systems.
I use git on a fairly big repository and the scanning is currently at the "annoying" level for me, but the scanning is there for good reasons, reasons related to its use as a source control system. On my small personal projects it definitely helps me a lot.
SVN is the wrong tool for the job. I don't know what the right tool is. It may not even exist. But SVN is still the wrong tool. (If nothing else you could hack together one of those key/value stores that have been on HN lately and cobble together something with the resulting hash values.)
And, going back to the original link, criticizing git for not working with a repository with large numbers of binary files is not a very interesting critique. If Perforce does work under those circumstances, I would conclude they've almost certainly had to make tradeoffs that make it a less powerful source control system. Based on what I've heard from Perforce users and critics, that is an accurate conclusion. But I have no direct experience myself.
It can't be fixed without making them no longer great source control systems.
Why? Is there some deep architectural reason why Git can't perform like Perforce on large binary files? Something so deep it cannot ever be fixed? I've read through this whole thread and see no such reason yet, only hints that it exists.
Tradeoffs. silentbicycle's explanation is pretty good, but I want to call out the fact that you simply can not have an optimal source control system and an optimal binary blob management system. The two share a lot of similarities and there's a core that you could probably extract to use to build both, but when you're talking optimal systems, there are forces that are in conflict.
The problem is that if you aren't hip-deep in both systems, you often can't see the tradeoffs, or if someone explains them to you, you might say "But just do this and this and this and you're done!" Hopefully, you've had some experience of someone coming up to you and saying that about some system you've written, perhaps your boss, so you know how it just doesn't work that way, because it's never that easy. If you haven't had this experience, you probably won't understand this point until you have.
There are always tradeoffs.
Lately at my work, I've run into a series of issues as I get closer to optimal in some parts of the product I'm responsible for, where I have to make a decision that will either please one third of my customer base, or two thirds of my customer base. Neither side is wrong, doing both isn't feasible, and the losing customers call in and wonder why they can't have it their way, and there isn't an answer I can give that satisfies them... but nevertheless, I have to choose. (I don't want to give specific examples, but broad examples would include "case sensitivity" (either way you lose in some cases) or whether or not you show a particular moderately important unavoidable error message; half your customers are annoyed it shows up and the other half would call in to complain that it doesn't.) You can't have it all.
To my understanding: When git looks for changes, it scans all the tracked files (checking timestamps before hashing) and hashes those that look changed. Commits generate patches for all changed files, then generate hashes for each file, then hash the set of files in each directory for a hash of the directory, repeating until the state of the entire repository is collected into one hash. This is normally pretty fast, and has a lot of advantages as a way to represent state changes in the project, but it also means that if the project has several huge binary files sitting about (or thousands of large binaries, etc.), it will have to hash them as well. This requires a full pass through the file any time it looks like it might have changed, new files are added, etc. (Mercurial works very similarly, though the internal data structures are different.) Running sha1 on a 342M file just took about 9 seconds on my computer; this goes up linearly with file size.
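A minimal sketch of that scheme (illustrative only, not git's actual object format): hash each file, then hash each directory's sorted (name, hash) pairs, recursing until the whole tree collapses into one root hash.

```python
import hashlib
import os

def file_hash(path):
    # Full pass over the file contents -- this is the step that gets
    # slow for huge binaries (roughly linear in file size).
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_hash(directory):
    # Hash each entry, then hash the sorted (name, hash) pairs, so each
    # directory -- and transitively the whole tree -- gets one hash.
    h = hashlib.sha1()
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        digest = tree_hash(path) if os.path.isdir(path) else file_hash(path)
        h.update(name.encode() + b"\0" + digest.encode())
    return h.hexdigest()
```

Changing one byte anywhere changes the root hash, which is why whole-tree comparison is cheap once the hashes exist -- but computing them means touching every file that looks changed, huge binaries included.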
Git deals with the state of the tree as a whole, while Perforce, Subversion, and some others work at a file-by-file level, so they only need to scan the huge files when adding them, doing comparisons for updates, etc. (Updating or scanning for changes on perforce or subversion does scan the whole tree, though, which can be very slow.)
You can make git ignore the binary files via .gitignore, of course, and then they won't slow source tracking down anymore. You need to use something else to keep them in sync, though. (Rsync works well for me. Unison is supposed to be good for this, as well.) You can still track the file metadata, such as a checksum and a path to fetch it from automatically, in a text file in git. It won't be able to do merges on the binaries, but how often do you merge binaries?
Well, rsync has a mode where you say "look, if the file size and timestamp are the same, please just assume it hasn't changed - I'm happy with this and am willing to accept that if that isn't sufficient any resulting problems are mine".
I wonder if having some way to tell git to do the same for files with a particular extension/in a particular directory would get us a decent amount of the way there?
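The quick check rsync uses is simple to state: skip the expensive content comparison when size and mtime both match what was recorded last time. A stand-in sketch (not rsync's or git's actual code):

```python
import os

def looks_unchanged(path, recorded_size, recorded_mtime):
    # rsync-style quick check: if size and modification time match the
    # recorded values, assume the content is unchanged and skip hashing.
    # This trades certainty for speed -- a same-size in-place edit that
    # preserves the mtime would slip through.
    st = os.stat(path)
    return st.st_size == recorded_size and int(st.st_mtime) == int(recorded_mtime)
```

Applied per extension or per directory, a check like this would let the big binaries opt out of full rehashing while the source files keep the stricter treatment.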
While I haven't checked the source, I'm pretty sure by default git doesn't re-hash any files unless the timestamps have changed. (I know Mercurial doesn't.)
As I've said elsewhere in this thread, tracking metadata (path and sha1 hash) for large binary files in git and otherwise ignoring them via .gitignore works quite well. I'm pretty sure the "right tool" is either rsync or something similar.
You know, that got me thinking. ZFS will do versioning and, as a filesystem, you'd be keeping your binary data in it anyway. In ZFS, this versioning is implemented as a tree of data blocks, only those blocks that change between versions would be "new". If a block is unchanged, ZFS can exploit shared structure to avoid needless copying.
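A toy illustration of that copy-on-write sharing (nothing like ZFS's actual on-disk format): store fixed-size blocks under their content hash, and a new version of a file that changes one block adds only that block to the store.

```python
import hashlib

BLOCK = 4096  # fixed block size for this toy example

def store_file(data, store):
    # Split into fixed-size blocks and keep each under its content hash;
    # identical blocks across versions land in the same slot, so a new
    # version only adds the blocks that actually changed.
    refs = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        digest = hashlib.sha1(block).hexdigest()
        store[digest] = block  # no-op if the block is already stored
        refs.append(digest)
    return refs  # a "version" is just this list of block references

store = {}
v1 = store_file(b"A" * BLOCK + b"B" * BLOCK, store)
# Second version changes only the second block; the first is shared with v1.
v2 = store_file(b"A" * BLOCK + b"C" * BLOCK, store)
```

Two versions of an 8K file occupy only three blocks here, not four -- the unchanged block is referenced by both versions rather than copied.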
Right. You can do a lot of things (version control and encryption come to mind) at the filesystem level. A filesystem is a specialized kind of database, anyway, and databases are surprisingly versatile.
If memory serves, you can automatically mount daily snapshots of FreeBSD's standard filesystem. (I'm using OpenBSD, which is slightly different.)