
With git-annex the size of the repo isn't as important; it's more about how many files you have stored in it.

I find git-annex to become a bit unwieldy at around 20k to 30k files, at least on modest hardware like a Raspberry Pi or a Core i3.

(This hasn't been a problem for my use case; I've just split things up into a couple of annex repos.)



> I find git-annex to become a bit unwieldy at around 20k to 30k files

Oh well there goes my plan to use it for the 500 million files we have in cloud storage!


I only recently started using it, but I think most of the limitation on metadata is from git itself. (Remember all the metadata is in git; the data is in the "annex".)

This person said they put over a million files in a single git repo and pushed it to GitHub, and that was in 2015.

https://www.monperrus.net/martin/one-million-files-on-git-an...

I'm using git annex for a repository with 100K+ files and it seems totally fine.

If you're running on a Raspberry Pi, YMMV, but IME Raspberry Pis are extremely slow at tasks like compiling CPython, so it wouldn't surprise me if they're also slow at running git.

I remember measuring this: a Raspberry Pi with a 5x lower clock rate than an Intel CPU (700 MHz vs. 3.5 GHz) was more like fifty times slower, not five times slower.

---

That said, 500 million is probably too many for one repo. But I would guess not all files need a globally consistent version, so you can have multiple repos.
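As a back-of-the-envelope sanity check (taking the ~100k files per repo reported above as a comfort level, not a hard limit), splitting 500 million files looks like this:

```shell
# If one git-annex repo stays comfortable up to roughly 100k files,
# 500 million files would need on the order of:
echo $(( 500000000 / 100000 )) repos   # prints "5000 repos"
# Per-customer sharding only helps if no single customer's file count
# blows far past that per-repo comfort zone.
```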

Also, df --inodes on my 4T and 8T drives shows around 240 million inodes, so you would most likely have to format a single drive in a special way. But it's not out of the question. I think the sync algorithms would probably get slow at that number.
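For reference, the inode count is fixed when the filesystem is created. A minimal sketch of checking and raising it, assuming GNU coreutils and ext4 (the mount point and device below are placeholders):

```shell
# Show total/used/free inodes per filesystem -- one inode per file or
# directory, fixed at mkfs time (GNU df; "/" used here as a stand-in).
df --inodes /

# mkfs.ext4 defaults to one inode per 16 KiB of disk, which is roughly
# where a 4 TB drive's ~240M-inode figure comes from. To get more,
# format with a smaller bytes-per-inode ratio (destroys existing data):
#   mkfs.ext4 -i 8192 /dev/sdX   # one inode per 8 KiB -> roughly double
```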

It's definitely not a cloud storage replacement now, but I guess my goal is to avoid cloud storage :) That said, git-annex is complementary to the cloud and has S3 and Glacier back ends, among many others.


Thanks for the info! I thought I was joking about the possibility of using git-annex in our case, but you've made me realize that it's not out of the realm of possibility.

We could certainly shard our usage e.g. by customer - they're enterprise customers so there aren't that many of them. We wouldn't be putting the files themselves into git anyway - using a cloud storage backend would be fine.

We currently export directory listings to BigQuery to allow us to analyze usage and generate lists of items to delete. We used to use bucket versioning but found that it made things harder to manage, so we now manage versioning ourselves. git-annex could potentially help manage the versioning, at least, and could also provide an easier way to browse and do simple queries on the file listings.



