Curious how this deals with moving training data around... If your dataset is a few GB, moving it around is a good bit of overhead, and a decent chunk of local disk space for the host system. Probably not bad if there's a consistent task, but it seems like a big problem if the tasks change often.
/* hypothesizing */
If you're using it for NLP, your dataset (token IDs) typically weighs much less than the intermediate tensors. So, I see two scenarios here:
(1) distribute data chunks as you train using more conventional BitTorrent-style systems (e.g. https://academictorrents.com, but internal)
(2) since you most likely use raw unlabeled data (e.g. just text), peers can crawl it straight from the web; rough sketch of that below
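To make (2) concrete, here's a minimal sketch of what a peer-side text stream could look like, assuming the Hugging Face datasets/transformers libraries; the corpus and tokenizer names are just illustrative, not anything the project prescribes:

    # Sketch of scenario (2): each peer streams raw text straight from
    # the web instead of keeping a local copy of the dataset.
    # Corpus and tokenizer choices here are illustrative assumptions.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # streaming=True pulls shards on demand, so local disk usage stays
    # near zero no matter how large the corpus is
    stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

    for example in stream:
        tokens = tokenizer(example["text"], truncation=True, max_length=512)
        # ...feed tokens["input_ids"] into the training step...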
Yeah, it's probably less of a concern for text tasks, where the data per example is relatively light (though there's a whole internet's worth of text data...)
I mostly work with audio, where individual examples are ~2MB, so dataset sizes get heavy very quickly (a million clips is already ~2TB).
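FWIW, one way to keep that kind of data off each host's disk is to shard the audio into tar files and stream shards over HTTP, in the spirit of scenario (1). A rough sketch with the webdataset library; the shard URL pattern and file keys are hypothetical:

    # Sketch: peers stream pre-sharded audio over HTTP rather than
    # copying the whole dataset locally. Assumes the webdataset
    # library; the shard URLs and the "flac" key are made up.
    import webdataset as wds

    # brace expansion selects 100 tar shards; any static HTTP host works
    urls = "https://example.com/shards/audio-{000000..000099}.tar"

    # samples arrive as dicts of raw bytes, one shard in flight at a time
    for sample in wds.WebDataset(urls):
        audio_bytes = sample["flac"]  # decode with soundfile/torchaudio
        # ...feature extraction / training step...

Each peer could also take a disjoint slice of the shard list, so no single machine ever holds (or downloads) the full ~2TB.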