Hacker News | bblcla's comments

Yeah, I agree! We looked into integrating Moonglow with academic clusters, because many of my ML PhD friends complained about using them. We unfortunately haven't found a good generalized solution, so I think VSCode's remote SSH + manual server management is probably the best option for now.


We make sure the remote containers have CUDA/PyTorch/NumPy/Matplotlib set up if you're using a GPU-based machine. It's actually far easier for me to run ML code through Moonglow now than on my MacBook - it's really nice to start with a clean environment every time, instead of having to deal with dependency hell.

We don't yet transfer the Python environment on the self-serve options, though for customers on AWS we'll help them create and maintain images with the packages they need.

I do have some ideas for making it easy to transfer environments over - it would probably involve letting people specify a requirements.txt and some apt dependencies and then automatically creating/deploying containers around that. Your idea of actually just detecting what's installed locally is pretty neat too, though.
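
To make the environment-transfer idea above concrete, here's a rough sketch of what generating a container spec from a requirements.txt plus some apt packages could look like. Everything here (the base image, file paths, function name) is illustrative, not Moonglow's actual implementation:

```python
# Sketch: turn a list of apt packages plus a requirements.txt into a
# Dockerfile. The CUDA base image and layout are hypothetical choices.

def build_dockerfile(apt_packages, base_image="nvidia/cuda:12.4.1-runtime-ubuntu22.04"):
    lines = [
        f"FROM {base_image}",
        # Install system-level dependencies the user listed
        "RUN apt-get update && apt-get install -y --no-install-recommends \\",
        "    " + " ".join(apt_packages) + " \\",
        "    && rm -rf /var/lib/apt/lists/*",
        # Then the Python environment from requirements.txt
        "COPY requirements.txt /tmp/requirements.txt",
        "RUN pip install --no-cache-dir -r /tmp/requirements.txt",
    ]
    return "\n".join(lines) + "\n"

print(build_dockerfile(["git", "ffmpeg"]))
```

A tool like this would then build and deploy the resulting image to the remote machine, so every session starts from the same clean environment.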


I'm not super familiar with dbx (though its docs at https://docs.databricks.com/en/archive/dev-tools/dbx/dbx.htm... suggest it's deprecated).

However, looking at its replacement here (https://docs.databricks.com/en/dev-tools/bundles/index.html) - I think we're trying to solve the same problems at different levels. My guess is Databricks is the right solution for big teams that need well-defined staging/prod/dev environments. We're targeting smaller teams that might be doing more of their own devops or are still at the 'using a bash script to run notebooks remotely' stage.


Thanks. I did not know dbx was deprecated and replaced.

Wouldn't targeting smaller teams lead to a lot of pricing pressure? Or do you think there's enough volume to justify that?


A lot of the people we've talked to who get the most value out of remote compute are doing really intensive stuff - they need server-level resources far beyond what you can find on a consumer laptop!

Hopefully someday you'll have 8 H100s on your MacBook, but I think we're still a long way away from that.


Thanks!

The big difference is that Google Colab runs in your web browser, whereas Moonglow lets you connect to compute in the VSCode/Cursor notebook interface. We've found a lot of people really like the code-completion in VSCode/Cursor and want to be able to access it while writing notebook code.

Colab only lets you connect to compute provided by Google. For instance, even Colab Pro doesn't offer H100s, whereas you can get that pretty easily on Runpod.


> Colab only lets you connect to compute provided by Google.

That is no longer true - you can use remote kernels on your own compute via colab: https://research.google.com/colaboratory/local-runtimes.html

There is also the same feature in CoCalc, including using the official colab Docker image: https://doc.cocalc.com/compute_server.html#onprem

CoCalc also supports one-click use of VSCode.

(The above might not work with runpod, since their execution environment is locked down. However it works with other clouds like Lambda, Hyperstack, etc.)


Ah, yeah, I misspoke, sorry. I was aware of that feature, but everyone I've talked to said it's so annoying to use they basically never use it, so I didn't think it was worth mentioning.

The big reason it's annoying is that (I believe) Colab still only lets you connect to runtimes running on your own computer - which is why, at the end of that article, they suggest using SSH port forwarding if you want to connect to a remote cluster. I know at least one company has written a hacky wrapper that researchers can use to connect to their own cluster through Colab, but it's not ideal.
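
For context, the port-forwarding workaround mentioned above boils down to tunneling the remote Jupyter port to your laptop so it looks "local" to Colab. A minimal sketch (the host name is hypothetical):

```python
# Build an SSH local-port-forward command: after running it, a Jupyter
# server on the cluster's port 8888 is reachable at localhost:8888.
def ssh_forward_cmd(remote, remote_port=8888, local_port=8888):
    # -N: don't run a remote command, -L: forward local_port to the remote port
    return ["ssh", "-N", "-L", f"{local_port}:localhost:{remote_port}", remote]

# e.g. subprocess.Popen(ssh_forward_cmd("user@gpu-cluster.example.com"))
```

The annoyance is that each researcher has to set this tunnel up (and keep it alive) themselves, which is the friction the hacky wrappers try to hide.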

I think Moonglow's target audience is slightly different than Colab's though because of the tight VSCode/Cursor integration - many people we've talked to said they really value the code-complete, which you can't get in any web frontend!


Interesting idea. I'm not very well-versed in training models or LLMs or even Jupyter Notebooks, but the comment about port forwarding SSH caught my eye since I work on a free, open source zero-trust overlay network (OpenZiti). I tried to find some information about moonglow under the hood / how it worked but didn't succeed.

If you're interested, you might find embedding OpenZiti into Moonglow a pretty compelling alternative to port forwarding, and it might open up even crazier ideas once your connectivity is embedded into the app. You can find the cheapest compute for people and just connect them to that cheapest compute using your extension... Might be interesting? Anyway, I'd be happy to discuss some time if that sounds neat... Until then, good luck with your launch!


Cool! We actually don't do port forwarding over SSH, we do it over an ngrok-like solution that we forked/modified. I looked at a few options while we were designing this, including Tailscale and ngrok, but none of them exactly suited our needs, and the pricing would have been prohibitive for something that's a pretty core part of our product.

OpenZiti looks really cool though - I'll take a look!


Is it possible to use OpenZiti with Runpod? Their execution environment is very locked down, which might make SSH the only option.


At a glance, RunPod's serverless and pod options would probably work well with OpenZiti. I didn't explore their vLLM option.

Using OpenZiti w/ Serverless probably means integrating an OpenZiti SDK with your serverless application. That way, it'll connect to the OpenZiti network every time it spawns.

The SDK option works anywhere you can deploy your application because it doesn't need any sidecar, agent, proxy, etc., so it's definitely the most flexible, and I can give you some examples if you mention the language or framework you're using.

The pod option says "container based" so it'll take some investigation to find out if an OpenZiti sidecar or other tunneling proxy is an option. Would you be looking to publish something running in RunPod (the server is in RunPod), or access something elsewhere from a RunPod pod (the client is in RunPod), or both?


I poked at it a bit but there was no free trial period. I know a bunch of people are using OpenZiti and zrok for Jupyter notebooks in general... Here's a blog I saw not long back that might help but I wasn't able to prove/test/try it... (sorry)

https://www.pogs.cafe/software/tunneling-sagemaker-kaggle


I don't actually know. I'll go poke with Runpod for a few and see :)


> I think Moonglow's target audience is slightly different than Colab's though because of the tight VSCode/Cursor integration - many people we've talked to said they really value the code-complete, which you can't get in any web frontend!

At the risk of repeating the famous Dropbox comment

I like the idea, and that ease of use is your selling point. But I don't know if that's actually a compelling enough reason. People who are that entrenched in the VSCode ecosystem wouldn't find it a problem to deploy a dockerized Nvidia GPU container and connect to their own compute instance via remote/tunnel plugins in VSCode, which one can argue makes more sense.

Congratulations on the launch and good luck with the product.


Thanks! I think the "deploy and connect" workflow is itself not super painful, but even if you're invested in VSCode, doing that again and again every day is pretty annoying (and it certainly was for me when I used to do ML), so hopefully the ease of use is valuable for people.


Thanks!

One thing I've found while working in the ML space is that ML researchers have to deal with a lot of systems cruft. I think that in the limit, ML researchers basically only care about having a few things set up well:

- secrets and environment management

- making sure their dependencies are installed

- efficient access to their data

- quick access to their code

- using expensive compute efficiently

But to get all this set up for their research they need to wade through a ton of documentation about git, bash, docker containers, mountpoints, availability zones, cluster management and other low-level systems topics.

I think there's space for something like Replit or Vercel for ML researchers, and Moonglow is a (very early!) attempt at creating something like it.


Good question! I'm not too familiar with Zed, but here's my high-level guess from reading the website: we don't currently integrate with Zed, but we probably could if it supports remote kernels (the docs I found at https://zed.dev/docs/repl weren't specific about it).

One nice thing about our VSCode extension is that it's not just a remote kernel - our extension also lets you see what kernels you have and other details, so we'd need to write something like it for Zed. We probably wouldn't do this unless there's a lot of demand.

By the way, VSCode also supports the # %% repl spec, and Moonglow does work with that (though we haven't optimized for it as much as we have for notebooks).
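
For anyone unfamiliar with the format: a `# %%` file is just a plain Python script where each `# %%` marker starts an interactive cell that editors like VSCode can run against a Jupyter kernel. A tiny example:

```python
# Each "# %%" line below begins a cell that can be executed independently
# in an editor's interactive window, while the file stays a runnable script.

# %%
import math
radius = 2.0
area = math.pi * radius ** 2

# %%
print(f"area = {area:.2f}")
```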


My impression is that if it shows up when you run the command jupyter kernelspec list, then it will work in Zed out of the box. Does it show up in that list?


I don't think so, if that jupyter command just runs against your local servers. We register the moonglow remote servers with VSCode through the extension, and my guess is we'd need to do something similar with Zed.
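
For background: jupyter kernelspec list only shows kernels that have a kernel.json on disk. A remote kernel could in principle be exposed that way, e.g. by launching ipykernel over SSH, though real setups also need the kernel's connection ports forwarded, and an extension-registered kernel like Moonglow's won't appear in that list. A sketch of what such a kernelspec might look like (the host is hypothetical):

```python
# Sketch of a kernel.json that launches ipykernel on a remote machine over
# SSH. "{connection_file}" is the placeholder Jupyter substitutes at launch.
import json

kernelspec = {
    "argv": [
        "ssh", "user@gpu-box",  # hypothetical remote host
        "python", "-m", "ipykernel_launcher", "-f", "{connection_file}",
    ],
    "display_name": "Remote GPU (sketch)",
    "language": "python",
}
print(json.dumps(kernelspec, indent=2))
```

Dropping a file like this into a kernels directory would make the kernel visible to jupyter kernelspec list, and thus to editors that read kernelspecs.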


We don't right now, but it's something a lot of people have asked for, so we're rolling out a file sync feature next week. (It will basically be a nice wrapper over rsync.)
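
A thin wrapper over rsync like the one described could look roughly like this; the flags, exclude list, and host are illustrative, not Moonglow's actual feature:

```python
# Sketch: build an rsync command that mirrors a local directory to a remote
# machine, skipping common noise like .git and __pycache__.
def rsync_cmd(src, remote, dest, exclude=(".git", "__pycache__")):
    cmd = ["rsync", "-az", "--delete"]  # archive mode, compress, mirror deletions
    for pattern in exclude:
        cmd += ["--exclude", pattern]
    cmd += [src, f"{remote}:{dest}"]
    return cmd

# e.g. subprocess.run(rsync_cmd("./project/", "user@gpu-box", "~/project/"))
```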


This part in particular caught my eye:

> Step 2: Diagnosing broken machines

> As is typical in setting up large GPU clusters, we found that about 10% of the machines failed to boot, mostly due to physical issues with the servers. Some issues we encountered included: unconnected or miswired Ethernet cables, hardware issues in iDRAC, broken power supply units, bad NVME (nonvolatile memory express) drives, missing internal wires, and network cards or GPUs failing to show up.

This is a crazy high failure rate. Is this standard for traditional data centers too?


I did some of the work in the post (though mostly post-setup).

Speaking in generalities: the initial failure rates of these units are much higher than those of traditional non-GPU machines.

In general, the failure rates decline significantly during the operating life of hardware. So you deal with a bunch of issues up front that you try to resolve to reach a much more stable state.

There was a recent Meta engineering blog post that echoed some of our own experiences wrangling GPUs and high performance networks: https://engineering.fb.com/2024/06/12/data-infrastructure/tr...


I have also heard that failure rates on new GPUs are very high (approaching 20% if not burnt in), so that's unsurprising.

It's the other stuff I was more surprised about. I would have guessed that having your Ethernet cables plugged in and power supplies tested was table stakes nowadays. Then again, I've never been a datacenter admin...


Friends don't let friends become Vietnamese billionaires!

