How many teams do you have using that cluster?

How large is your operations team?

What you’re saying makes it sound like you’re a one-person operation, or somewhere close to that scale. That obviously doesn’t have the same requirements as a much larger organization.

I ran a job last night which provisioned a cluster with 4 TB of RAM and nearly 1000 vCPUs. It ran for 20 minutes, ingested about 800 GB of data from nearly a million files, and was then deleted. To do that on a single cluster that’s also used for serving production requests would be unnecessarily complex and risky. Our production system has users in every timezone using it 24x7. At the very least you’d have to provision separate node pools anyway, but why would you bother to do that?