We run affordable pre-training clusters for startups, grad students, research labs, other cloud providers, and large enterprises.
We sell compute by the hour, not by the year. Need 768 H100s for a week? You got it. Want to do 32 runs on 8 H100s for 2 hours each? Sure. You can flexibly burst to large portions of the cluster so you can scale up your model without having to sign an expensive long-term contract.
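To make the hourly model concrete, here is a minimal sketch that converts the two example requests above into GPU-hours. The helper name `gpu_hours` is illustrative only and not part of any API, and no prices are assumed.

```python
def gpu_hours(num_gpus: int, hours: float, runs: int = 1) -> float:
    """Total GPU-hours for `runs` jobs, each using `num_gpus` GPUs for `hours` hours."""
    return num_gpus * hours * runs


# 768 H100s for one week (7 days x 24 hours).
week_burst = gpu_hours(768, 24 * 7)

# 32 runs on 8 H100s for 2 hours each.
sweep = gpu_hours(8, 2, runs=32)

print(week_burst)  # 129024
print(sweep)       # 512
```

Multiply either figure by your hourly rate per GPU to estimate the cost of a job.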
Yes. If you buy 10 nodes, you’ll get a 10-node cluster with 3.2 Tb/s InfiniBand.
Nodes are highly optimized VMs that perform like bare metal, currently one VM per host. You can SSH into each node and set up services like Kubernetes, Slurm, or your own orchestration stack.
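Because every node is reachable over SSH, a first step before installing an orchestration stack is often to run the same check on each host. The sketch below only builds the `ssh` command lines. The hostnames and the `ubuntu` user are hypothetical placeholders, not values we publish.

```python
def ssh_commands(hosts, remote_cmd, user="ubuntu"):
    """Build one ssh argument list per host for running `remote_cmd` remotely.

    `user` and the hostnames passed in are placeholders; substitute the
    login details you receive for your own cluster.
    """
    return [["ssh", f"{user}@{host}", remote_cmd] for host in hosts]


# Hypothetical hostnames for a two-node cluster.
hosts = ["node-01", "node-02"]

# `nvidia-smi -L` lists the GPUs visible on a machine.
cmds = ssh_commands(hosts, "nvidia-smi -L")

for cmd in cmds:
    print(" ".join(cmd))
    # To actually execute: subprocess.run(cmd, check=True)
```

Each list can be passed directly to `subprocess.run` once real hostnames are filled in.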
Yes. Right now it’s Ceph, but we’ll be rolling out other options soon.
Yes. Hardware failure rates are much higher on GPU clusters than on web servers. At a large enough scale, failures are guaranteed, so we’ve designed for them. We have strict hardware requirements and have seen just about everything that can go wrong. Unlike other providers, we refund for failed nodes and can repack your cluster onto nodes with healthy hardware.
If you buy a big enough cluster, we’ll set up a shared Slack channel with our engineering team. For emergencies, we give you a phone number that escalates to a page. Our team has managed over a billion dollars of hardware and worked on several of the world’s supercomputers.
We offer competitive pricing for large reservations of 512 to 4,096 GPUs, on volume-priced contracts that you can exit. Contact sales and let us know what you’re looking for.