Cluster Requirements

San Francisco Compute lists clusters from a number of data centers that have passed our audit and signed a contract with us that provides a strong SLA, which we then pass on to you. Our audit typically (but not always) involves visiting the data center to inspect physical security, cabling, and cooling.

For example, here are some of the requirements for listing a training cluster, such as the h100i instance type. If you're interested in listing, please contact us for full cluster requirements.

  • 1TB of RAM minimum
  • 2x CPUs from an approved list
  • Three types of networks
    • Primary in-band network for orchestration
    • Out-of-band IPMI management network, isolated from the primary network
    • High-bandwidth RDMA compute fabric, typically InfiniBand
  • At least 1Gbit/s per node of internet bandwidth, with redundant bond or failover uplink
  • Managed in-band ethernet switches from an approved list of vendors
  • Air or water cooling; no immersion cooling
  • UFM access (NVIDIA Unified Fabric Manager)
  • Public IPs for each node
  • Proper burn-in, with 48-hour cluster acceptance criteria
  • At least 2TB of highly available, high-IOPS networked NVMe storage per GPU compute node

Our VM infrastructure

When you buy from San Francisco Compute, you're buying SSH access to VMs. In most cases, there is one VM per node. We do not support multiple tenants on a single node.

Our infrastructure is called sfcd, the San Francisco Compute Daemon: a VM orchestrator we built in-house, written in Rust. Every node runs an sfcd instance beneath the VM as a low-resource process. We have done extensive testing to confirm that it adds no noticeable performance overhead.

Getting a test node

In rare cases, we can provide test nodes before you purchase from San Francisco Compute. However, the advantage of SFC is that you can buy whatever configuration you'd like for a short time period in order to run tests. We would encourage you to try this out first, with the knowledge that you're covered by an SLA.

Instance types

Today, we only support one instance type: h100i. These are clusters with NVIDIA H100s. They have 3.2 Tb/s InfiniBand. They're fully interconnected on a single RDMA fabric.

You can purchase one by running:

sf buy -t h100i -n 1 -s 'tomorrow at 10am' -d '1d'
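The same flags compose for larger or longer reservations. As a sketch (using only the flags shown above — `-t` instance type, `-n` node count, `-s` start time, `-d` duration — and a hypothetical eight-node, week-long configuration):

```shell
# Hypothetical variation of the command above: reserve eight h100i
# nodes for seven days, starting tomorrow at 10am.
# Flags assumed from the example: -t type, -n nodes, -s start, -d duration.
sf buy -t h100i -n 8 -s 'tomorrow at 10am' -d '7d'
```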

Hardware failures & refunds

Expect your cluster to break. Failure rates on large-scale GPU clusters are far higher than what you may be used to on web servers. Fear not! When (not if) a portion of your cluster breaks, we will attempt to provide a hot-swapped node. If we can't, we'll refund the purchase.

On a normal GPU cloud, that would be it: you simply have one less node. However, SFC is a market, which means that in many cases you can just buy another node with your refund. The price is not guaranteed to be the same as when you bought it, but we think this is a better experience than simply being out of luck.