Large Scale Inference

San Francisco Compute has partnered with Modular to create Large Scale Inference (LSI), the best-priced OpenAI-compatible inference in the world. On most open-source models, we're 85%+ cheaper than other options. We built LSI in close partnership with a tier-1 AI lab to help them print trillions of tokens of synthetic data, saving them tens of millions of dollars compared to the leading competitor.

Get a Large Scale Inference quote

Try it, then scale

We make it easy to try before you buy.

  • Want to run a free batch job to test us on your own data?

  • Need an instant endpoint you can hit right away?

  • Looking for a tailored quote in under 2 hours?

You got it. Billing is straightforward: pay only for the tokens you actually use.
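
For example, here's a minimal sketch of hitting an instant endpoint with the standard OpenAI Python client. The base URL and API key below are placeholders, not LSI's documented configuration; the model name is one from the table further down:

    # Minimal sketch: LSI is OpenAI-compatible, so the standard `openai` client
    # works once you point it at your endpoint. The base URL and key below are
    # placeholders -- substitute the values from your LSI dashboard or quote.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://YOUR-LSI-ENDPOINT/v1",  # placeholder endpoint
        api_key="YOUR_LSI_API_KEY",               # placeholder key
    )

    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",  # any model from the table below
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(response.choices[0].message.content)

Because the API surface matches OpenAI's, existing SDKs, agents, and eval harnesses generally work with only the base URL swapped.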

Supported models

We currently serve the models below at prices that are, on average, lower than every other current provider's. Exact prices, latency, and throughput depend on the use case and current market conditions. For a technical demo, get a quote.

Model | Hugging Face ID | Size | Price per 1M tokens (input / output)
gpt-oss-120b (New) | openai/gpt-oss-120b | 120B | $0.04 / $0.20
gpt-oss-20b (New) | openai/gpt-oss-20b | 20B | $0.02 / $0.08
Llama-3.1-405B-Instruct | meta-llama/Llama-3.1-405B-Instruct | 405B | $0.50 / $1.50
Llama-3.3-70B-Instruct | meta-llama/Llama-3.3-70B-Instruct | 70B | $0.052 / $0.156
 | | 8B | $0.008 / $0.02
 | | 11B | $0.072 / $0.072
Qwen2.5-72B-Instruct | Qwen/Qwen2.5-72B-Instruct | 72B | $0.065 / $1.25
 | | 72B | $0.125 / $0.325
 | | 32B | $0.125 / $0.325
 | | 32B | $0.05 / $0.15
 | | 30B | $0.05 / $0.15
Qwen3-14B | Qwen/Qwen3-14B | 14B | $0.04 / $0.12
Qwen3-8B | Qwen/Qwen3-8B | 8B | $0.014 / $0.055
 | | 32B | $0.075 / $0.225
Gemma-3-27B-it | google/gemma-3-27b-it | 27B | $0.05 / $0.15
Gemma-3-12B-it | google/gemma-3-12b-it | 12B | $0.04 / $0.08
Gemma-3-4B-it | google/gemma-3-4b-it | 4B | $0.016 / $0.032
 | | 24B | $0.04 / $0.08
 | | 12B | $0.02 / $0.06
 | | 78B | $0.125 / $0.325
 | | 38B | $0.125 / $0.325
 | | 14B | $0.072 / $0.072
 | | 9B | $0.05 / $0.05
DeepSeek-R1 (Coming Soon) | deepseek-ai/DeepSeek-R1 | 671B | $0.28 / $1.00
DeepSeek-V3 (Coming Soon) | deepseek-ai/DeepSeek-V3 | 671B | $0.112 / $0.456
Llama-4-Maverick-17B-128E-Instruct (Coming Soon) | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 400B | $0.075 / $0.425
Llama-4-Scout-17B-16E-Instruct (Coming Soon) | meta-llama/Llama-4-Scout-17B-16E-Instruct | 109B | $0.05 / $0.25
Qwen3-Coder-480B-A35B-Instruct (Coming Soon) | Qwen/Qwen3-Coder-480B-A35B-Instruct | 480B | $0.32 / $1.25
Qwen3-235B-A22B-Instruct-2507 (Coming Soon) | Qwen/Qwen3-235B-A22B-Instruct-2507 | 235B | $0.075 / $0.40
Kimi K2 (Coming Soon) | moonshotai/Kimi-K2-Instruct | 1T | $0.30 / $1.25
GLM 4.5 (Coming Soon) | zai-org/GLM-4.5 | 358B | $0.30 / $1.10
GLM 4.5 Air (Coming Soon) | zai-org/GLM-4.5-Air | 110B | $0.16 / $0.88
GLM 4.5V (Coming Soon) | zai-org/GLM-4.5V | 108B | $0.30 / $0.90

Didn't find what you were looking for? Get a custom quote for your use case

The best available pricing

Unlike other providers', our inference prices are market-based: the token price tracks the underlying market-based compute cost on SFC and the current load. In other words, we give you the best available price.
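
As a concrete example of what per-1M-token pricing means for a job, here's a back-of-envelope calculator. The rates are hard-coded from the table above; your quoted price will float with the market:

    # Back-of-envelope job cost from per-1M-token rates. Illustrative only:
    # actual LSI prices are market-based and move with compute cost and load.
    def job_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
        """Rates are USD per 1M tokens."""
        return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

    # 1T input tokens + 200B output tokens on gpt-oss-120b at $0.04 / $0.20:
    print(f"${job_cost(1_000_000_000_000, 200_000_000_000, 0.04, 0.20):,.2f}")
    # -> $80,000.00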

  • Summarization: summarize content up to 67% cheaper

  • Coding: generate code up to 2x cheaper

  • Tool use: use tools & MCPs in agents up to 67% cheaper

  • Real-time: get real-time responses up to 2x cheaper

  • Visual understanding: analyze charts, docs, and images up to 41% cheaper

The best available accuracy

Our MAX inference engine consistently outperforms other providers on accuracy across key benchmarks like DocVQA, MathVista, and ChartQA. We save you money and deliver more accurate results while we do it.

Up to 10% higher accuracy than other providers.

Built for trillion-token, sensitive, multimodal use cases

LSI natively supports very large-scale batch inference, with far higher rate limits and throughput than other providers. Unlike other services, we don't force you to upload petabytes of data to us. Our batch inference reads and writes to an S3-compatible object store, so your sensitive data isn't stored indefinitely on our servers. LSI natively handles multimodal use cases without forcing you to publicly share links to your content.
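
As a rough illustration of that flow, you stage requests as JSONL in a bucket you control and point the batch job at it. The bucket, key layout, and request schema below are assumptions for the sketch, not LSI's documented batch format:

    # Sketch: stage batch inputs in an S3-compatible bucket you control.
    # Bucket name, key layout, and request schema are illustrative assumptions.
    import json
    import boto3

    s3 = boto3.client("s3")  # any S3-compatible store works via endpoint_url

    requests = [
        {"custom_id": f"req-{i}",
         "body": {"model": "openai/gpt-oss-120b",
                  "messages": [{"role": "user", "content": prompt}]}}
        for i, prompt in enumerate(["First prompt", "Second prompt"])
    ]
    body = "\n".join(json.dumps(r) for r in requests).encode("utf-8")
    s3.put_object(Bucket="my-batch-bucket", Key="inputs/batch-001.jsonl", Body=body)
    # The batch job then reads s3://my-batch-bucket/inputs/ and writes results
    # to an output prefix you choose, so raw data never sits on our servers.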

Bespoke enterprise support

LSI is designed for large-scale, mostly enterprise, use cases. That lets us be more hands-on than traditional, self-serve providers.

  • Want a deployment behind your private network?

  • Need to hit specific latency, throughput, or uptime requirements?

  • Is there a model that's performing better in your evals, but we're not serving it?

Between Modular's world-class engineering & SFC's dramatic price optimization, we'll work with you to get the best possible price & performance.
