San Francisco Compute has partnered with Modular to create Large Scale Inference, the best-priced OpenAI-compatible inference in the world. On most open source models, we're 85%+ cheaper than other options. We built LSI in close partnership with a tier-1 AI lab to help them generate trillions of tokens of synthetic data, saving them tens of millions of dollars compared to the leading competitor.
We make it easy to try before you buy:

- Want to run a free batch job to test us on your own data?
- Need an instant endpoint you can hit right away? (See the sketch just below this list.)
- Looking for a tailored quote in under 2 hours?
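Because the API is OpenAI-compatible, an instant endpoint works with the standard OpenAI SDKs once they're pointed at a different base URL. Here's a minimal sketch in Python; the base URL and API key below are placeholders (assumptions, not real values), and the model ID is one from the table further down:

```python
# Minimal sketch of hitting an OpenAI-compatible LSI endpoint.
# The base_url and api_key are placeholders; use the values you're issued.
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-LSI-ENDPOINT/v1",  # placeholder, not the real endpoint
    api_key="YOUR_LSI_API_KEY",               # placeholder
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # any model ID from the table below
    messages=[{"role": "user", "content": "Summarize the benefits of batch inference in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Existing code that already talks to OpenAI should only need the base URL and API key swapped.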
We currently support the following models, on average below the cost of every other current provider. Exact prices, latency, and throughput depend on your use case and current market conditions. For a technical demo, get a quote.
| Model | Hugging Face ID | Size | Price per 1M tokens (input / output) |
|---|---|---|---|
| gpt-oss-120b *(New)* | `openai/gpt-oss-120b` | 120B | $0.04 / $0.20 |
| gpt-oss-20b *(New)* | `openai/gpt-oss-20b` | 20B | $0.02 / $0.08 |
| Llama-3.1-405B-Instruct | `meta-llama/Llama-3.1-405B-Instruct` | 405B | $0.50 / $1.50 |
| Llama-3.3-70B-Instruct | `meta-llama/Llama-3.3-70B-Instruct` | 70B | $0.052 / $0.156 |
| Llama-3.1-8B-Instruct | `meta-llama/Meta-Llama-3.1-8B-Instruct` | 8B | $0.008 / $0.02 |
| Llama 3.2 Vision | `meta-llama/Llama-3.2-11B-Vision-Instruct` | 11B | $0.072 / $0.072 |
| Qwen-2.5-72B-Instruct | `Qwen/Qwen2.5-72B-Instruct` | 72B | $0.065 / $1.25 |
| Qwen2.5-VL 72B | `Qwen/Qwen2.5-VL-72B-Instruct` | 72B | $0.125 / $0.325 |
| Qwen2.5-VL 32B | `Qwen/Qwen2.5-VL-32B-Instruct` | 32B | $0.125 / $0.325 |
| Qwen3 32B | `Qwen/Qwen3-32B` | 32B | $0.05 / $0.15 |
| Qwen3 30B A3B | `Qwen/Qwen3-30B-A3B-Instruct-2507` | 30B | $0.05 / $0.15 |
| Qwen3 14B | `Qwen/Qwen3-14B` | 14B | $0.04 / $0.12 |
| Qwen3 8B | `Qwen/Qwen3-8B` | 8B | $0.014 / $0.055 |
| QwQ-32B | `Qwen/QwQ-32B` | 32B | $0.075 / $0.225 |
| Gemma-3-27B-it | `google/gemma-3-27b-it` | 27B | $0.05 / $0.15 |
| Gemma-3-12B-it | `google/gemma-3-12b-it` | 12B | $0.04 / $0.08 |
| Gemma-3-4B-it | `google/gemma-3-4b-it` | 4B | $0.016 / $0.032 |
| Mistral Small 3.2 2506 | `mistralai/Mistral-Small-3.2-24B-Instruct-2506` | 24B | $0.04 / $0.08 |
| Mistral Nemo 2407 | `mistralai/Mistral-Nemo-Instruct-2407` | 12B | $0.02 / $0.06 |
| InternVL3-78B | `OpenGVLab/InternVL3-78B` | 78B | $0.125 / $0.325 |
| InternVL3-38B | `OpenGVLab/InternVL3-38B` | 38B | $0.125 / $0.325 |
| InternVL3-14B | `OpenGVLab/InternVL3-14B` | 14B | $0.072 / $0.072 |
| InternVL3-9B | `OpenGVLab/InternVL3-9B` | 9B | $0.05 / $0.05 |
| DeepSeek-R1 *(Coming Soon)* | `deepseek-ai/DeepSeek-R1` | 671B | $0.28 / $1.00 |
| DeepSeek-V3 *(Coming Soon)* | `deepseek-ai/DeepSeek-V3` | 671B | $0.112 / $0.456 |
| Llama-4-Maverick-17B-128E-Instruct *(Coming Soon)* | `meta-llama/Llama-4-Maverick-17B-128E-Instruct` | 400B | $0.075 / $0.425 |
| Llama-4-Scout-17B-Instruct *(Coming Soon)* | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | 109B | $0.05 / $0.25 |
| Qwen3 Coder 480B A35B *(Coming Soon)* | `Qwen/Qwen3-Coder-480B-A35B-Instruct` | 480B | $0.32 / $1.25 |
| Qwen3 235B A22B 2507 *(Coming Soon)* | `Qwen/Qwen3-235B-A22B-Instruct-2507` | 235B | $0.075 / $0.40 |
| Kimi K2 *(Coming Soon)* | `moonshotai/Kimi-K2-Instruct` | 1T | $0.30 / $1.25 |
| GLM 4.5 *(Coming Soon)* | `zai-org/GLM-4.5` | 358B | $0.30 / $1.10 |
| GLM 4.5 Air *(Coming Soon)* | `zai-org/GLM-4.5-Air` | 110B | $0.16 / $0.88 |
| GLM 4.5V *(Coming Soon)* | `zai-org/GLM-4.5V` | 108B | $0.30 / $0.90 |
Didn't find what you were looking for? Get a custom quote for your use case.
Unlike other providers, we price inference on a market basis: the token price tracks the underlying market-based compute cost on SFC and current load. In other words, we give you the best available price.
Our MAX inference engine consistently outperforms other providers on accuracy across key benchmarks like DocVQA, MathVista, and ChartQA. We save you money and deliver more accurate results while we do it.
LSI natively supports very large scale batch inference, with far higher rate limits and throughput than other providers. Unlike other services, we don't force you to upload petabytes of data to us: our batch inference reads from and writes to an S3-compatible object store, so your sensitive data isn't stored indefinitely on our servers. LSI also handles multimodal use cases natively, without forcing you to publicly share links to your content.
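The exact batch submission interface isn't covered here, but the data flow looks roughly like this: you stage requests as JSONL in your own S3-compatible bucket, LSI reads them, and results are written back to storage you control. A hedged sketch, assuming an OpenAI-style batch request schema and using boto3; the bucket name, object-store endpoint, and key names are made up for illustration:

```python
# Sketch: stage a batch of chat requests as JSONL in your own S3-compatible bucket.
# The bucket name, object-store endpoint, and request schema are assumptions for
# illustration; the point is that inputs and outputs live in storage you control.
import json
import boto3

requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Llama-3.3-70B-Instruct",
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Classify review A ...", "Classify review B ..."])
]

s3 = boto3.client("s3", endpoint_url="https://YOUR-OBJECT-STORE")  # any S3-compatible store
s3.put_object(
    Bucket="your-batch-bucket",
    Key="inputs/batch-001.jsonl",
    Body="\n".join(json.dumps(r) for r in requests).encode("utf-8"),
)
```

For multimodal batches, image content can be embedded in the request bodies (for example as base64 data URLs) rather than passed as public links, so nothing needs to be shared publicly.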
LSI is designed for large-scale, mostly enterprise use cases. That lets us be more hands-on than traditional self-serve providers:

- Want a deployment behind your private network?
- Need to hit specific latency, throughput, or uptime requirements?
- Is there a model that performs better in your evals, but we're not serving it yet?