Large Scale Inference

San Francisco Compute has partnered with Modular to create Large Scale Inference (LSI), the best-priced OpenAI-compatible inference in the world. On most open-source models, we're 85%+ cheaper than other options. We built LSI in close partnership with a tier-1 AI lab to help them print trillions of tokens of synthetic data, saving them tens of millions of dollars compared to the leading competitor.

Get a Large Scale Inference quote

Try it, then scale

We make it easy to try before you buy.

  • Want to run a free batch job to test us on your own data?

  • Need an instant endpoint you can hit right away?

  • Looking for a tailored quote in under 2 hours?

You got it. Billing is straightforward: pay only for the tokens you actually use.
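
For example, here's a minimal sketch of hitting an instant endpoint with the standard OpenAI Python client. The base URL and API key below are placeholders, not LSI's documented configuration; the model name is one from the table further down:

    # Minimal sketch: LSI is OpenAI-compatible, so the standard `openai` client
    # works once you point it at your endpoint. The base URL and key below are
    # placeholders -- substitute the values from your LSI dashboard or quote.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://YOUR-LSI-ENDPOINT/v1",  # placeholder endpoint
        api_key="YOUR_LSI_API_KEY",               # placeholder key
    )

    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",  # any model from the table below
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(response.choices[0].message.content)

Because the API surface matches OpenAI's, existing SDKs, agents, and eval harnesses generally work with only the base URL swapped.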

Supported models

We currently serve the models below at prices that are, on average, lower than every other current provider's. Exact prices, latency, and throughput depend on the use case and current market conditions. For a technical demo, get a quote.

Model | Hugging Face ID | Size | Price per 1M tokens (input / output)
gpt-oss-120b (New) | openai/gpt-oss-120b | 120B | $0.04 / $0.20
gpt-oss-20b (New) | openai/gpt-oss-20b | 20B | $0.02 / $0.08
Llama-3.1-405B-Instruct | meta-llama/Llama-3.1-405B-Instruct | 405B | $0.50 / $1.50
Llama-3.3-70B-Instruct | meta-llama/Llama-3.3-70B-Instruct | 70B | $0.052 / $0.156
 | | 8B | $0.008 / $0.02
 | | 11B | $0.072 / $0.072
Qwen2.5-72B-Instruct | Qwen/Qwen2.5-72B-Instruct | 72B | $0.065 / $1.25
 | | 72B | $0.125 / $0.325
 | | 32B | $0.125 / $0.325
 | | 32B | $0.05 / $0.15
 | | 30B | $0.05 / $0.15
Qwen3-14B | Qwen/Qwen3-14B | 14B | $0.04 / $0.12
Qwen3-8B | Qwen/Qwen3-8B | 8B | $0.014 / $0.055
 | | 32B | $0.075 / $0.225
Gemma-3-27B-it | google/gemma-3-27b-it | 27B | $0.05 / $0.15
Gemma-3-12B-it | google/gemma-3-12b-it | 12B | $0.04 / $0.08
Gemma-3-4B-it | google/gemma-3-4b-it | 4B | $0.016 / $0.032
 | | 24B | $0.04 / $0.08
 | | 12B | $0.02 / $0.06
 | | 78B | $0.125 / $0.325
 | | 38B | $0.125 / $0.325
 | | 14B | $0.072 / $0.072
 | | 9B | $0.05 / $0.05
DeepSeek-R1 (Coming Soon) | deepseek-ai/DeepSeek-R1 | 671B | $0.28 / $1.00
DeepSeek-V3 (Coming Soon) | deepseek-ai/DeepSeek-V3 | 671B | $0.112 / $0.456
Llama-4-Maverick-17B-128E-Instruct (Coming Soon) | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 400B | $0.075 / $0.425
Llama-4-Scout-17B-16E-Instruct (Coming Soon) | meta-llama/Llama-4-Scout-17B-16E-Instruct | 109B | $0.05 / $0.25
Qwen3-Coder-480B-A35B-Instruct (Coming Soon) | Qwen/Qwen3-Coder-480B-A35B-Instruct | 480B | $0.32 / $1.25
Qwen3-235B-A22B-Instruct-2507 (Coming Soon) | Qwen/Qwen3-235B-A22B-Instruct-2507 | 235B | $0.075 / $0.40
Kimi K2 (Coming Soon) | moonshotai/Kimi-K2-Instruct | 1T | $0.30 / $1.25
GLM 4.5 (Coming Soon) | zai-org/GLM-4.5 | 358B | $0.30 / $1.10
GLM 4.5 Air (Coming Soon) | zai-org/GLM-4.5-Air | 110B | $0.16 / $0.88
GLM 4.5V (Coming Soon) | zai-org/GLM-4.5V | 108B | $0.30 / $0.90

Didn't find what you were looking for? Get a custom quote for your use case

The best available pricing

Unlike other providers', our inference prices are market-based: the token price tracks the underlying market-based compute cost on SFC and the current load. In other words, we give you the best available price.
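
As a concrete example of what per-1M-token pricing means for a job, here's a back-of-envelope calculator. The rates are hard-coded from the table above; your quoted price will float with the market:

    # Back-of-envelope job cost from per-1M-token rates. Illustrative only:
    # actual LSI prices are market-based and move with compute cost and load.
    def job_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
        """Rates are USD per 1M tokens."""
        return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

    # 1T input tokens + 200B output tokens on gpt-oss-120b at $0.04 / $0.20:
    print(f"${job_cost(1_000_000_000_000, 200_000_000_000, 0.04, 0.20):,.2f}")
    # -> $80,000.00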

  • Summarization: summarize content up to 67% cheaper

  • Coding: generate code up to 2x cheaper

  • Tool use: use tools & MCPs in agents up to 67% cheaper

  • Real-time: get real-time responses up to 2x cheaper

  • Visual understanding: analyze charts, docs, and images up to 41% cheaper

The best available accuracy

Our MAX inference engine consistently outperforms other providers on accuracy across key benchmarks like DocVQA, MathVista, and ChartQA. We save you money and deliver more accurate results while we do it.

Up to 10% higher accuracy than other providers.

Built for trillion-token, sensitive, multimodal use cases

LSI natively supports very large-scale batch inference, with far higher rate limits and throughput than other providers. Unlike other services, we don't force you to upload petabytes of data to us. Our batch inference reads and writes to an S3-compatible object store, so your sensitive data isn't stored indefinitely on our servers. LSI natively handles multimodal use cases without forcing you to publicly share links to your content.
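
As a rough illustration of that flow, you stage requests as JSONL in a bucket you control and point the batch job at it. The bucket, key layout, and request schema below are assumptions for the sketch, not LSI's documented batch format:

    # Sketch: stage batch inputs in an S3-compatible bucket you control.
    # Bucket name, key layout, and request schema are illustrative assumptions.
    import json
    import boto3

    s3 = boto3.client("s3")  # any S3-compatible store works via endpoint_url

    requests = [
        {"custom_id": f"req-{i}",
         "body": {"model": "openai/gpt-oss-120b",
                  "messages": [{"role": "user", "content": prompt}]}}
        for i, prompt in enumerate(["First prompt", "Second prompt"])
    ]
    body = "\n".join(json.dumps(r) for r in requests).encode("utf-8")
    s3.put_object(Bucket="my-batch-bucket", Key="inputs/batch-001.jsonl", Body=body)
    # The batch job then reads s3://my-batch-bucket/inputs/ and writes results
    # to an output prefix you choose, so raw data never sits on our servers.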

Bespoke enterprise support

LSI is designed for large-scale, mostly enterprise, use cases. That lets us be more hands-on than traditional, self-serve providers.

  • Want a deployment behind your private network?

  • Need to hit specific latency, throughput, or uptime requirements?

  • Is there a model that's performing better in your evals, but we're not serving it?

Between Modular's world-class engineering & SFC's dramatic price optimization, we'll work with you to get the best possible price & performance.
