
# Recommended Model and Feature Matrices

Although vLLM TPU's new unified backend makes out-of-the-box, high-performance serving possible with any model supported in vLLM, we are still in the process of implementing a few core components. Until those capabilities land, we recommend starting from the list of stress-tested models and features below.

We are still landing components in tpu-inference that will improve performance for larger-scale, higher-complexity models (XL MoE, vision encoders, MLA, etc.).

If you'd like us to prioritize something specific, please file a feature request on GitHub.

The tables below show the models currently tested for accuracy and performance.

## Models

| Model | Type | Unit Test | Accuracy/Correctness | Benchmark |
|---|---|---|---|---|
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ | ✅ | ✅ |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | unverified | unverified | unverified |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct | Multimodal | unverified | unverified | unverified |
| Qwen/Qwen3-30B-A3B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-32B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-4B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | unverified | unverified | unverified |
| deepseek-ai/DeepSeek-V3.1 | Text | unverified | unverified | unverified |
| google/gemma-3-27b-it | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-Guard-4-12B | Text | ✅ | ✅ | ✅ |
| moonshotai/Kimi-K2-Thinking | Text | unverified | unverified | unverified |
| openai/gpt-oss-120b | Text | unverified | unverified | unverified |
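
Once you have picked a verified model from the table, it can be served with the standard vLLM CLI. A minimal sketch (the model name is taken from the table above; `--max-model-len` is a standard vLLM flag shown for illustration, not a TPU-specific requirement):

```shell
# Launch an OpenAI-compatible server with a verified model from the table.
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 4096

# Once the server is up (default port 8000), send a test completion request.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'
```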