
# Recommended Model and Feature Matrices

Although vLLM TPU's new unified backend makes out-of-the-box, high-performance serving possible with any model supported in vLLM, we are still in the process of implementing a few core components. Until those capabilities land, we recommend starting from the list of stress-tested models and features below.

We are still landing components in tpu-inference that will improve performance for larger-scale, higher-complexity models (XL MoE, vision encoders, MLA, etc.).

If you'd like us to prioritize something specific, please file a feature request on GitHub.

The tables below show the models currently tested for accuracy and performance.

## Models

| Model | Type | Unit Test | Accuracy/Correctness | Benchmark |
|---|---|---|---|---|
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ | ✅ | ✅ |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | unverified | unverified | unverified |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct | Multimodal | unverified | unverified | unverified |
| Qwen/Qwen3-30B-A3B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-32B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-4B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | unverified | unverified | unverified |
| deepseek-ai/DeepSeek-V3.1 | Text | unverified | unverified | unverified |
| google/gemma-3-27b-it | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-Guard-4-12B | Text | ✅ | ✅ | ✅ |
| moonshotai/Kimi-K2-Thinking | Text | unverified | unverified | unverified |
| openai/gpt-oss-120b | Text | unverified | unverified | unverified |
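
Once you have picked a verified model from the table, it can be served with the standard vLLM CLI. A minimal sketch (the model name is taken from the table above; `--max-model-len` is a standard vLLM flag shown for illustration, not a TPU-specific requirement):

```shell
# Launch an OpenAI-compatible server with a verified model from the table.
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 4096

# Once the server is up (default port 8000), send a test completion request.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'
```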