Cloud TPU Setup

This guide provides information on setting up and provisioning Google Cloud TPUs for use with tpu-inference.

TPU Versions and Topologies

Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs are available in several versions, each with different hardware specifications. For more information about TPUs, see TPU System Architecture.

The following TPU versions are compatible with tpu-inference:

Recommended

Experimental

These TPU versions let you configure the physical arrangement (topology) of the TPU chips, which can improve throughput and networking performance.

Quota and Pricing

To use Cloud TPUs, your Google Cloud project must have TPU quota granted. For more information, see TPU quota.

For TPU pricing information, see Cloud TPU pricing.

You may need additional persistent storage for your TPU VMs. For more information, see Storage options for Cloud TPU data.

Provisioning Cloud TPUs

You can provision Cloud TPUs using the Cloud TPU API or the queued resources API (preferred). This section shows how to create TPUs using the queued resources API.

Provision a Cloud TPU with the queued resources API

Use the following command to provision a Cloud TPU. Replace the parameters in all caps with your own values.

gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
  --node-id TPU_NAME \
  --project PROJECT_ID \
  --zone ZONE \
  --accelerator-type ACCELERATOR_TYPE \
  --runtime-version RUNTIME_VERSION \
  --service-account SERVICE_ACCOUNT

Parameter name	Description
QUEUED_RESOURCE_ID	The user-assigned ID of the queued resource request.
TPU_NAME	The user-assigned name of the TPU which is created when the queued resource request is allocated.
PROJECT_ID	The ID of your Google Cloud project.
ZONE	The Google Cloud zone where you want to create your Cloud TPU. The value you use depends on the version of TPUs you are using. For more information, see TPU regions and zones.
ACCELERATOR_TYPE	The TPU version and number of cores. For example, `v5litepod-4` specifies a v5e TPU with 4 cores, and `v6e-1` specifies a v6e TPU with 1 core. For more information, see TPU versions.
RUNTIME_VERSION	The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM with one or more v6e TPUs. For more information, see TPU software versions.
SERVICE_ACCOUNT	The email address for your service account. You can find it in the IAM Cloud Console under Service Accounts. For example: `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
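
The command and parameters above can be collected into a small script so that every placeholder is set in one place. The following is a minimal sketch rather than part of tpu-inference: the function name `build_create_cmd` and all default values (`my-queued-resource`, `my-tpu`, `my-project`, `us-east5-b`, and the service account address) are hypothetical placeholders, and only the gcloud flags come from the command above. It prints the command instead of running it, so you can review it first.

```shell
#!/bin/sh
set -eu

# Hypothetical example values (assumptions); replace each with your own,
# or export the variable before running this script.
QUEUED_RESOURCE_ID="${QUEUED_RESOURCE_ID:-my-queued-resource}"
TPU_NAME="${TPU_NAME:-my-tpu}"
PROJECT_ID="${PROJECT_ID:-my-project}"
ZONE="${ZONE:-us-east5-b}"
ACCELERATOR_TYPE="${ACCELERATOR_TYPE:-v6e-1}"
RUNTIME_VERSION="${RUNTIME_VERSION:-v2-alpha-tpuv6e}"
SERVICE_ACCOUNT="${SERVICE_ACCOUNT:-tpu-service-account@${PROJECT_ID}.iam.gserviceaccount.com}"

# Print the provisioning command, one flag per line, without executing it.
build_create_cmd() {
  printf '%s' "gcloud alpha compute tpus queued-resources create ${QUEUED_RESOURCE_ID}"
  # The format string is reused for each argument: backslash, newline, indent, flag.
  printf ' \\\n  %s' \
    "--node-id ${TPU_NAME}" \
    "--project ${PROJECT_ID}" \
    "--zone ${ZONE}" \
    "--accelerator-type ${ACCELERATOR_TYPE}" \
    "--runtime-version ${RUNTIME_VERSION}" \
    "--service-account ${SERVICE_ACCOUNT}"
  printf '\n'
}

build_create_cmd
```

After checking the printed command, you can paste it into your terminal, or pipe the script's output to `sh` to run it.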

Connect to your TPU VM using SSH:

gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE
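
Because the TPU is created only once the queued resource request is allocated, the SSH command can fail if you run it too early. The sketch below polls the request before connecting. It rests on assumptions not stated in this guide: that `gcloud alpha compute tpus queued-resources describe` exposes the request state at `state.state`, and that `ACTIVE`, `FAILED`, and `SUSPENDED` are among its values. The function names `get_state` and `wait_for_active` are hypothetical helpers.

```shell
#!/bin/sh

# Print the current state of a queued resource request.
# Assumption: the state is available at the `state.state` field.
get_state() {
  gcloud alpha compute tpus queued-resources describe "$1" \
    --project "$2" --zone "$3" --format='value(state.state)'
}

# Poll until the request is ACTIVE.
# Arguments: QUEUED_RESOURCE_ID PROJECT_ID ZONE [MAX_ATTEMPTS] [INTERVAL_SECONDS]
wait_for_active() {
  qr_id=$1
  project=$2
  zone=$3
  max_attempts=${4:-60}
  interval=${5:-30}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    state=$(get_state "$qr_id" "$project" "$zone") || return 1
    case "$state" in
      ACTIVE)
        echo "$state"
        return 0
        ;;
      FAILED|SUSPENDED)
        echo "queued resource is $state" >&2
        return 1
        ;;
    esac
    echo "state: ${state:-unknown} (attempt $attempt of $max_attempts)" >&2
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "timed out waiting for $qr_id" >&2
  return 1
}
```

For example, `wait_for_active QUEUED_RESOURCE_ID PROJECT_ID ZONE && gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE` connects only after the request reports `ACTIVE`.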

Note

When configuring RUNTIME_VERSION ("TPU software version") for your TPU, ensure it matches the TPU generation you've selected by referencing the TPU VM images compatibility matrix. Using an incompatible version may prevent vLLM from running correctly.
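
A check like the one in the note can be scripted before provisioning. The sketch below is illustrative only: `expected_runtime` and `check_runtime` are hypothetical helpers, and the lookup contains just the one pairing this guide states (`v2-alpha-tpuv6e` for v6e). For any other accelerator type it reports that no entry is known, and you should consult the compatibility matrix and extend the table yourself.

```shell
#!/bin/sh

# Print the runtime version known for an accelerator type.
# Only the v6e pairing from this guide is listed; add entries from the
# compatibility matrix as needed (assumption: types look like "v6e-<cores>").
expected_runtime() {
  case "$1" in
    v6e-*) echo "v2-alpha-tpuv6e" ;;
    *) return 1 ;;
  esac
}

# Compare a chosen runtime version against the known pairing.
# Returns 0 on match, 1 on mismatch, 2 when the accelerator type is not listed.
check_runtime() {
  accel=$1
  runtime=$2
  if ! expected=$(expected_runtime "$accel"); then
    echo "unknown: no entry for $accel; consult the compatibility matrix"
    return 2
  fi
  if [ "$runtime" = "$expected" ]; then
    echo "ok: $runtime matches $accel"
  else
    echo "mismatch: $accel expects $expected, got $runtime"
    return 1
  fi
}

check_runtime v6e-1 v2-alpha-tpuv6e
```

Keeping the lookup in a single `case` statement makes it easy to add more generations once you have confirmed their runtime versions.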