AI/ML orchestration on GKE documentation

Run optimized AI/ML workloads with the platform orchestration capabilities of Google Kubernetes Engine (GKE). GKE lets you implement a robust, production-ready AI/ML platform with all the benefits of managed Kubernetes and these capabilities:

  • Infrastructure orchestration that supports GPUs and TPUs for training and serving workloads at scale.
  • Flexible integration with distributed computing and data processing frameworks.
  • Support for multiple teams on the same infrastructure to maximize resource utilization.

This page provides an overview of the AI/ML capabilities of GKE and how to get started running optimized AI/ML workloads on GKE with GPUs, TPUs, and frameworks like Hugging Face TGI, vLLM, and JetStream.
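To make the GPU orchestration concrete, here is a minimal sketch of a Pod manifest that schedules an inference workload onto a GPU node in GKE. It assumes a cluster with a GPU node pool already exists; the Pod name, container image path, and accelerator type (`nvidia-tesla-t4`) are illustrative placeholders, not values prescribed by this page.

```yaml
# Sketch: request one NVIDIA GPU for an inference container on GKE.
# The node selector targets nodes in a GPU node pool (hypothetical
# accelerator type shown); the nvidia.com/gpu limit tells the scheduler
# to place the Pod on a node with a free GPU.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference            # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: inference
    image: us-docker.pkg.dev/PROJECT_ID/repo/server:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1        # one GPU per Pod replica
```

TPU workloads follow the same pattern with TPU-specific node selectors and the `google.com/tpu` resource name instead.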

