Gemma-SRE: Self-Hosted vLLM Infrastructure Agent

--- title: Gemma-SRE: Self-Hosted vLLM Infrastructure Agent published: true date: 2026-03-26 19:33:26 UTC tags: gemma,mcpserver,TPUSprint,vllm series: AI canonical_url: https://medium.com/google-cloud/gemma-sre-self-hosted-vllm-infrastructure-agent-afda000c0943 --- Gemma-SRE is a high-performance, private DevOps and Site Reliability Engineering (SRE) assistant designed for the TPU Sprint. It leverages Gemma 2 models, self-hosted via vLLM on Google Cloud Run GPUs (NVIDIA L4) or GKE TPUs (v5e), to bridge the gap between cloud infrastructure and intelligent automation. ![](https://cdn-images-1.medium.com/max/1024/1*NYTrw-WFDKdGQ4RfwM_Idg.png) #### What is this project trying to Do? By integrating directly with Google Cloud Logging and the Vertex AI Model Registry, the SRE agent autonomously analyzes system errors, suggests production-ready remediation plans, and manages its own scalable inference stack. The project features a robust Model-as-Code approach, providing automated deployment templates that include optimized configurations for startup probes, GCS FUSE storage, and XLA-accelerate performance, ensuring a secure and cost-efficient private inference endpoint for any Google Cloud project. #### Is this yet another Python MCP Demo? Yes. Python has traditionally been the main coding language for ML and AI tools. The goal of this article is to provide a minimal viable vLLM MCP server that can be used with tools such as Gemini CLI or the Agent Development Kit (ADK). #### So — What does this Project Actually Do ? This project is an automated SRE and DevOps assistant that uses a private, self-hosted AI (Gemma) to help you manage and troubleshoot Google Cloud infrastructure for deploying self-hosted models. Specifically, it performs three main functions: #### AI-Driven Troubleshooting The agent connects to Google Cloud Logging to identify errors in your environment. It uses a self-hosted vLLM inference server to: - Summarize Logs: Turn raw, messy error logs into high-level summaries of what is actually failing. - Suggest Remediation: Provide expert 3-step fix plans for specific incidents (like memory limits being exceeded or container crashes). #### Infrastructure Automation The project includes a full "Inference Stack" manager. It can: - Deploy vLLM to Cloud Run: Automatically generate and execute commands to deploy an NVIDIA L4 GPU-accelerated server - Manage Scaling: Adjust min/max instances and check the health of the inference service - Generate Configs: Create Kubernetes (GKE) manifests for running vLLM on TPU v5e chips. #### Model Weight Management It bridges the gap between model providers and your private cloud: - Vertex AI & Kaggle Integration: Provides guides and commands to pull Gemma model weights from Vertex AI Model Garden or Kaggle - GCS FUSE Setup: Handles the complex configuration needed to mount a Google Cloud Storage bucket as a local disk for vLLM, allowing you to run models without manually baking them into container images. In short: It is an MCP (Model Context Protocol) server that gives an LLM (like Claude or Gemini) the "hands" to see your logs and the "tools" to deploy and manage its own TPU and GPU-powered inference infrastructure. #### Python Version Management One of the downsides of the wide deployment of Python has been managing the language versions across platforms and maintaining a supported version. The **pyenv** tool enables deploying consistent versions of Python: [GitHub - pyenv/pyenv: Simple Python version management](https://github.com/pyenv/pyenv) As of writing — the mainstream python version is 3.13. To validate your current Python: ```console xbill@penguin:~$ python --version Python 3.13.12 xbill@penguin:~$ pyenv version 3.13.12 (set by /home/xbill/.pyenv/version) ``` #### Gemini CLI If not pre-installed you can download the Gemini CLI to interact with the source files and provide real-time assistance: ```shell npm install -g @google/gemini-cli ``` #### Testing the Gemini CLI Environment Once you have all the tools and the correct Node.js version in place- you can test the startup of Gemini CLI. You will need to authenticate with a Key or your Google Account: ```plaintext gemini ▝▜▄ Gemini CLI v0.35.0 ▝▜▄ ▗▟▀ Signed in with Google: [email protected] /auth ▝▀ Plan: Gemini Code Assist in Google One AI Pro /upgrade ``` #### Node Version Management Gemini CLI needs a consistent, up to date version of Node. The **nvm** command can be used to get a standard Node environment: [GitHub - nvm-sh/nvm: Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions](https://github.com/nvm-sh/nvm) #### Python MCP Documentation The official GitHub Repo provides samples and documentation for getting started: [GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk) #### Where do I start? The strategy for starting MCP development for vLLM management is a incremental step by step approach. First, the basic development environment is setup with the required system variables, and a working Gemini CLI configuration. Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Gemini CLI in the local environment. This setup validates the connection from Gemini CLI to the local server via MCP. The MCP client (Gemini CLI) and the Python MCP server both run in the same local environment. #### Setup the Basic Environment At this point you should have a working Python environment and a working Gemini CLI installation. The next step is to clone the GitHub samples repository with support scripts: ```shell cd ~ git clone https://github.com/xbill9/aisprintapr2026 ``` Then run **init.sh** from the cloned directory. The script will attempt to determine your shell environment and set the correct variables: ```shell cd aisprintapr2026 source init.sh ``` If your session times out or you need to re-authenticate- you can run the **set\_env.sh** script to reset your environment variables: ```shell cd aisprintapr2026 source set_env.sh ``` Variables like PROJECT\_ID need to be setup for use in the various build scripts- so the set\_env script can be used to reset the environment if you time-out. #### vLLM Management Tool with MCP Stdio Transport One of the key features that the standard MCP libraries provide is abstracting various transport methods. The high level MCP tool implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server. The simplest transport that the SDK supports is the stdio (stdio/stdout) transport — which connects a locally running process. Both the MCP client and MCP Server must be running in the same environment. The connection over stdio will look similar to this: ```python # Initialize FastMCP server mcp = FastMCP("Self-Hosted vLLM DevOps Agent") ``` #### Running the Python Code First- switch the directory with the Python version of the MCP sample code: ```shell ~/aisprintapr2026/self-hosted-vllm-devops-agent ``` Run the release version on the local system: ```make xbill@penguin:~/aisprintapr2026/self-hosted-vllm-devops-agent$ make install Processing ./. ``` The project can also be linted: ```make xbill@penguin:~/aisprintapr2026/self-hosted-vllm-devops-agent$ make lint ruff check . All checks passed! ruff format --check . 5 files already formatted mypy . Success: no issues found in 5 source files ``` And a test run: ```make xbill@penguin:~/aisprintapr2026/self-hosted-vllm-devops-agent$ make test python test_agent.py 📡 Automatically discovered vLLM at: https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app .......... ---------------------------------------------------------------------- Ran 10 tests in 0.029s OK ``` #### vLLM Interaction with MCP stdio Transport One of the key features that the MCP protocol provides is abstracting various transport methods. The high level tool MCP implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server. The simplest transport that the SDK supports is the stdio (stdio/stdout) transport — which connects a locally running process. Both the MCP client and MCP Server must be running in the same environment. In this project Gemini CLI is used as the MCP client to interact with the Python MCP server code. #### Gemini CLI settings.json Replace the default Gemini CLI configuration file — **settings.json** with a pre-configured sample: ```json { "mcpServers": { "vllm-devops-agent": { "command": "python3", "args": ["/home/xbill/aisprintapr2026/self-hosted-vllm-devops-agent/server.py"], "env": { "GOOGLE_CLOUD_PROJECT": "aisprint-491218", "GOOGLE_CLOUD_LOCATION": "us-east4", "VLLM_BASE_URL": "https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app", "MODEL_NAME": "/mnt/models/gemma-2b-it" } } } } ``` #### Validation with Gemini CLI The final connection test uses Gemini CLI as a MCP client with the Python code providing the MCP server: ```plaintext ▝▜▄ Gemini CLI v0.35.0 ▝▜▄ ▗▟▀ Signed in with Google: [email protected] /auth ▝▀ Plan: Gemini Code Assist in Google One AI Pro /upgrade ``` Run the environment script to configure the correct variables: ```plaintext > /mcp list Configured MCP servers: 🟢 vllm-devops-agent - Ready (15 tools, 1 resource) Tools: - mcp_vllm-devops-agent_analyze_cloud_logging - mcp_vllm-devops-agent_deploy_vllm - mcp_vllm-devops-agent_destroy_vllm - mcp_vllm-devops-agent_get_kaggle_model_copy_instructions - mcp_vllm-devops-agent_get_kagglehub_download_path - mcp_vllm-devops-agent_get_vertex_ai_model_copy_instructions - mcp_vllm-devops-agent_get_vllm_deployment_config - mcp_vllm-devops-agent_get_vllm_endpoint - mcp_vllm-devops-agent_get_vllm_tpu_deployment_config - mcp_vllm-devops-agent_list_bucket_models - mcp_vllm-devops-agent_list_vertex_models - mcp_vllm-devops-agent_query_vllm - mcp_vllm-devops-agent_status_vllm - mcp_vllm-devops-agent_suggest_sre_remediation - mcp_vllm-devops-agent_update_vllm_scaling Resources: - get_deployment_template (config://vllm-deployment-template) [text/plain] ``` #### vLLM Lifecycle Management via MCP The MCP tools provide a complete suite of agent-oriented operations for managing vLLM deployment on Cloud Run or a TPU. #### Overview of Tools ```markdown ## 🛠 Available Tools The following tools are available via the MCP server: ### Infrastructure Management * **`deploy_vllm`** : Deploys a new vLLM instance on Cloud Run GPU. * **`destroy_vllm`** : Deletes the Cloud Run vLLM service. * **`status_vllm`** : Checks the health and status of the vLLM service. * **`update_vllm_scaling`** : Adjusts the minimum and maximum instances for scaling. * **`get_vllm_deployment_config`** : Generates `gcloud` commands for deployment. ### Model Management * **`list_vertex_models`** : Lists models in the Vertex AI Registry. * **`list_bucket_models`** : Lists model artifacts in your GCS bucket. * **`get_vertex_ai_model_copy_instructions`** : Guide to transfer Gemma models from Vertex AI. * **`get_kaggle_model_copy_instructions`** : Guide to transfer Gemma models from Kaggle. ### AI & Operations * **`analyze_cloud_logging`** : Summarizes error logs using self-hosted vLLM. * **`suggest_sre_remediation`** : Provides 3-step plans for SRE incidents. * **`query_vllm`** : Sends direct prompts to the self-hosted Gemma model. * **`get_vllm_tpu_deployment_config`** : Generates GKE manifests for TPU v5e. ``` **mcp\_vllm-devops-agent\_deploy\_vllm** - deploy vLLM This tool handles the deployment of the LLM: ```markdown ✦ I will deploy the vLLM service using the default configuration (vllm-gemma-2b-it model from the aisprint-491218-bucket). ✦ I will check the status of the Cloud Run service using make status to determine if the deployment is still in progress or if it failed. ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ ✓ Shell make status [current working directory /home/xbill/aisprintapr2026/self-hosted-vllm-devops-agent] (Check if the deployment i… │ │ │ │ 🔍 Checking status of Cloud Run service vllm-gemma-2b-it in us-east4... │ │ gcloud run services describe vllm-gemma-2b-it \ │ │ --project=aisprint-491218 \ │ │ --region=us-east4 │ │ ERROR: (gcloud.run.services.describe) Cannot find service [vllm-gemma-2b-it] │ │ make: *** [Makefile:63: status] Error 1 │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ✦ I will list the available GCS buckets in the project to verify that the target bucket exists. ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ ✓ Shell gcloud storage buckets list --project=aisprint-491218 [current working directory /home/xbill/aisprintapr2026/self-hosted-vll… │ │ │ │ ... first 37 lines hidden (Ctrl+O to show) ... │ │ location_type: region │ │ metageneration: 1 │ │ name: aisprint-491218-bucket │ │ public_access_prevention: inherited │ │ satisfies_pzs: true │ │ soft_delete_policy: │ │ effectiveTime: '2026-03-25T00:31:13.502000+00:00' │ │ retentionDurationSeconds: '604800' │ │ storage_url: gs://aisprint-491218-bucket/ │ │ uniform_bucket_level_access: false │ │ update_time: 2026-03-25T00:31:13+0000 │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ✦ I will list the contents of the gemma-2b-it folder in the GCS bucket to ensure the model artifacts are present. ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ ✓ Shell gcloud storage ls gs://aisprint-491218-bucket/gemma-2b-it/ [current working directory /home/xbill/aisprintapr2026/self-hoste… │ │ │ │ gs://aisprint-491218-bucket/gemma-2b-it/config.json │ │ gs://aisprint-491218-bucket/gemma-2b-it/gemma-2b-it.gguf │ │ gs://aisprint-491218-bucket/gemma-2b-it/generation_config.json │ │ gs://aisprint-491218-bucket/gemma-2b-it/model-00001-of-00002.safetensors │ │ gs://aisprint-491218-bucket/gemma-2b-it/model-00002-of-00002.safetensors │ │ gs://aisprint-491218-bucket/gemma-2b-it/model.safetensors.index.json │ │ gs://aisprint-491218-bucket/gemma-2b-it/special_tokens_map.json │ │ gs://aisprint-491218-bucket/gemma-2b-it/tokenizer.json │ │ gs://aisprint-491218-bucket/gemma-2b-it/tokenizer.model │ │ gs://aisprint-491218-bucket/gemma-2b-it/tokenizer_config.json │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ✦ I will now run the deployment command manually using make deploy to observe the output and any potential error messages. ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ ✓ Shell make deploy [current working directory /home/xbill/aisprintapr2026/self-hosted-vllm-devops-agent] (Execute the deployment di… │ │ │ │ ... first 18 lines hidden (Ctrl+O to show) ... │ │ --add-volume-mount=volume=model-volume,mount-path=/mnt/models \ │ │ --args=--model=/mnt/models/gemma-2b-it,--max-model-len=4096,--trust-remote-code,--gpu-memory-utilization=0.9,--host=0.0.0.0 \ │ │ --no-allow-unauthenticated \ │ │ --region=us-east4 │ │ Deploying container to Cloud Run service [vllm-gemma-2b-it] in project [aisprint-491218] region [us-east4] │ │ ✓ Deploying new service... Done. │ │ ✓ Creating Revision... │ │ ✓ Routing traffic... │ │ Done. │ │ Service [vllm-gemma-2b-it] revision [vllm-gemma-2b-it-00001-zqc] has been deployed and is serving 100 percent of traffic. │ │ Service URL: https://vllm-gemma-2b-it-289270257791.us-east4.run.app │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ✦ I will now check the status of the newly deployed Cloud Run service using make status. ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ ✓ Shell make status [current working directory /home/xbill/aisprintapr2026/self-hosted-vllm-devops-agent] (Verify the status of the … │ │ │ │ ... first 35 lines hidden (Ctrl+O to show) ... │ │ Service account: [email protected] │ │ Concurrency: 4 │ │ Max instances: 1 │ │ Timeout: 3600s │ │ CPU Allocation: CPU is always allocated │ │ Execution Environment: Second Generation │ │ Volumes: │ │ model-volume │ │ type: cloud-storage │ │ bucket: aisprint-491218-bucket │ │ read-only: True │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ✦ The vLLM service vllm-gemma-2b-it has been successfully deployed and is now active at: URL: https://vllm-gemma-2b-it-289270257791.us-east4.run.app Summary of deployment: - Model: gemma-2b-it (loaded from gs://aisprint-491218-bucket/gemma-2b-it) - GPU: 1x NVIDIA L4 - Resources: 4 CPU, 16Gi Memory - Status: Active and serving 100% of traffic. You can now use the status_vllm tool (or make status) at any time to check its health and configuration. ? for shortcuts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ``` #### mcp\_vllm-devops-agent\_destroy\_vllm This tool handles clean-up and de-allocation of the vLLM. #### **mcp\_vllm-devops-agent\_get\_kaggle\_model\_copy\_instructions mcp\_vllm-devops-agent\_get\_kagglehub\_download\_path mcp\_vllm-devops-agent\_get\_vertex\_ai\_model\_copy\_instructions mcp\_vllm-devops-agent\_list\_bucket\_models mcp\_vllm-devops-agent\_list\_vertex\_models** These Tools provide instructions on downloading Model files and available Models. #### **mcp\_vllm-devops-agent\_get\_vllm\_deployment\_config** This tool generates the deployment parameters. ```shell > mcp_vllm-devops-agent_get_vllm_deployment_config ✦ I will generate the deployment configuration for vLLM on Cloud Run using the default settings. ✦ The deployment command for the vllm-gemma-2b-it service on Cloud Run is: 1 gcloud beta run deploy vllm-gemma-2b-it \ 2 --image=vllm/vllm-openai:latest \ 3 --gpu=1 \ 4 --gpu-type=nvidia-l4 \ 5 --no-gpu-zonal-redundancy \ 6 --no-cpu-throttling \ 7 --concurrency=4 \ 8 --timeout=3600 \ 9 --startup-probe=timeoutSeconds=60,periodSeconds=60,failureThreshold=10,initialDelaySeconds=180,httpGet.port=8000,httpGet.path=/health \ 10 --max-instances=1 \ 11 --min-instances=0 \ 12 --port=8000 \ 13 --memory=16Gi \ 14 --cpu=4 \ 15 --execution-environment=gen2 \ 16 --add-volume=name=model-volume,type=cloud-storage,bucket=aisprint-bucket,readonly=true \ 17 --add-volume-mount=volume=model-volume,mount-path=/mnt/models \ 18 --args=--model=/mnt/models/gemma-2b-it,--max-model-len=4096,--trust-remote-code,--gpu-memory-utilization=0.9,--host=0.0.0.0 \ 19 --no-allow-unauthenticated \ 20 --region=us-east4 ``` #### **mcp\_vllm-devops-agent\_get\_vllm\_endpoint** This tool gets the endpoint for the vLLM deployment. ```plaintext > get_vllm_endpoint ✦ I will retrieve the current active vLLM endpoint URL for the vllm-gemma-2b-it service. ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ ✓ get_vllm_endpoint (vllm-devops-agent MCP Server) {"service_name":"vllm-gemma-2b-it"} │ │ │ │ https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ✦ The current active vLLM endpoint URL is: https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app ? for shortcuts ``` #### **mcp\_vllm-devops-agent\_status\_vllm** This tool checks the status of the LLM. ```plaintext > agent_status_vllm ✦ I will check the status of the vllm-gemma-2b-it Cloud Run service. ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ ✓ status_vllm (vllm-devops-agent MCP Server) {"service_name":"vllm-gemma-2b-it"} │ │ │ │ Status for vllm-gemma-2b-it: │ │ status: │ │ conditions: │ │ - lastTransitionTime: '2026-03-25T22:31:53.860782Z' │ │ status: 'True' │ │ type: Ready │ │ - lastTransitionTime: '2026-03-25T19:00:51.414639Z' │ │ status: 'True' │ │ type: ConfigurationsReady │ │ - lastTransitionTime: '2026-03-25T22:31:53.832098Z' │ │ status: 'True' │ │ type: RoutesReady │ │ latestCreatedRevisionName: vllm-gemma-2b-it-00001-zqc │ │ url: https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app │ │ │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ✦ The vllm-gemma-2b-it service is Ready. - URL: https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app - Latest Revision: vllm-gemma-2b-it-00001-zqc - Status: All conditions (Ready, ConfigurationsReady, RoutesReady) are True. ``` #### **mcp\_vllm-devops-agent\_query\_vllm** This tool formats and queries the deployed LLM. ```plaintext > mcp_vllm-devops-agent_query_vllm why is the sky blue? ✦ I will query the vLLM model to explain why the sky is blue. ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ ✓ query_vllm (vllm-devops-agent MCP Server) {"prompt":"why is the sky blue?"} │ │ │ │ vLLM Response │ │ │ │ The sky appears blue due to Rayleigh scattering. Rayleigh scattering is the scattering of light waves by particles of a similar size │ │ to the wavelength of light. This means that blue light, with a shorter wavelength, is scattered more strongly than longer wavelengths. │ │ │ │ The blue light scattered by the air molecules is scattered in all directions, but it is scattered more strongly in the direction of │ │ the sun. This is because the sun's light has a longer path to travel through the air than the longer wavelengths of light. │ │ │ │ As a result, the sky appears blue. │ │ │ │ Rayleigh scattering is a very important phenomenon in the atmosphere. It is responsible for the blue color of the sky, as well as for │ │ the blue colors of many other objects, such as the ocean and the eyes of birds. │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ✦ The vLLM model explains that the sky appears blue due to Rayleigh scattering, where air molecules scatter shorter wavelengths of light (like blue) more strongly in all directions than longer wavelengths. ``` #### **mcp\_vllm-devops-agent\_analyze\_cloud\_logging** This tool provides visibility into the model logs. ```markdown ✦ I have completed the log analysis for your self-hosted vLLM deployment. 📋 Cloud Logging Analysis Summary Recent logs revealed three primary categories of events: 1. Resolved Container Startup Failures (Port Misconfiguration): Earlier logs from 2026-03-25T02:38Z show that the vllm-gemma-2b-it service failed to start because it was trying to listen on the default Cloud Run port (8080), while the vLLM container is configured to use port 8000. This caused the default startup TCP probe to fail with DEADLINE_EXCEEDED. 2. Idempotency Errors (Bucket Creation): Several ERROR logs from 16:28Z and 18:25Z indicate that automated scripts attempted to create the GCS bucket aisprint-491218-bucket when it already existed. These errors are non-critical and purely informational regarding resource idempotency. 3. Current Deployment Status (Healthy): The service was successfully redeployed at 22:31Z (Revision: vllm-gemma-2b-it-00001-zqc). It is now correctly configured to use: * Port: 8000 * Health Check: /health on port 8000 * Scaling: min-instances: 1 (Warm GPU instance) 🚀 Verification I have verified that the endpoint is functional: - Status: Ready - URL: https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app - Test Query: Successfully responded with "The capital of France is Paris." The system is currently stable and performing as expected. ``` #### **mcp\_vllm-devops-agent\_suggest\_sre\_remediation** This tool reviews the logs and provides suggestions on fixing deployment issues. ```markdown ✦ I have successfully used the suggest_sre_remediation tool to analyze the container startup failure (PORT=8080 timeout error). 🩺 SRE Remediation Plan for Container Startup Failure 1. Root Cause Identification * Log Correlation: Check vllm-gemma-2b-it logs for a mismatch between the container's listening port and the Cloud Run default port. * Port Check: Confirm if vLLM is using its default port (8000) while the Cloud Run environment expects port 8080. * Timeout Check: Verify if the startup-probe timeout is sufficient for the vLLM model (Gemma-2b-it) to load into GPU memory. 2. Corrective Actions * Port Alignment: Update the Cloud Run service configuration to listen on port 8000 or explicitly set the --port argument in the vLLM command. * Timeout Extension: Increase the initialDelaySeconds and timeoutSeconds for the startupProbe to account for large model weight loading from GCS FUSE. * Scaling Check: Ensure that gpu-memory-utilization is not set too high (default 0.9), as this can cause OOM errors during initialization. 3. Verification & Validation * Redeploy: Apply the corrected configuration using gcloud run services update. * Health Probe: Verify the /health endpoint returns 200 OK on port 8000. * Uptime Monitoring: Confirm the service stays Ready after the initial 180s delay. The service is currently healthy after the latest redeployment with the corrected port settings. ``` #### **mcp\_vllm-devops-agent\_update\_vllm\_scaling** This tool allows model scaling. ```markdown > mcp_vllm-devops-agent_update_vllm_scaling ✦ I can update the scaling configuration for your vLLM service. Currently, it is set to min-instances: 1 and max-instances: 1. What would you like to set the new limits to? * Min Instances: (e.g., 0 to save costs when idle, or 1 to avoid cold starts) * Max Instances: (e.g., 1 to stay within GPU quota, or higher if available) ``` #### **mcp\_vllm-devops-agent\_get\_vllm\_tpu\_deployment\_config** This MCP Tool generates the required deployment settings to migrate the vLLM from Cloud Run with TPU to a dedicated TPU: ```yaml ✦ 🌀 vLLM on TPU v5e (GKE Deployment) To deploy vLLM on TPUs, use the following GKE manifest. This configuration targets a TPU v5e-8 (8 chips), which is ideal for Gemma 2 9B or 27B. 1. Create a TPU Node Pool (if not exists) 1 gcloud container node-pools create tpu-v5e-8 \ 2 --cluster=tpu-cluster \ 3 --location=us-east4 \ 4 --machine-type=ct5lp-hightpu-4t \ 5 --tpu-topology=2x4 \ 6 --num-nodes=1 2. Kubernetes Manifest (vllm-tpu.yaml) 1 apiVersion: apps/v1 2 kind: Deployment 3 metadata: 4 name: vllm-tpu 5 spec: 6 replicas: 1 7 selector: 8 matchLabels: 9 app: vllm-tpu 10 template: 11 metadata: 12 labels: 13 app: vllm-tpu 14 spec: 15 nodeSelector: 16 cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice 17 cloud.google.com/gke-tpu-topology: 2x4 18 containers: 19 - name: vllm-tpu 20 image: vllm/vllm-tpu:latest 21 resources: 22 limits: 23 google.com/tpu: "8" 24 requests: 25 google.com/tpu: "8" 26 env: 27 - name: VLLM_XLA_CACHE_PATH 28 value: "/tmp/vllm_xla_cache" 29 command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] 30 args: 31 - "--model=google/gemma-2-9b-it" 32 - "--tensor-parallel-size=8" 33 - "--max-model-len=8192" 34 ports: 35 - containerPort: 8000 36 volumeMounts: 37 - name: dshm 38 mountPath: /dev/shm 39 volumes: 40 - name: dshm 41 emptyDir: 42 medium: Memory 43 --- 44 apiVersion: v1 45 kind: Service 46 metadata: 47 name: vllm-tpu-service 48 spec: 49 selector: 50 app: vllm-tpu 51 ports: 52 - protocol: TCP 53 port: 80 54 targetPort: 8000 55 type: ClusterIP 3. Deployment Steps 1. Save the YAML above to vllm-tpu.yaml. 2. Apply it: kubectl apply -f vllm-tpu.yaml. 3. (Optional) If using a private model, ensure a Hugging Face token is provided via secret. ? for shortcuts ``` #### Running the Tools Stand-Alone A sample Python script is also provided to run the tools directly: ```python xbill@penguin:~/aisprintapr2026/self-hosted-vllm-devops-agent$ python demo_launcher.py 2026-03-25 20:41:57,541 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server... 🚀 TPU Sprint Demo: Self-Hosted vLLM DevOps Agent ============================================================ [Step 1] Analyzing Cloud Logging errors (severity=ERROR)... 2026-03-25 20:42:01,520 - vllm-devops-agent - INFO - Attempting to discover vLLM URL for service: vllm-gemma-2b-it 2026-03-25 20:42:02,432 - vllm-devops-agent - INFO - 📡 Automatically discovered vLLM at: https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app ANALYSIS: ### Log Analysis (Self-Hosted vLLM) - The first log indicates a failed startup of a container due to a missing port for the container. - The second log indicates a failed startup of a container due... [Step 2] Proposing remediation for 'MemoryLimitExceeded'... REMEDIATION: ### Remediation Plan **Step 1: Investigate the OOMKilled error** * Review the logs for the pod that crashed. * Check the resource usage of the pod, including memory, CPU, and network. * Use tools like `kubectl top pods` or `kubectl top pods --all-namespaces` to monitor resource usage. * Identify any bottlenecks or resource requests that may have caused the OOMKilled error. **Step 2: Identify the root cause of the OOMKilled error** * Analyze the logs and resource usage data to identify the specific cause of the OOMKilled error. * This could be due to a memory leak, a resource starvation issue, or a combination of factors. **Step 3: Implement a solution to address the root cause** * Once you have identified the root cause, implement a solution to address it. * This could involve increasing the available memory for the pod, reducing the resource requests, or optimizing the application code. **Additional Considerations:** * It is important to take a proactive approach to prevent OOMKilled errors. * Consider using resource monitoring tools to continuously monitor resource usage and identify potential bottlenecks. * Implement a monitoring system to alert on OOMKilled events and take corrective actions. * Review the application code and ensure that it is efficient and does not allocate more resources than necessary. [Step 3] Vertex AI Model Garden Instructions... ### 🚀 Transferring gemma-2b-it from Vertex AI Model Garden To use vLLM with Cloud Storage FUSE without Hugging Face, follow these steps: 1. **Accept Terms:** Go to the Vertex AI Model Garden page for Gemma (https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/335) and click 'Accept' on the license agreement. 2. **Download via Signed URL:** After accepting, the console provides a 'Download' button or a signed URL. 3. **Transfer to GCS:** If you have the artifacts locally after downloading from the console, use: `gcloud storage cp -r ./model_artifacts/* gs://aisprint-491218-bucket/gemma-2b-it/` 4. **Alternative (Direct GCS Copy):** Google occasionally provides a managed GCS path for verified projects. If accessible, you can use: `gcloud storage cp -r gs://vertex-ai-models/gemma/gemma-2b-it/* gs://aisprint-491218-bucket/gemma-2b-it/` Once the artifacts are in your bucket, use `get_vllm_deployment_config` to generate your Cloud Run deployment command. [Step 4] Generating Cloud Run GPU Deployment Config (with GCS FUSE)... COMMAND: gcloud beta run deploy vllm-sre-agent --image=vllm/vllm-openai:latest --gpu=1 --gpu-type=nvidia-l4 --no-gpu-zonal-redundancy --no-cpu-throttling --concurrency=4 --timeout=3600 --startup-probe=timeoutSeconds=60,periodSeconds=60,failureThreshold=10,initialDelaySeconds=180,httpGet.port=8000,httpGet.path=/health --max-instances=1 --min-instances=0 --port=8000 --memory=16Gi --cpu=4 --execution-environment=gen2 --add-volume=name=model-volume,type=cloud-storage,bucket=my-gemma-bucket,readonly=true --add-volume-mount=volume=model-volume,mount-path=/mnt/models --args=--model=/mnt/models/gemma-2b-it,--max-model-len=4096,--trust-remote-code,--gpu-memory-utilization=0.9,--host=0.0.0.0 --no-allow-unauthenticated --region=us-east4 [Step 5] Listing available Vertex AI Models (using ADK/SDK)... No models found in Vertex AI Model Registry.... [Step 6] Reading MCP Resource (vLLM Deployment Template)... TEMPLATE (first 100 chars): # Cloud Run vLLM Deployment Template # Required: Second Generation execution environment # Required:... ============================================================ ✅ DevOps Agent Demo Complete: Self-hosted SRE intelligence ready! xbill@penguin:~/aisprintapr2026/self-hosted-vllm-devops-agent$ ``` #### One Last Review with Gemini CLI ```markdown > summarize the project ✦ This project is an MCP (Model Context Protocol) Server designed as an automated DevOps/SRE assistant. It leverages self-hosted Gemma models running on vLLM (Cloud Run GPU) to analyze Google Cloud infrastructure logs and provide actionable remediation strategies. Core Capabilities * Log Intelligence: Automatically fetches and summarizes Cloud Logging errors (severity=ERROR) to identify root causes using private LLM inference. * Incident Response: Generates expert 3-step remediation plans for specific SRE incidents (e.g., OOMKilled, connectivity issues). * Infrastructure Lifecycle: Provides tools to deploy, scale, and destroy vLLM inference stacks on Cloud Run using NVIDIA L4 GPUs and GCS FUSE. * Model Portability: Includes automated workflows to transfer Gemma weights from Vertex AI Model Garden or Kaggle to private GCS buckets for use with vLLM. Technical Architecture * Inference Stack: vLLM running on Cloud Run (gen2) with 16GiB RAM and 4 vCPUs. It uses GCS FUSE to mount model weights directly from Cloud Storage. * Agent Logic: Built with FastMCP, integrating google-cloud-logging for telemetry and the Vertex AI SDK (ADK) for model registry management. * Automation: A comprehensive Makefile handles environment setup, GPU deployment, and interactive model querying. The system is optimized for Private AI Operations, ensuring that infrastructure logs are analyzed by a private model endpoint rather than being sent to a public API. ``` #### Summary The strategy for using MCP for vLLM Devops across Cloud Run and TPU with Gemini CLI was validated with a incremental step by step approach. A minimal stdio transport MCP Server was started from Python source code and validated with Gemini CLI running as a MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot vLLM deployments. This approach can be extended to more complex deployments using other MCP transports and Cloud based options. * * *

Gemma-SRE: Self-Hosted vLLM Infrastructure Agent

Tags

Comments

More Blog

Skills over System Prompts: Building an Anki Tutor with the Antigravity SDK

Congrats to the Hermes Agent Challenge Winners!

Firebase Midsommer Madnesss with Antigravity CLI

I'm not a developer, but I built a calendar app to fix my most annoying work task

Congrats to the Gemma 4 Challenge Winners!

Building an agentic PR reviewer with Antigravity SDK