Kubernetes v4.33.0, Jul 7 26

Viewing docs for Kubernetes v4.33.0
published on Tuesday, Jul 7, 2026 by Pulumi

pulumi/pulumi-kubernetes

Self-Host Gemma 4 with Open WebUI and Tailscale

Viewing docs for Kubernetes v4.33.0
published on Tuesday, Jul 7, 2026 by Pulumi

Schema (JSON)

pulumi/pulumi-kubernetes

Prerequisites

Install Pulumi.
Install Python 3.9 or later.
Install kubectl, k3d, and llama.cpp. On macOS with Homebrew, install llama.cpp with brew install llama.cpp.
To expose Open WebUI through Tailscale, sign in to Tailscale and create OAuth client credentials that can create auth keys.
Make sure your machine has enough memory for the selected GGUF model.

Deploy the App

Step 1: Start a local Kubernetes cluster

Create a k3d cluster that lets pods reach services running on the host:

k3d cluster create pulumi-gemma4 \
  --api-port 6550 \
  --agents 1 \
  --port "30000:30000@loadbalancer" \
  --host-alias "host.k3d.internal:host-gateway"

Step 2: Start llama.cpp on the host

Run the OpenAI-compatible llama.cpp server on the host. The example defaults expect port 18080 because 8080 is commonly used by other local services. Build llama.cpp from source so the server can load the Gemma 4 12B multimodal projector:

brew install cmake git

llm_home="$HOME/pulumi-gemma4-llm"
mkdir -p "$llm_home/models" "$llm_home/logs"

if [ ! -d "$llm_home/llama.cpp/.git" ]; then
  git clone --depth 1 https://github.com/ggml-org/llama.cpp.git "$llm_home/llama.cpp"
fi

cmake -S "$llm_home/llama.cpp" \
  -B "$llm_home/llama.cpp/build" \
  -DGGML_METAL=ON \
  -DGGML_BLAS=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build "$llm_home/llama.cpp/build" --target llama-server -j 10

curl -L --fail \
  --output "$llm_home/models/mmproj-F16.gguf" \
  https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/resolve/main/mmproj-F16.gguf

Then start the server:

"$HOME/pulumi-gemma4-llm/llama.cpp/build/bin/llama-server" \
  --hf-repo unsloth/gemma-4-12b-it-GGUF \
  --hf-file gemma-4-12b-it-Q8_0.gguf \
  --mmproj "$HOME/pulumi-gemma4-llm/models/mmproj-F16.gguf" \
  --host 127.0.0.1 \
  --port 18080 \
  --ctx-size 131072 \
  --parallel 1 \
  --jinja \
  --reasoning off

Check that the server is available:

curl http://127.0.0.1:18080/v1/models

With --mmproj, /v1/models should report capabilities: ["completion","multimodal"]. In local validation, Open WebUI accepted an uploaded image and Gemma 4 described it correctly. A small WAV file also worked through the OpenAI-compatible input_audio request shape, though llama.cpp logs still mark audio input as experimental.

To keep llama.cpp running after reboot, put the startup script and logs under your home directory and register a launchd agent:

llm_home="$HOME/pulumi-gemma4-llm"
mkdir -p "$llm_home/logs" "$HOME/Library/LaunchAgents"

cat > "$llm_home/start-llama-server.sh" <<'EOF'
#!/bin/zsh
set -euo pipefail

export PATH="/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"

exec "$HOME/pulumi-gemma4-llm/llama.cpp/build/bin/llama-server" \
  --hf-repo unsloth/gemma-4-12b-it-GGUF \
  --hf-file gemma-4-12b-it-Q8_0.gguf \
  --mmproj "$HOME/pulumi-gemma4-llm/models/mmproj-F16.gguf" \
  --host 127.0.0.1 \
  --port 18080 \
  --ctx-size 131072 \
  --parallel 1 \
  --jinja \
  --reasoning off
EOF

chmod +x "$llm_home/start-llama-server.sh"

cat > "$HOME/Library/LaunchAgents/com.pulumi.gemma4.llama-server.plist" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.pulumi.gemma4.llama-server</string>
  <key>ProgramArguments</key>
  <array>
    <string>$llm_home/start-llama-server.sh</string>
  </array>
  <key>WorkingDirectory</key>
  <string>$llm_home</string>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>$llm_home/logs/llama-server.out.log</string>
  <key>StandardErrorPath</key>
  <string>$llm_home/logs/llama-server.err.log</string>
</dict>
</plist>
EOF

launchctl bootout gui/$(id -u)/com.pulumi.gemma4.llama-server 2>/dev/null || true
launchctl bootstrap gui/$(id -u) "$HOME/Library/LaunchAgents/com.pulumi.gemma4.llama-server.plist"
launchctl kickstart -k gui/$(id -u)/com.pulumi.gemma4.llama-server

Check the service and logs:

launchctl print gui/$(id -u)/com.pulumi.gemma4.llama-server
tail -f "$HOME/pulumi-gemma4-llm/logs/llama-server.err.log"

Unload the service when you no longer want it to run in the background:

launchctl bootout gui/$(id -u)/com.pulumi.gemma4.llama-server

Step 3: Install Python dependencies

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Step 4: Configure Pulumi

Create a stack:

pulumi stack init dev

Tailscale exposure is optional. To enable it, set enableTailscale and provider credentials. Use either an API key or OAuth credentials supported by the Pulumi Tailscale provider:

pulumi config set enableTailscale true
pulumi config set tailscale:oauthClientId <client-id>
pulumi config set tailscale:oauthClientSecret <client-secret> --secret

If your llama.cpp server uses a different host or port, update the host runtime settings:

pulumi config set hostLlmHostname host.k3d.internal
pulumi config set hostLlmPort 18080

Step 5: Deploy Open WebUI

pulumi up

Pulumi exports the Open WebUI NodePort URL and the internal LLM base URL. When enableTailscale is true, it also exports the Tailscale URL for the web UI.

Cluster Runtime

The default runtimeMode is host, which keeps model inference on the host. Linux GPU hosts can run llama.cpp inside Kubernetes instead:

pulumi config set runtimeMode cluster
pulumi config set llmBaseUrl http://llm-server:8080/v1
pulumi config set gpuVendor nvidia
pulumi config set gpuCount 1
pulumi up

Cluster mode downloads the configured GGUF into a persistent volume and runs llama.cpp with CUDA or ROCm images.

Clean Up

Destroy the Pulumi stack:

pulumi destroy
pulumi stack rm

Delete the local k3d cluster:

k3d cluster delete pulumi-gemma4

Stop the local llama-server process when you are done.

Summary

You now have Open WebUI running in Kubernetes and Gemma 4 running through host-native llama.cpp. When enableTailscale is true, Pulumi also manages secure remote access through Tailscale.

Viewing docs for Kubernetes v4.33.0
published on Tuesday, Jul 7, 2026 by Pulumi

Schema (JSON)

pulumi/pulumi-kubernetes

Self-Host Gemma 4 with Open WebUI and Tailscale

On this page

On this page

Prerequisites

Deploy the App

Step 1: Start a local Kubernetes cluster

Step 2: Start llama.cpp on the host

Step 3: Install Python dependencies

Step 4: Configure Pulumi

Step 5: Deploy Open WebUI

Cluster Runtime

Clean Up

Summary

On this page

On this page

Try Pulumi Cloud free.
Your team will thank you.

Self-Host Gemma 4 with Open WebUI and Tailscale

On this page

On this page

Prerequisites

Deploy the App

Step 1: Start a local Kubernetes cluster

Step 2: Start llama.cpp on the host

Step 3: Install Python dependencies

Step 4: Configure Pulumi

Step 5: Deploy Open WebUI

Cluster Runtime

Clean Up

Summary

On this page

On this page

Try Pulumi Cloud free.Your team will thank you.

Try Pulumi Cloud free.
Your team will thank you.