1. Packages
  2. Packages
  3. Kubernetes
  4. How-to Guides
  5. Self-Host Gemma 4 with Open WebUI and Tailscale
Viewing docs for Kubernetes v4.32.0
published on Monday, Jun 8, 2026 by Pulumi

Self-Host Gemma 4 with Open WebUI and Tailscale

kubernetes logo
Viewing docs for Kubernetes v4.32.0
published on Monday, Jun 8, 2026 by Pulumi

    View Code Deploy this example with Pulumi

    This example deploys Open WebUI to Kubernetes, connects it to a local llama.cpp server running Gemma 4, and can expose the web UI through Tailscale. It is designed for a Mac or workstation where host-native inference is faster and simpler than running the model inside the Kubernetes cluster.

    The default model is unsloth/gemma-4-12b-it-GGUF with gemma-4-12b-it-Q8_0.gguf. The default runtime uses the host machine for inference and k3d for Open WebUI. Tailscale exposure is opt-in so you can preview and deploy the local path before configuring Tailscale credentials.

    Prerequisites

    1. Install Pulumi.
    2. Install Python 3.9 or later.
    3. Install kubectl, k3d, and llama.cpp. On macOS with Homebrew, install llama.cpp with brew install llama.cpp.
    4. To expose Open WebUI through Tailscale, sign in to Tailscale and create OAuth client credentials that can create auth keys.
    5. Make sure your machine has enough memory for the selected GGUF model.

    Deploy the App

    Step 1: Start a local Kubernetes cluster

    Create a k3d cluster that lets pods reach services running on the host:

    k3d cluster create pulumi-gemma4 \
      --api-port 6550 \
      --agents 1 \
      --port "30000:30000@loadbalancer" \
      --host-alias "host.k3d.internal:host-gateway"
    

    Step 2: Start llama.cpp on the host

    Run the OpenAI-compatible llama.cpp server on the host. The example defaults expect port 18080 because 8080 is commonly used by other local services. Build llama.cpp from source so the server can load the Gemma 4 12B multimodal projector:

    brew install cmake git
    
    llm_home="$HOME/pulumi-gemma4-llm"
    mkdir -p "$llm_home/models" "$llm_home/logs"
    
    if [ ! -d "$llm_home/llama.cpp/.git" ]; then
      git clone --depth 1 https://github.com/ggml-org/llama.cpp.git "$llm_home/llama.cpp"
    fi
    
    cmake -S "$llm_home/llama.cpp" \
      -B "$llm_home/llama.cpp/build" \
      -DGGML_METAL=ON \
      -DGGML_BLAS=ON \
      -DCMAKE_BUILD_TYPE=Release
    
    cmake --build "$llm_home/llama.cpp/build" --target llama-server -j 10
    
    curl -L --fail \
      --output "$llm_home/models/mmproj-F16.gguf" \
      https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/resolve/main/mmproj-F16.gguf
    

    Then start the server:

    "$HOME/pulumi-gemma4-llm/llama.cpp/build/bin/llama-server" \
      --hf-repo unsloth/gemma-4-12b-it-GGUF \
      --hf-file gemma-4-12b-it-Q8_0.gguf \
      --mmproj "$HOME/pulumi-gemma4-llm/models/mmproj-F16.gguf" \
      --host 127.0.0.1 \
      --port 18080 \
      --ctx-size 131072 \
      --parallel 1 \
      --jinja \
      --reasoning off
    

    Check that the server is available:

    curl http://127.0.0.1:18080/v1/models
    

    With --mmproj, /v1/models should report capabilities: ["completion","multimodal"]. In local validation, Open WebUI accepted an uploaded image and Gemma 4 described it correctly. A small WAV file also worked through the OpenAI-compatible input_audio request shape, though llama.cpp logs still mark audio input as experimental.

    To keep llama.cpp running after reboot, put the startup script and logs under your home directory and register a launchd agent:

    llm_home="$HOME/pulumi-gemma4-llm"
    mkdir -p "$llm_home/logs" "$HOME/Library/LaunchAgents"
    
    cat > "$llm_home/start-llama-server.sh" <<'EOF'
    #!/bin/zsh
    set -euo pipefail
    
    export PATH="/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
    
    exec "$HOME/pulumi-gemma4-llm/llama.cpp/build/bin/llama-server" \
      --hf-repo unsloth/gemma-4-12b-it-GGUF \
      --hf-file gemma-4-12b-it-Q8_0.gguf \
      --mmproj "$HOME/pulumi-gemma4-llm/models/mmproj-F16.gguf" \
      --host 127.0.0.1 \
      --port 18080 \
      --ctx-size 131072 \
      --parallel 1 \
      --jinja \
      --reasoning off
    EOF
    
    chmod +x "$llm_home/start-llama-server.sh"
    
    cat > "$HOME/Library/LaunchAgents/com.pulumi.gemma4.llama-server.plist" <<EOF
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
      <key>Label</key>
      <string>com.pulumi.gemma4.llama-server</string>
      <key>ProgramArguments</key>
      <array>
        <string>$llm_home/start-llama-server.sh</string>
      </array>
      <key>WorkingDirectory</key>
      <string>$llm_home</string>
      <key>RunAtLoad</key>
      <true/>
      <key>KeepAlive</key>
      <true/>
      <key>StandardOutPath</key>
      <string>$llm_home/logs/llama-server.out.log</string>
      <key>StandardErrorPath</key>
      <string>$llm_home/logs/llama-server.err.log</string>
    </dict>
    </plist>
    EOF
    
    launchctl bootout gui/$(id -u)/com.pulumi.gemma4.llama-server 2>/dev/null || true
    launchctl bootstrap gui/$(id -u) "$HOME/Library/LaunchAgents/com.pulumi.gemma4.llama-server.plist"
    launchctl kickstart -k gui/$(id -u)/com.pulumi.gemma4.llama-server
    

    Check the service and logs:

    launchctl print gui/$(id -u)/com.pulumi.gemma4.llama-server
    tail -f "$HOME/pulumi-gemma4-llm/logs/llama-server.err.log"
    

    Unload the service when you no longer want it to run in the background:

    launchctl bootout gui/$(id -u)/com.pulumi.gemma4.llama-server
    

    Step 3: Install Python dependencies

    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    

    Step 4: Configure Pulumi

    Create a stack:

    pulumi stack init dev
    

    Tailscale exposure is optional. To enable it, set enableTailscale and provider credentials. Use either an API key or OAuth credentials supported by the Pulumi Tailscale provider:

    pulumi config set enableTailscale true
    pulumi config set tailscale:oauthClientId <client-id>
    pulumi config set tailscale:oauthClientSecret <client-secret> --secret
    

    If your llama.cpp server uses a different host or port, update the host runtime settings:

    pulumi config set hostLlmHostname host.k3d.internal
    pulumi config set hostLlmPort 18080
    

    Step 5: Deploy Open WebUI

    pulumi up
    

    Pulumi exports the Open WebUI NodePort URL and the internal LLM base URL. When enableTailscale is true, it also exports the Tailscale URL for the web UI.

    Cluster Runtime

    The default runtimeMode is host, which keeps model inference on the host. Linux GPU hosts can run llama.cpp inside Kubernetes instead:

    pulumi config set runtimeMode cluster
    pulumi config set llmBaseUrl http://llm-server:8080/v1
    pulumi config set gpuVendor nvidia
    pulumi config set gpuCount 1
    pulumi up
    

    Cluster mode downloads the configured GGUF into a persistent volume and runs llama.cpp with CUDA or ROCm images.

    Clean Up

    Destroy the Pulumi stack:

    pulumi destroy
    pulumi stack rm
    

    Delete the local k3d cluster:

    k3d cluster delete pulumi-gemma4
    

    Stop the local llama-server process when you are done.

    Summary

    You now have Open WebUI running in Kubernetes and Gemma 4 running through host-native llama.cpp. When enableTailscale is true, Pulumi also manages secure remote access through Tailscale.

    kubernetes logo
    Viewing docs for Kubernetes v4.32.0
    published on Monday, Jun 8, 2026 by Pulumi

      Try Pulumi Cloud free.
      Your team will thank you.

      Start free trial