How do I monitor GPU utilization for machine learning pods in Kubernetes?
In this guide, we will set up monitoring for GPU utilization in a Kubernetes cluster running machine learning workloads. We will use NVIDIA’s DCGM (Data Center GPU Manager) exporter to collect GPU metrics and Prometheus to scrape these metrics. Finally, we will visualize the metrics using Grafana.
Key Points:
- Deploy NVIDIA DCGM exporter to collect GPU metrics.
- Set up Prometheus to scrape GPU metrics.
- Deploy Grafana for visualization.
import * as pulumi from "@pulumi/pulumi";
import * as k8s from "@pulumi/kubernetes";
// Create a namespace for monitoring components
const monitoringNamespace = new k8s.core.v1.Namespace("monitoring", {
    metadata: { name: "monitoring" },
});
// Deploy NVIDIA DCGM exporter
const dcgmExporter = new k8s.apps.v1.Deployment("dcgm-exporter", {
    metadata: {
        namespace: monitoringNamespace.metadata.name,
        labels: { app: "dcgm-exporter" },
    },
    spec: {
        replicas: 1,
        selector: { matchLabels: { app: "dcgm-exporter" } },
        template: {
            metadata: { labels: { app: "dcgm-exporter" } },
            spec: {
                containers: [{
                    name: "dcgm-exporter",
                    image: "nvidia/dcgm-exporter:2.1.7-2.4.7-ubuntu20.04",
                    ports: [{ containerPort: 9400 }],
                    resources: {
                        // Request a GPU so the exporter lands on a node managed by the NVIDIA device plugin.
                        limits: { "nvidia.com/gpu": "1" },
                    },
                }],
                // Pin the exporter to the GPU node; replace "gpu-node" with your node's hostname.
                // On clusters with several GPU nodes, running the exporter as a DaemonSet is the more common pattern.
                nodeSelector: { "kubernetes.io/hostname": "gpu-node" },
            },
        },
    },
});
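// The Prometheus scrape config below targets dcgm-exporter.monitoring.svc.cluster.local:9400, so the
// exporter pods need a Service with that exact name; a minimal ClusterIP Service like the one sketched
// here covers that (the explicit metadata.name avoids Pulumi's auto-naming suffix).
const dcgmExporterService = new k8s.core.v1.Service("dcgm-exporter-service", {
    metadata: {
        name: "dcgm-exporter",
        namespace: monitoringNamespace.metadata.name,
        labels: { app: "dcgm-exporter" },
    },
    spec: {
        ports: [{ port: 9400, targetPort: 9400 }],
        selector: { app: "dcgm-exporter" },
    },
});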
// Deploy Prometheus
const prometheus = new k8s.apps.v1.Deployment("prometheus", {
    metadata: {
        namespace: monitoringNamespace.metadata.name,
        labels: { app: "prometheus" },
    },
    spec: {
        replicas: 1,
        selector: { matchLabels: { app: "prometheus" } },
        template: {
            metadata: { labels: { app: "prometheus" } },
            spec: {
                containers: [{
                    name: "prometheus",
                    image: "prom/prometheus:v2.26.0",
                    ports: [{ containerPort: 9090 }],
                    args: [
                        "--config.file=/etc/prometheus/prometheus.yml",
                        "--storage.tsdb.path=/prometheus",
                        "--web.console.libraries=/usr/share/prometheus/console_libraries",
                        "--web.console.templates=/usr/share/prometheus/consoles",
                    ],
                    volumeMounts: [{
                        name: "config-volume",
                        mountPath: "/etc/prometheus",
                    }],
                }],
                volumes: [{
                    name: "config-volume",
                    configMap: { name: "prometheus-config" },
                }],
            },
        },
    },
});
// Prometheus config
const prometheusConfig = new k8s.core.v1.ConfigMap("prometheus-config", {
    metadata: {
        // The Deployment above mounts a ConfigMap literally named "prometheus-config", so set the
        // name explicitly rather than letting Pulumi auto-name the resource.
        name: "prometheus-config",
        namespace: monitoringNamespace.metadata.name,
    },
    data: {
        "prometheus.yml": `
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'dcgm'
    static_configs:
      - targets: ['dcgm-exporter.monitoring.svc.cluster.local:9400']
        labels:
          group: 'gpu'
`,
    },
});
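// Grafana needs a stable address for its Prometheus data source; one simple option, assumed here, is a
// ClusterIP Service named "prometheus" (reachable at prometheus.monitoring.svc.cluster.local:9090).
const prometheusService = new k8s.core.v1.Service("prometheus-service", {
    metadata: {
        name: "prometheus",
        namespace: monitoringNamespace.metadata.name,
        labels: { app: "prometheus" },
    },
    spec: {
        ports: [{ port: 9090, targetPort: 9090 }],
        selector: { app: "prometheus" },
    },
});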
// Deploy Grafana
const grafana = new k8s.apps.v1.Deployment("grafana", {
    metadata: {
        namespace: monitoringNamespace.metadata.name,
        labels: { app: "grafana" },
    },
    spec: {
        replicas: 1,
        selector: { matchLabels: { app: "grafana" } },
        template: {
            metadata: { labels: { app: "grafana" } },
            spec: {
                containers: [{
                    name: "grafana",
                    image: "grafana/grafana:7.5.5",
                    ports: [{ containerPort: 3000 }],
                    env: [{
                        // Demo-only credentials; for anything beyond a test cluster, source this value
                        // from a Kubernetes Secret rather than hard-coding it.
                        name: "GF_SECURITY_ADMIN_PASSWORD",
                        value: "admin",
                    }],
                }],
            },
        },
    },
});
// Expose Grafana service
const grafanaService = new k8s.core.v1.Service("grafana-service", {
    metadata: {
        namespace: monitoringNamespace.metadata.name,
        labels: { app: "grafana" },
    },
    spec: {
        type: "NodePort",
        ports: [{ port: 3000, targetPort: 3000, nodePort: 32000 }],
        selector: { app: "grafana" },
    },
});
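As a small convenience, the Grafana NodePort can also be surfaced as a Pulumi stack output (the output name below is just an example):
// Surface the NodePort so the dashboard address is easy to find after `pulumi up`.
export const grafanaNodePort = grafanaService.spec.apply(s => s.ports?.[0]?.nodePort);
With the NodePort above, Grafana is reachable at http://<node-ip>:32000. Inside Grafana, add Prometheus as a data source (for example, http://prometheus.monitoring.svc.cluster.local:9090 via the Service defined earlier) and chart exporter metrics such as DCGM_FI_DEV_GPU_UTIL to see per-GPU utilization.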
Concluding Summary:
In this guide, we deployed NVIDIA’s DCGM exporter to collect GPU metrics, set up Prometheus to scrape them, and deployed Grafana to visualize them. Together, these components give you ongoing visibility into GPU utilization for machine learning pods running in a Kubernetes cluster.