published on Wednesday, Mar 11, 2026 by Pulumi
EKS GPU Dynamic Resource Allocation (DRA) Demo
A Pulumi program that provisions an Amazon EKS 1.34 cluster with NVIDIA GPU support using Dynamic Resource Allocation (DRA) and Multi-Instance GPU (MIG) technology.
Overview
This project demonstrates:
- EKS 1.34 cluster with GPU nodes (p4d.24xlarge with A100 40GB GPUs)
- NVIDIA GPU Operator with MIG Manager
- NVIDIA DRA driver for GPU resource allocation
- MIG configuration with multiple profile sizes (1g.5gb, 2g.10gb, 3g.20gb)
- Fashion-MNIST workloads demonstrating concurrent GPU sharing
- Prometheus + Grafana monitoring with DCGM dashboards
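As a rough orientation before diving in, the core of the Pulumi program looks something like the following. This is a minimal sketch, not the project's actual index.ts: resource names, capacities, and the use of `eks.NodeGroupV2` are illustrative assumptions.

```typescript
import * as eks from "@pulumi/eks";

// System node group: small general-purpose instances for cluster services.
const cluster = new eks.Cluster("gpu-dra-cluster", {
    version: "1.34",
    instanceType: "m6i.large",
    desiredCapacity: 2,
    minSize: 2,
    maxSize: 3,
});

// GPU node group: p4d.24xlarge (8x A100 40GB per instance). The
// nvidia.com/mig.config label tells the MIG Manager which profile to apply.
const gpuNodeGroup = new eks.NodeGroupV2("gpu-nodes", {
    cluster: cluster,
    instanceType: "p4d.24xlarge",
    desiredCapacity: 1,
    minSize: 1,
    maxSize: 1,
    labels: {
        "node-role": "gpu",
        "nvidia.com/mig.config": "all-balanced",
    },
});

export const kubeconfig = cluster.kubeconfig;
```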
Prerequisites
- Pulumi CLI (>= v3): https://www.pulumi.com/docs/get-started/install/
- Node.js (>= 14): https://nodejs.org/
- AWS credentials configured with permissions to create EKS clusters
- Pulumi ESC environment configured for authentication (pulumi-idp/auth)
Architecture
Cluster Configuration
- System Node Group: m6i.large instances for system workloads
- GPU Node Group: p4d.24xlarge instances (8× A100 40GB GPUs) with MIG enabled
- MIG Configuration: the all-balanced profile creates, per GPU (8 GPUs total):
  - 2× 1g.5gb slices
  - 1× 2g.10gb slice
  - 1× 3g.20gb slice
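The GPU Operator ships the all-balanced profile by default, so no extra configuration is needed here. For reference, a custom MIG layout could be supplied as a mig-parted style ConfigMap; the following is a hypothetical sketch only (the resource name and namespace are assumptions):

```typescript
import * as k8s from "@pulumi/kubernetes";

// Hypothetical custom MIG profile in mig-parted format. The built-in
// "all-balanced" profile already produces this layout on A100 40GB GPUs.
const migConfig = new k8s.core.v1.ConfigMap("custom-mig-config", {
    metadata: { name: "custom-mig-parted-config", namespace: "gpu-operator" },
    data: {
        "config.yaml": `version: v1
mig-configs:
  all-balanced:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1
`,
    },
});
```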
Fashion-MNIST Workloads
Three concurrent workloads demonstrate MIG GPU sharing:
- Large Training (3g.20gb): ResNet-18 training with batch size 256, ~15GB memory
- Medium Training (2g.10gb): Custom CNN training with batch size 128, ~8GB memory
- Small Inference (1g.5gb): Simple MLP inference with batch size 32, ~3GB memory
All workloads run simultaneously on the same physical GPU using different MIG slices.
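To make the sharing model concrete, here is a hedged sketch of how one of these pods could request a MIG slice. It uses device-plugin-style extended resources (`nvidia.com/mig-1g.5gb`) rather than the DRA ResourceClaim wiring the project actually uses, and the container image and command are illustrative assumptions:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Sketch: the small inference workload pinned to a 1g.5gb MIG slice.
// Image tag and command are placeholders, not the project's actual values.
const inferencePod = new k8s.core.v1.Pod("mig-small-inference-pod", {
    metadata: { name: "mig-small-inference-pod", namespace: "mig-test" },
    spec: {
        restartPolicy: "Never",
        containers: [{
            name: "inference",
            image: "nvcr.io/nvidia/pytorch:24.08-py3",
            command: ["python", "/workload/inference.py"],
            resources: {
                // Requesting a MIG slice as an extended resource; the
                // scheduler places the pod on a node exposing this slice.
                limits: { "nvidia.com/mig-1g.5gb": "1" },
            },
        }],
    },
});
```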
Getting Started
Deploy Infrastructure
Install dependencies:
```shell
npm install
```
Preview and deploy:
```shell
pulumi preview
pulumi up
```
Wait for the GPU nodes to provision and for the MIG Manager to configure the GPUs.
Verify MIG Configuration
Check GPU node status:
```shell
pulumi env run pulumi-idp/auth -- kubectl get nodes -l node-role=gpu
```
Verify MIG configuration:
```shell
pulumi env run pulumi-idp/auth -- kubectl get node <gpu-node-name> -o yaml | grep mig
```
Monitor Fashion-MNIST Workloads
Check pod status:
```shell
pulumi env run pulumi-idp/auth -- kubectl get pods -n mig-test -w
```
View large training logs:
```shell
pulumi env run pulumi-idp/auth -- kubectl logs -f mig-large-training-pod -n mig-test
```
View medium training logs:
```shell
pulumi env run pulumi-idp/auth -- kubectl logs -f mig-medium-training-pod -n mig-test
```
View small inference logs:
```shell
pulumi env run pulumi-idp/auth -- kubectl logs -f mig-small-inference-pod -n mig-test
```
Verify all pods are on the same GPU:
```shell
pulumi env run pulumi-idp/auth -- kubectl exec mig-large-training-pod -n mig-test -- nvidia-smi
```
Access Grafana Dashboard
Get Grafana LoadBalancer URL:
```shell
pulumi env run pulumi-idp/auth -- kubectl get svc -n monitoring kube-prometheus-stack-grafana
```
Access Grafana at the LoadBalancer URL:
- Username: admin
- Password: gpu-monitoring-demo

Navigate to the “NVIDIA DCGM MIG” dashboard to view GPU metrics.
Expected Results
- All three pods should reach Running state
- Training pods should show increasing accuracy over epochs
- Inference pod should show continuous throughput
- Grafana should display 60-90% GPU utilization across MIG slices
- All workloads should be sharing the same physical GPU
- No OOM errors or pod evictions
Project Layout
- `index.ts` - Main Pulumi program
- `mig-policy/` - Pulumi Policy Pack for MIG profile enforcement
- `package.json` - Node.js dependencies
- `tsconfig.json` - TypeScript compiler options
- `Pulumi.yaml` - Pulumi project metadata
Configuration
| Key | Description | Default |
|---|---|---|
| clusterName | Name for the EKS cluster | gpu-dra-cluster |
| aws:region | AWS region to deploy resources | Set in stack config |
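Inside the program, the `clusterName` key would typically be read through Pulumi's config API with a fallback to the documented default. A minimal sketch, assuming the default shown in the table above:

```typescript
import * as pulumi from "@pulumi/pulumi";

const config = new pulumi.Config();

// Fall back to the documented default when clusterName is not set
// via `pulumi config set clusterName <name>`.
const clusterName = config.get("clusterName") ?? "gpu-dra-cluster";
```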
Cleanup
To destroy all resources:
```shell
pulumi destroy
pulumi stack rm
```
Note: GPU instances are expensive. Destroy resources when not in use.
Troubleshooting
Pods Stuck in Pending
- Check GPU node status and MIG configuration
- Verify the DRA driver is running: `kubectl get pods -n nvidia-dra-driver`
- Check GPU Operator status: `kubectl get pods -n gpu-operator`
MIG Configuration Not Applied
- Check MIG Manager logs: `kubectl logs -n gpu-operator -l app=nvidia-mig-manager`
- Verify node labels: `kubectl get nodes -l nvidia.com/mig.config=all-balanced`
- A node reboot may be required (the MIG Manager sets `WITH_REBOOT=true`)
Fashion-MNIST Downloads Failing
- Pods require internet access to download Fashion-MNIST dataset
- Verify NAT Gateway is configured for private subnets
- Check pod logs for download errors