published on Wednesday, Mar 11, 2026 by Pulumi
EKS GPU Dynamic Resource Allocation (DRA) Demo
A Pulumi program that provisions an Amazon EKS 1.34 cluster with NVIDIA GPU support using Dynamic Resource Allocation (DRA) and Multi-Instance GPU (MIG) technology.
Overview
This project demonstrates:
- EKS 1.34 cluster with GPU nodes (p4d.24xlarge with A100 40GB GPUs)
- NVIDIA GPU Operator with MIG Manager
- NVIDIA DRA driver for GPU resource allocation
- MIG configuration with multiple profile sizes (1g.5gb, 2g.10gb, 3g.20gb)
- Fashion-MNIST workloads demonstrating concurrent GPU sharing
- Prometheus + Grafana monitoring with DCGM dashboards
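As a rough orientation before diving in, the core of the Pulumi program looks something like the following. This is a minimal sketch, not the project's actual index.ts: resource names, capacities, and the use of `eks.NodeGroupV2` are illustrative assumptions.

```typescript
import * as eks from "@pulumi/eks";

// System node group: small general-purpose instances for cluster services.
const cluster = new eks.Cluster("gpu-dra-cluster", {
    version: "1.34",
    instanceType: "m6i.large",
    desiredCapacity: 2,
    minSize: 2,
    maxSize: 3,
});

// GPU node group: p4d.24xlarge (8x A100 40GB per instance). The
// nvidia.com/mig.config label tells the MIG Manager which profile to apply.
const gpuNodeGroup = new eks.NodeGroupV2("gpu-nodes", {
    cluster: cluster,
    instanceType: "p4d.24xlarge",
    desiredCapacity: 1,
    minSize: 1,
    maxSize: 1,
    labels: {
        "node-role": "gpu",
        "nvidia.com/mig.config": "all-balanced",
    },
});

export const kubeconfig = cluster.kubeconfig;
```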
Prerequisites
- Pulumi CLI (>= v3): https://www.pulumi.com/docs/get-started/install/
- Node.js (>= 14): https://nodejs.org/
- AWS credentials configured with permissions to create EKS clusters
- Pulumi ESC environment configured for authentication (pulumi-idp/auth)
Architecture
Cluster Configuration
- System Node Group: m6i.large instances for system workloads
- GPU Node Group: p4d.24xlarge instances (8× A100 40GB GPUs) with MIG enabled
- MIG Configuration: the all-balanced profile creates, per GPU (8 GPUs total):
  - 2× 1g.5gb slices
  - 1× 2g.10gb slice
  - 1× 3g.20gb slice
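The GPU Operator ships the all-balanced profile by default, so no extra configuration is needed here. For reference, a custom MIG layout could be supplied as a mig-parted style ConfigMap; the following is a hypothetical sketch only (the resource name and namespace are assumptions):

```typescript
import * as k8s from "@pulumi/kubernetes";

// Hypothetical custom MIG profile in mig-parted format. The built-in
// "all-balanced" profile already produces this layout on A100 40GB GPUs.
const migConfig = new k8s.core.v1.ConfigMap("custom-mig-config", {
    metadata: { name: "custom-mig-parted-config", namespace: "gpu-operator" },
    data: {
        "config.yaml": `version: v1
mig-configs:
  all-balanced:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1
`,
    },
});
```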
Fashion-MNIST Workloads
Three concurrent workloads demonstrate MIG GPU sharing:
- Large Training (3g.20gb): ResNet-18 training with batch size 256, ~15GB memory
- Medium Training (2g.10gb): Custom CNN training with batch size 128, ~8GB memory
- Small Inference (1g.5gb): Simple MLP inference with batch size 32, ~3GB memory
All workloads run simultaneously on the same physical GPU using different MIG slices.
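To make the sharing model concrete, here is a hedged sketch of how one of these pods could request a MIG slice. It uses device-plugin-style extended resources (`nvidia.com/mig-1g.5gb`) rather than the DRA ResourceClaim wiring the project actually uses, and the container image and command are illustrative assumptions:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Sketch: the small inference workload pinned to a 1g.5gb MIG slice.
// Image tag and command are placeholders, not the project's actual values.
const inferencePod = new k8s.core.v1.Pod("mig-small-inference-pod", {
    metadata: { name: "mig-small-inference-pod", namespace: "mig-test" },
    spec: {
        restartPolicy: "Never",
        containers: [{
            name: "inference",
            image: "nvcr.io/nvidia/pytorch:24.08-py3",
            command: ["python", "/workload/inference.py"],
            resources: {
                // Requesting a MIG slice as an extended resource; the
                // scheduler places the pod on a node exposing this slice.
                limits: { "nvidia.com/mig-1g.5gb": "1" },
            },
        }],
    },
});
```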
Getting Started
Deploy Infrastructure
Install dependencies:
```shell
npm install
```
Preview and deploy:
```shell
pulumi preview
pulumi up
```
Wait for the GPU nodes to provision and for the MIG Manager to configure the GPUs.
Verify MIG Configuration
Check GPU node status:
```shell
pulumi env run pulumi-idp/auth -- kubectl get nodes -l node-role=gpu
```
Verify MIG configuration:
```shell
pulumi env run pulumi-idp/auth -- kubectl get node <gpu-node-name> -o yaml | grep mig
```
Monitor Fashion-MNIST Workloads
Check pod status:
```shell
pulumi env run pulumi-idp/auth -- kubectl get pods -n mig-test -w
```
View large training logs:
```shell
pulumi env run pulumi-idp/auth -- kubectl logs -f mig-large-training-pod -n mig-test
```
View medium training logs:
```shell
pulumi env run pulumi-idp/auth -- kubectl logs -f mig-medium-training-pod -n mig-test
```
View small inference logs:
```shell
pulumi env run pulumi-idp/auth -- kubectl logs -f mig-small-inference-pod -n mig-test
```
Verify all pods are on the same GPU:
```shell
pulumi env run pulumi-idp/auth -- kubectl exec mig-large-training-pod -n mig-test -- nvidia-smi
```
Access Grafana Dashboard
Get Grafana LoadBalancer URL:
```shell
pulumi env run pulumi-idp/auth -- kubectl get svc -n monitoring kube-prometheus-stack-grafana
```
Access Grafana at the LoadBalancer URL:
- Username: admin
- Password: gpu-monitoring-demo

Navigate to the “NVIDIA DCGM MIG” dashboard to view GPU metrics.
Expected Results
- All three pods should reach Running state
- Training pods should show increasing accuracy over epochs
- Inference pod should show continuous throughput
- Grafana should display 60-90% GPU utilization across MIG slices
- All workloads should be sharing the same physical GPU
- No OOM errors or pod evictions
Project Layout
- `index.ts` - Main Pulumi program
- `mig-policy/` - Pulumi Policy Pack for MIG profile enforcement
- `package.json` - Node.js dependencies
- `tsconfig.json` - TypeScript compiler options
- `Pulumi.yaml` - Pulumi project metadata
Configuration
| Key | Description | Default |
|---|---|---|
| clusterName | Name for the EKS cluster | gpu-dra-cluster |
| aws:region | AWS region to deploy resources | Set in stack config |
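Inside the program, the `clusterName` key would typically be read through Pulumi's config API with a fallback to the documented default. A minimal sketch, assuming the default shown in the table above:

```typescript
import * as pulumi from "@pulumi/pulumi";

const config = new pulumi.Config();

// Fall back to the documented default when clusterName is not set
// via `pulumi config set clusterName <name>`.
const clusterName = config.get("clusterName") ?? "gpu-dra-cluster";
```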
Cleanup
To destroy all resources:
```shell
pulumi destroy
pulumi stack rm
```
Note: GPU instances are expensive. Destroy resources when not in use.
Troubleshooting
Pods Stuck in Pending
- Check GPU node status and MIG configuration
- Verify the DRA driver is running: `kubectl get pods -n nvidia-dra-driver`
- Check GPU Operator status: `kubectl get pods -n gpu-operator`
MIG Configuration Not Applied
- Check MIG Manager logs: `kubectl logs -n gpu-operator -l app=nvidia-mig-manager`
- Verify node labels: `kubectl get nodes -l nvidia.com/mig.config=all-balanced`
- A node reboot may be required (the MIG Manager sets `WITH_REBOOT=true`)
Fashion-MNIST Downloads Failing
- Pods require internet access to download Fashion-MNIST dataset
- Verify NAT Gateway is configured for private subnets
- Check pod logs for download errors