1. GPU Driver Installation via VM Extensions for Deep Learning

    To install GPU drivers on a virtual machine (VM) for deep learning, we can use Pulumi to manage the infrastructure. The process involves creating a VM and then using a VM extension to install the necessary GPU drivers automatically. Microsoft Azure is a common choice for such tasks because it offers VM sizes with specialized GPUs and supports extensions for customizing and configuring VMs post-deployment.

    In this Pulumi program, we'll set up an Azure virtual machine with a GPU and install the necessary drivers using the NVIDIA GPU driver VM extension.

    We will use the azure-native provider to interact with Microsoft Azure. The program relies on the following resources:

    • ResourceGroup: A container that holds related resources for an Azure solution.
    • VirtualMachine: The Azure resource for creating and managing a VM.
    • VirtualMachineExtension: An Azure resource that adds new capabilities to a VM; in this case, the NVIDIA GPU driver extension, which installs the drivers on the running VM.

    First, we create a resource group. Then we provision a VM with one of Azure's GPU-enabled sizes. Once the VM is up and running, we apply a VM extension that installs the NVIDIA GPU driver.

    To run this program, make sure you have:

    • Installed the Pulumi CLI.
    • Configured your Azure credentials for Pulumi, either with the az login command or by setting the appropriate environment variables (a quick sanity check for the latter is sketched below).
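
    If you use a service principal rather than az login, the azure-native provider reads its credentials from environment variables. The following optional snippet is a minimal sanity check, assuming the standard ARM_* variable names the azure-native provider documents:

    import os

    # Credentials the azure-native provider reads when not using `az login`.
    required = ['ARM_CLIENT_ID', 'ARM_CLIENT_SECRET', 'ARM_TENANT_ID', 'ARM_SUBSCRIPTION_ID']
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise SystemExit(f'Missing Azure credentials: {", ".join(missing)}')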

    Below is a complete Pulumi program in Python that demonstrates these steps:

    import pulumi
    import pulumi_azure_native as azure_native
    from pulumi_azure_native import compute, network, resources

    # Create a resource group for the VM.
    resource_group = resources.ResourceGroup('gpu-resource-group')

    # Create a virtual network and subnet for the VM.
    vnet = network.VirtualNetwork(
        'gpu-vnet',
        resource_group_name=resource_group.name,
        address_space=network.AddressSpaceArgs(
            address_prefixes=['10.0.0.0/16'],
        ),
        subnets=[network.SubnetArgs(
            name='default',
            address_prefix='10.0.1.0/24',
        )],
    )

    # Create a public IP address for the VM.
    public_ip = network.PublicIPAddress(
        'gpu-public-ip',
        resource_group_name=resource_group.name,
        public_ip_allocation_method=network.IPAllocationMethod.DYNAMIC,
    )

    # Create a network interface for the VM with the public IP address.
    network_interface = network.NetworkInterface(
        'gpu-nic',
        resource_group_name=resource_group.name,
        ip_configurations=[network.NetworkInterfaceIPConfigurationArgs(
            name='default',
            subnet=network.SubnetArgs(
                id=vnet.subnets.apply(lambda subnets: subnets[0].id),
            ),
            public_ip_address=network.PublicIPAddressArgs(
                id=public_ip.id,
            ),
        )],
    )

    # Define the virtual machine, using a size that includes a GPU.
    # "Standard_NC6" includes a Tesla K80; the image reference uses an
    # Ubuntu Server image compatible with GPU-based workloads.
    vm = compute.VirtualMachine(
        'gpu-vm',
        resource_group_name=resource_group.name,
        network_profile=compute.NetworkProfileArgs(
            network_interfaces=[compute.NetworkInterfaceReferenceArgs(
                id=network_interface.id,
            )],
        ),
        hardware_profile=compute.HardwareProfileArgs(
            vm_size='Standard_NC6',
        ),
        os_profile=compute.OSProfileArgs(
            computer_name='gpuvm',
            admin_username='azureuser',
            linux_configuration=compute.LinuxConfigurationArgs(
                disable_password_authentication=True,
                ssh=compute.SshConfigurationArgs(
                    public_keys=[compute.SshPublicKeyArgs(
                        path='/home/azureuser/.ssh/authorized_keys',
                        key_data='ssh-rsa ...',  # Replace with your actual SSH public key
                    )],
                ),
            ),
        ),
        storage_profile=compute.StorageProfileArgs(
            os_disk=compute.OSDiskArgs(
                create_option=compute.DiskCreateOptionTypes.FROM_IMAGE,
            ),
            image_reference=compute.ImageReferenceArgs(
                publisher='Canonical',
                offer='UbuntuServer',
                sku='18.04-LTS',
                version='latest',
            ),
        ),
    )

    # Finally, install the GPU driver via the VM extension mechanism.
    # The NvidiaGpuDriverLinux extension downloads and installs the driver.
    nvidia_driver_extension = compute.VirtualMachineExtension(
        'nvidia-gpu-driver-extension',
        resource_group_name=resource_group.name,
        vm_name=vm.name,
        publisher='Microsoft.HpcCompute',
        type='NvidiaGpuDriverLinux',
        type_handler_version='1.3',
        auto_upgrade_minor_version=True,
        location=resource_group.location,
    )

    # Export the public IP to access the VM, and the resource group name
    # for future reference.
    pulumi.export('public_ip', public_ip.ip_address)
    pulumi.export('resource_group', resource_group.name)

    This program defines and provisions all the necessary resources to create an Azure Virtual Machine with a GPU and installs the NVIDIA GPU driver using Pulumi.

    Make sure you replace 'ssh-rsa ...' with your actual SSH public key.
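
    If your workload runs on Windows instead, the same mechanism applies. The sketch below is a hypothetical variant, assuming the Windows counterpart of the extension (NvidiaGpuDriverWindows, also published by Microsoft.HpcCompute) and a Windows-based VM named vm defined as above; check the extension's current handler version before relying on it:

    # Hypothetical Windows variant of the driver extension (assumes a
    # Windows Server VM `vm`; handler version may differ on Windows).
    nvidia_driver_extension_win = compute.VirtualMachineExtension(
        'nvidia-gpu-driver-extension-win',
        resource_group_name=resource_group.name,
        vm_name=vm.name,
        publisher='Microsoft.HpcCompute',
        type='NvidiaGpuDriverWindows',
        type_handler_version='1.3',
        auto_upgrade_minor_version=True,
        location=resource_group.location,
    )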

    To run this program, save it as __main__.py and execute pulumi up on your command line. This will prompt Pulumi to provision the resources as defined.
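
    Alternatively, if you prefer to drive the deployment from Python rather than the CLI, Pulumi's Automation API can run the same program. A minimal sketch, assuming the project files (Pulumi.yaml and __main__.py) live in the current directory and a stack named dev:

    from pulumi import automation as auto

    # Select (or create) a stack backed by the local project directory and
    # run the equivalent of `pulumi up`, streaming engine output to stdout.
    stack = auto.create_or_select_stack(stack_name='dev', work_dir='.')
    up_result = stack.up(on_output=print)
    print('public_ip:', up_result.outputs['public_ip'].value)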

    After the GPU driver is installed, you can SSH into the machine using the public IP address shown in the output of pulumi up. Once logged in, you can set up your deep learning environment by installing a framework such as TensorFlow or PyTorch; a quick way to confirm the driver is working is sketched below.
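
    On the VM itself, running nvidia-smi is the most direct check. Alternatively, after installing a framework, you can ask it directly; a minimal check, assuming you have installed PyTorch on the VM (e.g. pip install torch):

    import torch

    # True once the NVIDIA driver and CUDA runtime are visible to PyTorch.
    print(torch.cuda.is_available())
    # On a Standard_NC6 this should report a Tesla K80.
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))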