Running distributed deep learning with GPUs on Azure ML Compute Clusters
To run distributed deep learning with GPUs on Azure ML Compute Clusters, you first need an Azure Machine Learning workspace, then a compute cluster with GPU-enabled virtual machines. Within the Azure ML service, a compute cluster provides scalable virtual machine resources for running machine learning tasks. By requesting a GPU SKU, you ensure the virtual machines have the GPU capabilities your deep learning tasks need.
Below is a Pulumi program written in TypeScript that sets up an Azure Machine Learning workspace and a GPU-enabled compute cluster, using resources from the `azure-native` Pulumi provider. The program performs the following steps:
- It creates an Azure resource group. This is a container that holds related resources for an Azure solution.
- It establishes an Azure Machine Learning workspace, which is a foundational resource in the cloud that you use to experiment, train, and deploy machine learning models.
- It provisions a compute cluster with GPU-enabled virtual machines by specifying a SKU that includes GPUs.
Before you start, ensure you have the Pulumi CLI installed and configured for access to your Azure subscription. Install the required package with `npm install @pulumi/azure-native`. Here is the program:
```typescript
import * as pulumi from "@pulumi/pulumi";
import * as azure from "@pulumi/azure-native";

// Define configuration variables
const config = new pulumi.Config();
const location = config.get("location") || "East US";
const gpuSku = "Standard_NC6"; // An Azure VM size with GPU capabilities.

// Create an Azure Resource Group
const resourceGroup = new azure.resources.ResourceGroup("myresourcegroup", {
    location,
});

// Create an Azure Machine Learning workspace
const workspace = new azure.machinelearningservices.Workspace("myworkspace", {
    resourceGroupName: resourceGroup.name,
    sku: {
        name: "Basic",
    },
    location: resourceGroup.location,
    // The workspace needs a managed identity to access its associated resources
    identity: {
        type: "SystemAssigned",
    },
});

// Create a GPU-enabled Azure Machine Learning compute cluster
const computeCluster = new azure.machinelearningservices.Compute("mycomputegpu", {
    resourceGroupName: resourceGroup.name,
    workspaceName: workspace.name,
    computeName: "gpufordeeplearning",
    location: resourceGroup.location,
    properties: {
        computeType: "AmlCompute", // Use Azure Machine Learning Compute
        properties: {
            // A VM size known to have GPU capabilities
            vmSize: gpuSku,
            // Scale settings for the compute cluster
            scaleSettings: {
                maxNodeCount: 4, // Max number of nodes to scale out to while training
                minNodeCount: 0, // Min number of nodes to always keep alive
                nodeIdleTimeBeforeScaleDown: "PT5M", // Scale down after 5 idle minutes
            },
        },
    },
});

// Export the details of the compute cluster
export const computeClusterId = computeCluster.id;
export const workspaceName = workspace.name;
```
In this program, `gpuSku` specifies the Azure VM size for the nodes in the compute cluster. The `Standard_NC6` VM size includes one GPU, which is suitable for many deep learning tasks. Adjust the maximum number of nodes (`maxNodeCount`) in the cluster's scale settings to match your specific requirements and budget. VM sizes and SKUs with different GPU configurations are available and should be selected based on the workload's demands and cost considerations.

After running this Pulumi code, you'll have a compute cluster ready to perform distributed deep learning with GPUs. Check Azure's documentation for the latest VM SKUs when choosing a VM size, and ensure that your Azure quotas allow the creation of the desired resources.
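Because `nodeIdleTimeBeforeScaleDown` is an ISO 8601 duration string (e.g. `"PT5M"` for five minutes), a malformed value is only rejected by the Azure API at deploy time. A small sketch of a validation helper is below; the names `ScaleSettings` and `buildScaleSettings` are illustrative, not part of the `azure-native` SDK:

```typescript
// Hypothetical helper that builds the scaleSettings object passed to the
// Compute resource, catching invalid values before they reach Azure.
interface ScaleSettings {
    minNodeCount: number;
    maxNodeCount: number;
    nodeIdleTimeBeforeScaleDown: string;
}

function buildScaleSettings(min: number, max: number, idleMinutes: number): ScaleSettings {
    if (!Number.isInteger(min) || min < 0) {
        throw new Error(`minNodeCount must be a non-negative integer, got ${min}`);
    }
    if (!Number.isInteger(max) || max < min) {
        throw new Error(`maxNodeCount must be an integer >= minNodeCount, got ${max}`);
    }
    if (!Number.isInteger(idleMinutes) || idleMinutes <= 0) {
        throw new Error(`idleMinutes must be a positive integer, got ${idleMinutes}`);
    }
    return {
        minNodeCount: min,
        maxNodeCount: max,
        // Azure expects an ISO 8601 duration, e.g. "PT5M" for five minutes.
        nodeIdleTimeBeforeScaleDown: `PT${idleMinutes}M`,
    };
}

console.log(JSON.stringify(buildScaleSettings(0, 4, 5)));
```

You could then pass the helper's result directly as the `scaleSettings` value in the compute cluster definition above.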
To deploy the code above, save it in a file named `index.ts` and run `pulumi up` from the same directory. Pulumi will then perform the deployment to Azure.
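A typical deployment session might look like the following. The stack name `dev` and the region value are illustrative; the `location` config key is the one read by the program above:

```shell
# Create a stack to hold this deployment's state (the name "dev" is arbitrary)
pulumi stack init dev

# Optionally override the default region read by the program
pulumi config set location "East US"

# Preview the changes and deploy
pulumi up
```

After `pulumi up` completes, the exported `computeClusterId` and `workspaceName` values are shown as stack outputs.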