GPU-Accelerated Clusters in OCI for Deep Learning Workloads
To set up GPU-accelerated clusters in Oracle Cloud Infrastructure (OCI) for deep learning workloads, we will use the oci.core.ComputeCluster and oci.core.ClusterNetwork resources. Together, these two resources let us define a cluster of compute instances that communicate over a high-speed network suited to high-performance computing (HPC) tasks such as deep learning.
Here's what each resource accomplishes:
- oci.core.ComputeCluster: This resource creates a cluster of compute instances, allowing us to specify the hardware and software configuration, such as the GPU and other compute options.
- oci.core.ClusterNetwork: With this resource, we create a network that lets instances within the compute cluster communicate with low latency and high bandwidth, which is essential for deep learning tasks where nodes often need to share data quickly.
Let's go through the steps to create a simple GPU-accelerated cluster:
- We'll begin by configuring the cluster network to connect the compute instances.
- Next, we will define the compute cluster itself, specifying the necessary details such as the type of GPUs and number of cores.
Below is a Pulumi program that demonstrates how to create a GPU-accelerated compute cluster for deep learning workloads in OCI using Python.
```python
import pulumi
import pulumi_oci as oci

# Replace these variables with your own values
compartment_id = "ocid1.compartment.oc1..exampleuniqueID"
availability_domain = "Uocm:PHX-AD-1"
subnet_id = "ocid1.subnet.oc1.phx.exampleuniqueID"
image_id = "ocid1.image.oc1.phx.exampleuniqueID"  # Select a GPU-enabled image
shape_name = "VM.GPU2.1"  # This is an example shape; use a GPU-enabled shape

# Create a Cluster Network
cluster_network = oci.core.ClusterNetwork(
    "gpuClusterNetwork",
    compartment_id=compartment_id,
    defined_tags={
        "Owner": "DeepLearningTeam",
    },
    freeform_tags={
        "Project": "GPUDLCluster",
    },
    instance_pools=[
        {
            "size": 2,  # Specify the number of instances in the pool
            # NOTE: this field expects an instance configuration OCID, not an
            # image OCID; substitute the OCID of an instance configuration
            # built from your GPU-enabled image.
            "instance_configuration_id": image_id,
            "placement_configurations": [
                {
                    "availability_domain": availability_domain,
                    "primary_subnet_id": subnet_id,
                }
            ],
            "defined_tags": {
                "Cluster": "GPUClusterPool",
            },
        }
    ],
)

# Create a Compute Cluster tied to the Cluster Network.
# NOTE: verify cluster_network_id, instance_shape_name, and instance_source
# against the pulumi_oci ComputeCluster reference for your provider version;
# they are not confirmed here and the provider may reject them.
compute_cluster = oci.core.ComputeCluster(
    "gpuComputeCluster",
    compartment_id=compartment_id,
    defined_tags={
        "Owner": "DeepLearningTeam",
    },
    freeform_tags={
        "Project": "GPUDLCluster",
    },
    availability_domain=availability_domain,
    cluster_network_id=cluster_network.id,  # Attach to our previously created Cluster Network
    instance_shape_name=shape_name,
    instance_source={"id": image_id},  # Reference to a GPU-enabled image; the dict form is an assumption
)

# Outputs
pulumi.export("cluster_network_id", cluster_network.id)
pulumi.export("compute_cluster_id", compute_cluster.id)
```
Here's what our program does:
- It starts by defining the compartment_id, availability_domain, subnet_id, image_id, and shape_name variables that are used to configure the cluster. These values are specific to your OCI environment and should be replaced with your own values.
- Then we create a ClusterNetwork with the oci.core.ClusterNetwork class. We set some tags for organizational purposes and specify an instance pool with the desired instance configuration and placement details.
- We proceed to create a ComputeCluster tied to the ClusterNetwork we created earlier. We use the oci.core.ComputeCluster class to create this resource, specifying the compartment, tags, availability domain, network ID, instance shape, and the source image ID.
- Finally, we export the cluster network ID and compute cluster ID for easy access and reference.
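Because the program only works once the placeholder values are replaced, a small pre-flight check can catch obvious mistakes before deployment. The sketch below is an illustration rather than part of the program: the helper names check_ocid and find_problems are hypothetical, and it assumes only the general OCID layout, ocid1.<resource type>.<realm>.[region].<unique ID>. It inspects the text of each value and never contacts OCI, so it cannot confirm that a resource actually exists.

```python
# Hypothetical pre-flight check for the configuration values.
# It only looks at the shape of each string; it makes no OCI API calls.

EXPECTED_TYPES = {
    "compartment_id": "compartment",
    "subnet_id": "subnet",
    "image_id": "image",
}


def check_ocid(value: str, resource_type: str) -> bool:
    """Return True if value looks like an OCID of the given resource type."""
    parts = value.split(".")
    # Expected pieces: "ocid1", resource type, realm, region (may be empty), unique ID.
    if len(parts) < 5:
        return False
    return parts[0] == "ocid1" and parts[1] == resource_type and parts[-1] != ""


def find_problems(settings: dict) -> list:
    """Collect human-readable problems found in the settings dictionary."""
    problems = []
    for name, resource_type in EXPECTED_TYPES.items():
        value = settings.get(name, "")
        if not check_ocid(value, resource_type):
            problems.append(f"{name} is not a well-formed {resource_type} OCID")
        elif "example" in value.lower():
            problems.append(f"{name} still holds an example placeholder")
    # Heuristic only: GPU shape names in this document contain "GPU".
    if "GPU" not in settings.get("shape_name", "").upper():
        problems.append("shape_name does not look like a GPU shape")
    return problems


settings = {
    "compartment_id": "ocid1.compartment.oc1..exampleuniqueID",
    "subnet_id": "ocid1.subnet.oc1.phx.exampleuniqueID",
    "image_id": "ocid1.image.oc1.phx.exampleuniqueID",
    "shape_name": "VM.GPU2.1",
}
for problem in find_problems(settings):
    print(problem)
```

Run against the placeholder values from the program, the check flags the three example OCIDs and accepts the shape name. The same function could be called at the top of the Pulumi program to fail early, though that wiring is left out here.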
After defining and deploying this Pulumi program, you will have a GPU-accelerated cluster that can be used for various deep learning workloads. The individual instances within the cluster can be configured with deep learning frameworks and datasets to begin running powerful computations.
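Before installing frameworks on the instances, it can help to confirm that each node actually sees its GPU. The following is a minimal sketch, assuming the NVIDIA driver and its nvidia-smi tool are already present on the image; the function names parse_gpu_lines and list_gpus are hypothetical. It uses nvidia-smi's --query-gpu and --format=csv,noheader options and returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def parse_gpu_lines(text: str) -> list:
    """Parse 'name, memory' CSV lines into (name, memory) tuples."""
    gpus = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so a comma inside the name is preserved.
        name, _, memory = line.rpartition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def list_gpus() -> list:
    """Return the GPUs reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_lines(result.stdout)


if __name__ == "__main__":
    found = list_gpus()
    if not found:
        print("No GPU detected; check the shape, image, and driver installation.")
    for name, memory in found:
        print(f"{name}: {memory}")
```

Running this script on every node in the pool gives a quick sanity check that the chosen shape and image expose the expected hardware before any training job is scheduled.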