GPU-Accelerated Clusters in OCI for Deep Learning Workloads

Question

Pulumi · Accepted Answer

To set up GPU-accelerated clusters in Oracle Cloud Infrastructure (OCI) for deep learning workloads, we will be using the `oci.core.ClusterNetwork` and `oci.core.ComputeCluster` resources from the Pulumi OCI provider. Both place compute instances on a high-bandwidth, low-latency RDMA network suitable for high-performance computing (HPC) tasks such as distributed deep learning.

Here's what each resource accomplishes:

1. `oci.core.ComputeCluster`: This resource creates a compute cluster, which is a remote direct memory access (RDMA) network group. The resource itself takes only a compartment, an availability domain, and optional tags. It does not describe hardware; GPU instances are launched into it afterwards and managed individually.

2. `oci.core.ClusterNetwork`: This resource creates a managed pool of HPC or GPU instances connected by a low-latency, high-bandwidth RDMA network, which is essential for deep learning tasks where nodes exchange gradients and data frequently. The shape and image of the instances come from an instance configuration that the pool references.
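To make the "managed individually" part concrete, here is a minimal sketch of the arguments one instance might be launched with so that it joins a compute cluster. The helper name `compute_cluster_instance_args` is hypothetical and only builds a plain dictionary; the assumption is that the result is passed to `oci.core.Instance`, whose `compute_cluster_id` argument places the instance in the cluster.

```python
from typing import Any, Dict


def compute_cluster_instance_args(
    compartment_id: str,
    availability_domain: str,
    subnet_id: str,
    image_id: str,
    shape: str,
    compute_cluster_id: Any,
) -> Dict[str, Any]:
    """Build keyword arguments for one instance that joins a compute cluster.

    Hypothetical helper for illustration; it is not part of pulumi_oci.
    """
    return {
        "compartment_id": compartment_id,
        "availability_domain": availability_domain,
        "shape": shape,
        # This argument is what places the instance on the cluster's RDMA network.
        "compute_cluster_id": compute_cluster_id,
        "create_vnic_details": {"subnet_id": subnet_id},
        "source_details": {"source_type": "image", "source_id": image_id},
    }


# Inside a Pulumi program the result could be used like this (not run here):
#   oci.core.Instance("gpuNode0", **compute_cluster_instance_args(...))
```

Because each instance is its own resource, nodes can be added or removed one at a time, whereas a cluster network scales the whole pool through its `size`.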

Let's go through the steps to create a simple GPU-accelerated cluster:

- We'll begin by configuring the cluster network, whose instance pool launches the GPU instances from an instance configuration (shape and image).
- Next, we will define the compute cluster, an RDMA network group in the same availability domain that individually launched GPU instances can join.

Below is a Pulumi program that demonstrates how to create a GPU-accelerated compute cluster for deep learning workloads in OCI using Python.

```python
import pulumi
import pulumi_oci as oci

# Replace these placeholder values with ones from your own tenancy
compartment_id = "ocid1.compartment.oc1..exampleuniqueID"
availability_domain = "Uocm:PHX-AD-1"
subnet_id = "ocid1.subnet.oc1.phx.exampleuniqueID"
image_id = "ocid1.image.oc1.phx.exampleuniqueID"  # Select a GPU-enabled image
# Example shape only: cluster networks need an RDMA-capable bare metal shape,
# which a VM shape such as VM.GPU2.1 is not.
shape_name = "BM.GPU4.8"

# Freeform tags need no setup. Defined tags would require an existing tag
# namespace and "namespace.key" keys, so they are left out of this example.
common_tags = {
    "Owner": "DeepLearningTeam",
    "Project": "GPUDLCluster",
}

# Instance configuration: the launch template (shape and image) that the
# cluster network's instance pool uses to create each GPU node
instance_configuration = oci.core.InstanceConfiguration("gpuInstanceConfiguration",
    compartment_id=compartment_id,
    freeform_tags=common_tags,
    instance_details={
        "instance_type": "compute",
        "launch_details": {
            "compartment_id": compartment_id,
            "shape": shape_name,
            "source_details": {
                "source_type": "image",
                "image_id": image_id,
            },
        },
    },
)

# Create a Cluster Network: a managed pool of instances on an RDMA network
cluster_network = oci.core.ClusterNetwork("gpuClusterNetwork",
    compartment_id=compartment_id,
    freeform_tags=common_tags,
    instance_pools=[
        {
            "size": 2,  # Number of instances in the pool
            # Must be an instance configuration OCID, not an image OCID
            "instance_configuration_id": instance_configuration.id,
            "freeform_tags": {"Cluster": "GPUClusterPool"},
        }
    ],
    # Placement is set on the cluster network, not inside each pool
    placement_configuration={
        "availability_domain": availability_domain,
        "primary_subnet_id": subnet_id,
    },
)

# Create a Compute Cluster: an empty RDMA network group that is independent of
# the cluster network. Instances join it one at a time by setting
# compute_cluster_id when they are launched.
compute_cluster = oci.core.ComputeCluster("gpuComputeCluster",
    compartment_id=compartment_id,
    availability_domain=availability_domain,
    freeform_tags=common_tags,
)

# Outputs
pulumi.export('cluster_network_id', cluster_network.id)
pulumi.export('compute_cluster_id', compute_cluster.id)
```

Here's what our program does:

1. It starts by defining the `compartment_id`, `availability_domain`, `subnet_id`, `image_id`, and `shape_name` variables that are used to configure the cluster. These values are specific to your OCI environment and must be replaced with your own. The shape has to be an RDMA-capable bare metal shape; `BM.GPU4.8` is only an example.

2. It creates an `InstanceConfiguration` with the `oci.core.InstanceConfiguration` class. This is the launch template that records the shape and the GPU-enabled image for the nodes.

3. Then we create a `ClusterNetwork` with the `oci.core.ClusterNetwork` class. We set some freeform tags for organizational purposes, specify an instance pool of two instances that references the instance configuration, and give the placement details (availability domain and subnet) at the cluster network level.

4. We proceed to create a `ComputeCluster` with the `oci.core.ComputeCluster` class, specifying only the compartment, availability domain, and tags. It is not attached to the cluster network and has no shape or image of its own; instances are added to it individually.

5. Finally, we export the cluster network ID and compute cluster ID for easy access and reference.
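Since the values at the top of the program are placeholders, a small pre-flight check can catch a forgotten substitution before anything is deployed. The following is a rough sketch in plain Python: `find_config_problems` and `EXPECTED_TYPES` are hypothetical names, the OCID check only looks at the dot-separated layout (`ocid1.<resource type>.<realm>.[region].<unique ID>`), and the `BM.` prefix test is a heuristic rather than a list of supported shapes.

```python
from typing import Dict, List

# Resource types accepted in the second OCID segment for each setting.
# A tenancy OCID is allowed for compartment_id because the root compartment
# is addressed by the tenancy OCID.
EXPECTED_TYPES = {
    "compartment_id": {"compartment", "tenancy"},
    "subnet_id": {"subnet"},
    "image_id": {"image"},
}


def find_config_problems(settings: Dict[str, str]) -> List[str]:
    """Return human-readable problems found in the cluster settings."""
    problems = []
    for name, allowed in EXPECTED_TYPES.items():
        parts = settings.get(name, "").split(".")
        if len(parts) < 5 or parts[0] != "ocid1":
            problems.append(f"{name}: not an OCID")
        elif parts[1] not in allowed:
            problems.append(f"{name}: unexpected resource type '{parts[1]}'")
        elif "exampleuniqueid" in parts[-1].lower():
            problems.append(f"{name}: still a placeholder")
    # Heuristic only: RDMA cluster shapes are bare metal, so flag anything else.
    if not settings.get("shape_name", "").startswith("BM."):
        problems.append("shape_name: cluster networks expect a bare metal (BM.) shape")
    return problems
```

Calling this with the program's variables before `pulumi up` and stopping when the returned list is non-empty avoids a failed deployment caused by sample OCIDs.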

After deploying this Pulumi program with `pulumi up`, you will have GPU instances connected by an RDMA network that can be used for deep learning workloads. The individual instances can then be configured with deep learning frameworks and datasets to run training jobs.
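One way to install a framework on each node at first boot is a cloud-init script passed through the instance metadata key `user_data`, which OCI expects to be base64-encoded. The sketch below only builds that string; `build_user_data` is a hypothetical helper, the package names are illustrative, and it assumes the chosen image already ships `python3` with `pip`.

```python
import base64
from typing import List


def build_user_data(pip_packages: List[str]) -> str:
    """Return a base64-encoded cloud-config that installs the given pip packages."""
    lines = [
        "#cloud-config",
        "runcmd:",
        "  - python3 -m pip install " + " ".join(pip_packages),
    ]
    script = "\n".join(lines) + "\n"
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


# The encoded string would go into the launch details of the instance
# configuration, for example:
#   "metadata": {"user_data": build_user_data(["torch"])}
```

Baking the frameworks into a custom image is the usual alternative when boot time matters, since a cloud-init install runs again on every newly created node.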