OCI Core Virtual Machines for GPU-Accelerated Machine Learning

Pulumi · Accepted Answer

To provision GPU-accelerated Virtual Machines (VMs) for Machine Learning on Oracle Cloud Infrastructure (OCI) using Pulumi, you would typically use the `oci.core.Instance` resource (`oci.Core.Instance` in the registry docs). An instance gets its GPU from the shape you select, so choosing a GPU shape provides the hardware acceleration your machine learning tasks require.

Below, I will provide a Pulumi program that sets up a GPU-accelerated VM on OCI, taking into account the following:
- The VM will be created within a compartment and availability domain that you specify.
- We will configure the VM with a GPU shape such as `VM.GPU2.1`, which provides one NVIDIA Tesla P100 GPU. GPU shape availability depends on your region and service limits.
- You will need to provide your own custom image or select an existing one that supports the software stack required for machine learning.
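
Before running the program, it can help to sanity-check the identifiers listed above. The following is a minimal sketch, not part of the Pulumi provider: `check_ocid` is a hypothetical helper that only checks the general OCID layout (`ocid1.<resource type>.<realm>.[region].<unique id>`), and it cannot tell whether the resource actually exists.

```python
def check_ocid(value: str, resource_type: str) -> str:
    """Return `value` unchanged if it looks like an OCID of `resource_type`.

    Hypothetical helper: it checks the shape of the string only.
    """
    parts = value.split(".")
    looks_valid = (
        len(parts) >= 5              # ocid1, type, realm, region (may be empty), unique id
        and parts[0] == "ocid1"
        and parts[1] == resource_type
        and parts[-1] != ""          # the unique ID must be present
    )
    if not looks_valid:
        raise ValueError(f"{value!r} does not look like a {resource_type} OCID")
    return value


# The placeholder values used in this answer pass the layout check.
compartment_id = check_ocid("ocid1.compartment.oc1..exampleuniqueID", "compartment")
image_id = check_ocid("ocid1.image.oc1..exampleuniqueID", "image")
```

Passing an image OCID where a compartment OCID is expected raises `ValueError`, which catches one of the more common copy-paste mistakes early.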

The program uses Pulumi's Oracle Cloud Infrastructure (OCI) provider (`pulumi_oci`) and is commented throughout.

### Program Explanation

1. **Compartment**: Compartments are OCI's mechanism for organizing and isolating your cloud resources. You will need to specify the ID of an existing compartment where the resources will be created.

2. **Availability Domain**: OCI's way of providing high availability. An availability domain consists of one or more data centers within a region. Multiple availability domains within a region are interconnected via a low-latency network.

3. **Virtual Cloud Network (VCN)**: To create a VM, you need a virtual network. If you do not supply the ID of an existing subnet, the program creates a VCN and a subnet for you.

4. **Compute Instance**: The VM itself. The configuration specifies the compartment, availability domain, subnet for network access, a GPU shape, and the image source for the OS and software stack.
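
One detail behind the VCN step above: the subnet's CIDR block must fall inside the VCN's CIDR block. A minimal sketch of that check using only Python's standard `ipaddress` module is shown here; `subnet_fits` is a hypothetical helper name, and the addresses are the ones used in this answer's program.

```python
import ipaddress


def subnet_fits(vcn_cidr: str, subnet_cidr: str) -> bool:
    """Return True if `subnet_cidr` lies entirely within `vcn_cidr`."""
    vcn = ipaddress.ip_network(vcn_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.subnet_of(vcn)


print(subnet_fits("10.0.0.0/16", "10.0.1.0/24"))  # True: inside the VCN range
print(subnet_fits("10.0.0.0/16", "10.1.0.0/24"))  # False: outside the VCN range
```

Running a check like this locally is cheaper than waiting for the cloud API to reject a mismatched subnet.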

Here's how you might write such a program in Python using Pulumi:

```python
import pulumi
import pulumi_oci as oci

# Configuration
# Replace these with your own OCI compartment ID and availability domain
compartment_id = 'ocid1.compartment.oc1..exampleuniqueID'
availability_domain = 'example-availability-domain'
# Specify the subnet OCID if you have an existing one, otherwise set to None
subnet_id = None
# Replace with the OCID of an image that supports NVIDIA GPUs
image_id = 'ocid1.image.oc1..exampleuniqueID'
# Shape with GPU support
instance_shape = 'VM.GPU2.1'

# Virtual Cloud Network (VCN) and Subnet - Create if not provided
if not subnet_id:
    vcn = oci.core.VirtualNetwork('gpu-ml-vcn',
        compartment_id=compartment_id,
        cidr_block='10.0.0.0/16',
        display_name='gpu-ml-vcn',
        dns_label='gpumlvcn',
        is_ipv6enabled=False,
        # For more properties, see https://www.pulumi.com/registry/packages/oci/api-docs/core/virtualnetwork/
    )

    # The subnet lives in the same branch because it depends on the VCN created above
    subnet = oci.core.Subnet('gpu-ml-subnet',
        compartment_id=compartment_id,
        vcn_id=vcn.id,
        cidr_block='10.0.1.0/24',
        display_name='gpu-ml-subnet',
        dns_label='gpumlsubnet',
        availability_domain=availability_domain,
        # For more properties, see https://www.pulumi.com/registry/packages/oci/api-docs/core/subnet/
    )
    subnet_id = subnet.id

# Compute Instance
compute_instance = oci.core.Instance('gpu-ml-instance',
    availability_domain=availability_domain,
    compartment_id=compartment_id,
    shape=instance_shape,
    # Attach the primary VNIC to the subnet
    create_vnic_details=oci.core.InstanceCreateVnicDetailsArgs(
        subnet_id=subnet_id,
    ),
    # Boot the instance from the chosen image
    source_details=oci.core.InstanceSourceDetailsArgs(
        source_type='image',
        source_id=image_id,
    ),
    display_name='ML-with-GPU',
    metadata={
        'ssh_authorized_keys': 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQ...',
    },
    # The GPU comes from the shape itself; shape_config applies only to flexible shapes
    # See https://www.pulumi.com/registry/packages/oci/api-docs/core/instance/
)

# Note: the public IP is reachable only if the subnet routes to an internet gateway,
# which this program does not create.
pulumi.export('ComputeInstanceIP', compute_instance.public_ip)
pulumi.export('ComputeInstanceID', compute_instance.id)
```

### Explanation of Pulumi Resources Used:

- `oci.core.VirtualNetwork`: Represents a Virtual Cloud Network (VCN), which lets you build your own private network within OCI.

- `oci.core.Subnet`: A subdivision within a VCN that allows you to segment your network for different parts of your cloud topology.

- `oci.core.Instance`: Represents a compute instance in OCI. This resource is used to create and manage virtual machines in your OCI environment.

In the code above, the VCN and subnet are created only if you have not supplied an existing subnet ID. The `compute_instance` is configured with a GPU shape and the image you specify for GPU-accelerated tasks. The SSH public key is a placeholder and should be replaced with your actual key to enable SSH access to the VM.
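
Rather than pasting the key into the source, you could read it from disk. The sketch below is one possible approach, not something the original program does: `load_public_key` is a hypothetical helper, and the list of accepted key types is illustrative, so check which key types your image supports.

```python
from pathlib import Path

# Illustrative list of key-type prefixes; adjust to what your image accepts.
KNOWN_KEY_TYPES = ("ssh-rsa", "ssh-ed25519", "ecdsa-sha2-nistp256")


def load_public_key(path: str) -> str:
    """Read an OpenSSH public key file and return its contents, stripped.

    Raises ValueError if the file does not start with a known key type
    followed by key data.
    """
    text = Path(path).expanduser().read_text().strip()
    fields = text.split()
    if len(fields) < 2 or fields[0] not in KNOWN_KEY_TYPES:
        raise ValueError(f"{path} does not look like an OpenSSH public key")
    return text
```

With this in place, the instance's `metadata` could be written as `{'ssh_authorized_keys': load_public_key('~/.ssh/id_rsa.pub')}`, assuming your key lives at that path.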

If you plan to create a complete environment for machine learning that includes more parameters and resources like load balancers, additional networking, or database configurations, you'll want to include those in your Pulumi program as needed.