Batch Processing AI Workloads on OCI Core Instances

Pulumi · Accepted Answer

To accomplish batch processing of AI workloads on Oracle Cloud Infrastructure (OCI), we require a setup that can scale to handle varying compute and storage demands. The workflow often includes setting up cloud instances customized with necessary hardware (like GPUs for machine learning), software environments, and the ability to process massive datasets efficiently.

In this scenario, we'll develop a Pulumi program that provisions OCI core instances tailored for AI workloads. The key components for this setup include:

1. **Compute Instances (oci.Core.Instance):** These are the virtual machines where your AI workloads will run. You can select an appropriate shape (type and quantity of CPUs, memory, and GPUs) based on the requirements of your workloads. For batch processing, you may need powerful instances with GPUs.

2. **Instance Pool (oci.Core.InstancePool):** This manages a pool of identical instances created from one instance configuration, which keeps their setup consistent. The pool can be resized manually or by attaching an autoscaling configuration that reacts to workload demand.

3. **Instance Configuration (oci.Core.InstanceConfiguration):** This is a template that describes the setup of instances you want in your pool. It includes the instance shape, attached block volumes, networking setup, and more.

4. **Block Volumes (oci.Core.Volume):** For AI workloads, you'll likely need extra storage for datasets and models, which can be attached to instances as block volumes.

5. **Networking (oci.Core.Vcn and oci.Core.Subnet):** You need a virtual cloud network (VCN) and subnets to provide network infrastructure for your instances.
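To make the networking component more concrete, here is a minimal sketch of a VCN with one subnet. The CIDR ranges, the resource names, and the helpers `subnet_fits_vcn` and `create_network` are assumptions chosen for illustration, not part of the original setup. The sketch also leaves out the internet gateway, route table, and security rules that a publicly reachable instance would need.

```python
import ipaddress

VCN_CIDR = "10.0.0.0/16"     # Assumed address range for the VCN.
SUBNET_CIDR = "10.0.1.0/24"  # Assumed address range for the instance subnet.


def subnet_fits_vcn(vcn_cidr: str, subnet_cidr: str) -> bool:
    """Return True when the subnet range lies entirely inside the VCN range."""
    vcn = ipaddress.ip_network(vcn_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.subnet_of(vcn)


def create_network(compartment_id: str):
    """Declare a VCN and one subnet; intended to be called from a Pulumi program."""
    # Imported here so the CIDR helper above can be used without the provider installed.
    from pulumi_oci import core

    if not subnet_fits_vcn(VCN_CIDR, SUBNET_CIDR):
        raise ValueError(f"{SUBNET_CIDR} is not inside {VCN_CIDR}")

    vcn = core.Vcn(
        "ai-workload-vcn",
        compartment_id=compartment_id,
        cidr_blocks=[VCN_CIDR],
        display_name="ai-workload-vcn",
    )
    subnet = core.Subnet(
        "ai-workload-subnet",
        compartment_id=compartment_id,
        vcn_id=vcn.id,
        cidr_block=SUBNET_CIDR,
        display_name="ai-workload-subnet",
    )
    return vcn, subnet
```

If you create the network this way, `subnet.id` can be passed wherever a subnet OCID is expected instead of a hard-coded value.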

In the Pulumi program example below, we'll create a simple compute instance configured for AI. If your application requires it, you would adjust this to create an instance pool using an InstanceConfiguration as a template. For simplicity, we're creating a single instance in a given availability domain and subnet. Please replace placeholder values with your actual OCI compartment, network, and machine image specifics.

```python
import pulumi
from pulumi_oci import core

# Initialize a Pulumi project with the required OCI provider configuration.
# Your OCI configuration like region, tenancy OCID, user OCID, private key path, and fingerprint should already be set up in advance.

# Define the desired OCI compute instance shape and image.
# For AI workloads, select shapes optimized for machine learning tasks, which often include GPU support.
# The image ID should be an OCI image that supports your AI workload, or a custom image that you've prepared with necessary software.
compute_shape = "VM.GPU2.1"  # Example shape with GPU.
image_id = "ocid1.image.oc1...your-image-id"

# Set up the placement and networking details.
compartment_id = "ocid1.compartment.oc1..your-compartment-id"  # Replace with your compartment OCID.
availability_domain = "your-availability-domain-name"  # Replace with an availability domain name from your tenancy.
subnet_id = "ocid1.subnet.oc1..your-subnet-id"  # Replace with your subnet OCID.

# Provision the Compute Instance.
instance = core.Instance("ai-workload-instance",
    compartment_id=compartment_id,
    availability_domain=availability_domain,  # Required when launching an instance.
    shape=compute_shape,
    source_details=core.InstanceSourceDetailsArgs(
        source_type="image",  # Assuming you're using an image; for boot volumes, change this to "bootVolume".
        source_id=image_id,  # For an instance, the image (or boot volume) OCID is passed as source_id.
    ),
    create_vnic_details=core.InstanceCreateVnicDetailsArgs(
        subnet_id=subnet_id,
        assign_public_ip="true",  # The provider expects a string; use "false" if the instance should not be reachable from the internet.
    ),
    metadata={
        "ssh_authorized_keys": "ssh-rsa AAA... your-ssh-public-key",  # Replace with your SSH public key.
    }
)

# Export the public IP of the instance to be easily accessible,
# for example, if you need to SSH into the instance for setup or debugging.
pulumi.export("instance_public_ip", instance.public_ip)
```

This program sets up a single GPU-backed compute instance with an SSH key for access. More complex setups need additional resources of the same kind, such as instance pools and block volumes for data. Treat this as a starting point: as your AI workloads grow, you may need to flesh out the networking components, add security lists or network security groups, and configure autoscaling.
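As one possible next step, the sketch below shows how an instance configuration and an instance pool might be declared for a batch run. The helper `pool_size_for`, its default cap of 8 instances, and the resource names are hypothetical. The argument class names follow the `pulumi_oci` naming scheme as I understand it, so verify them against the provider version you have installed.

```python
import math


def pool_size_for(num_jobs: int, jobs_per_instance: int, max_instances: int = 8) -> int:
    """Hypothetical sizing rule: enough instances to cover all jobs, up to a cap."""
    if jobs_per_instance <= 0:
        raise ValueError("jobs_per_instance must be positive")
    return min(max_instances, math.ceil(num_jobs / jobs_per_instance))


def create_instance_pool(compartment_id, availability_domain, subnet_id,
                         image_id, size, shape="VM.GPU2.1"):
    """Declare an instance configuration and a pool built from it."""
    # Imported here so the sizing helper can be used without the provider installed.
    from pulumi_oci import core

    # The template that every instance in the pool is launched from.
    config = core.InstanceConfiguration(
        "ai-workload-config",
        compartment_id=compartment_id,
        instance_details=core.InstanceConfigurationInstanceDetailsArgs(
            instance_type="compute",
            launch_details=core.InstanceConfigurationInstanceDetailsLaunchDetailsArgs(
                compartment_id=compartment_id,
                shape=shape,
                source_details=core.InstanceConfigurationInstanceDetailsLaunchDetailsSourceDetailsArgs(
                    source_type="image",
                    image_id=image_id,  # The configuration template takes the image OCID as image_id.
                ),
                create_vnic_details=core.InstanceConfigurationInstanceDetailsLaunchDetailsCreateVnicDetailsArgs(
                    subnet_id=subnet_id,
                ),
            ),
        ),
    )

    # The pool keeps `size` identical instances running in one availability domain.
    pool = core.InstancePool(
        "ai-workload-pool",
        compartment_id=compartment_id,
        instance_configuration_id=config.id,
        size=size,
        placement_configurations=[
            core.InstancePoolPlacementConfigurationArgs(
                availability_domain=availability_domain,
                primary_subnet_id=subnet_id,
            )
        ],
    )
    return pool
```

For example, `pool_size_for(10, 4)` gives 3 instances for ten jobs at four jobs per instance, and that value can be passed as `size`. A fixed size like this is the simplest option; an autoscaling configuration could replace it if the load varies during a run.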

Remember to replace the placeholder values with your specific OCI configuration, and ensure your Pulumi CLI is set up with the necessary access permissions for your OCI account. The instance's public IP address is exported as a stack output, which you can use to reach the instance remotely, for example over SSH.
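Since a forgotten placeholder only fails once the deployment reaches OCI, a small check before running `pulumi up` can catch it earlier. This is a rough sketch: the function `find_placeholders` is hypothetical, and the markers it looks for are simply the ones used in the placeholder strings of this answer.

```python
def find_placeholders(settings: dict[str, str]) -> list[str]:
    """Return the names of settings that still look like template placeholders."""
    markers = ("your-", "...")  # Markers taken from the placeholder strings in this answer.
    return [
        name
        for name, value in settings.items()
        if any(marker in value for marker in markers)
    ]


settings = {
    "image_id": "ocid1.image.oc1...your-image-id",
    "compartment_id": "ocid1.compartment.oc1..your-compartment-id",
    "subnet_id": "ocid1.subnet.oc1..your-subnet-id",
}

unresolved = find_placeholders(settings)
if unresolved:
    print("Replace these placeholder values before deploying:", ", ".join(unresolved))
```

The check is deliberately loose. It does not validate that a value is a well-formed OCID, only that the obvious template text has been replaced.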