1. Outbound Data Transfer for OCI Machine Learning Jobs


    In the context of Oracle Cloud Infrastructure (OCI), machine learning jobs typically involve training models, running predictions, or processing data. These jobs often need to read input data from a source and write output data to a destination, and that output step can incur outbound data transfer costs: OCI charges outbound data transfer fees when data is sent from OCI services to the internet or to other cloud providers.

    In Pulumi, managing machine learning jobs on OCI involves defining resources such as Data Science projects, jobs, and job runs with the oci Pulumi provider. You declare these resources in Python (or another supported language), and Pulumi translates your declarations into the corresponding OCI API calls.

    Below is a Pulumi program in Python that sets up a machine learning job run on OCI. The program does not explicitly manage outbound data transfer, since transfer charges are determined by OCI's pricing model rather than defined in your infrastructure code. Keep in mind, however, that OCI will bill for outbound data if your machine learning job writes its output to an external destination.

    Note: The following program assumes you have set up the Pulumi CLI and configured it with the appropriate OCI credentials. It also assumes you have already created a Data Science project and have the necessary details like projectId, compartmentId, etc.
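    If you prefer not to hardcode the placeholder OCIDs used below, one option is to read them from Pulumi stack configuration. This is a minimal sketch, assuming you have set projectId and compartmentId config keys for your stack (e.g., with pulumi config set projectId <ocid>):

    import pulumi

    # Minimal sketch: read the OCIDs from stack configuration instead of
    # hardcoding them (assumes `pulumi config set projectId ...` and
    # `pulumi config set compartmentId ...` have been run for this stack).
    config = pulumi.Config()
    project_id = config.require("projectId")
    compartment_id = config.require("compartmentId")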

    import pulumi
    import pulumi_oci as oci

    # Create an OCI Data Science Job
    # Documentation: https://www.pulumi.com/registry/packages/oci/api-docs/datascience/job/
    job = oci.datascience.Job("job",
        project_id="<your-project-id>",          # Replace with your Project ID
        compartment_id="<your-compartment-id>",  # Replace with your Compartment ID
        display_name="MyMachineLearningJob",
        job_configuration_details=oci.datascience.JobJobConfigurationDetailsArgs(
            job_type="DEFAULT",
            command_line_arguments="--input input_data.csv --output predictions.csv",
            environment_variables={
                "MODEL_FILE": "model.pkl",
            },
            maximum_runtime_in_minutes=60,
        ),
        job_infrastructure_configuration_details=oci.datascience.JobJobInfrastructureConfigurationDetailsArgs(
            shape_name="VM.Standard2.1",  # The shape for the compute instance
            block_storage_size_in_gbs=50,
            job_infrastructure_type="STANDALONE",
        ),
        # Additional properties like description, tags, etc. can be set here.
    )

    # Create a Job Run for the previously defined Job
    # Documentation: https://www.pulumi.com/registry/packages/oci/api-docs/datascience/jobrun/
    job_run = oci.datascience.JobRun("jobRun",
        job_id=job.id,
        project_id="<your-project-id>",          # Replace with your Project ID
        compartment_id="<your-compartment-id>",  # Replace with your Compartment ID
        display_name="MyJobRun",
        job_configuration_override_details=oci.datascience.JobRunJobConfigurationOverrideDetailsArgs(
            job_type="DEFAULT",
            command_line_arguments="--input input_data.csv --output predictions.csv",
            environment_variables={
                "MODEL_FILE": "model.pkl",
            },
            maximum_runtime_in_minutes=60,
        ),
        # You could set override details for logging configuration here if necessary.
    )

    # Exports the Job Run ID once it is created
    pulumi.export("job_run_id", job_run.id)

    In the above program:

    • We create a machine learning job (oci.datascience.Job) that specifies the runtime parameters, environment variables, and the job's hardware configuration (like the compute shape and block storage size).
    • Then, we initiate a job run (oci.datascience.JobRun) using the job's ID, which represents an execution instance of the job.

    Note:

    • The command_line_arguments are placeholders; replace them with the actual parameters for your job, including paths to its input and output data (a sketch of how a job script might consume them follows this list).
    • Outbound data transfer fees would apply if this job's output were written to an external destination on the internet or in another cloud provider.
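    As a purely illustrative sketch, the entry point of your job artifact might parse those command-line arguments and read the MODEL_FILE environment variable as shown below. The argument names mirror the placeholders above; the script itself is hypothetical:

    import argparse
    import os

    # Hypothetical entry point for the job artifact; the argument names match
    # the placeholder command_line_arguments in the Pulumi program above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Path to the input CSV")
    parser.add_argument("--output", required=True, help="Path for the predictions CSV")
    args = parser.parse_args()

    # MODEL_FILE is injected via the environment_variables mapping above.
    model_file = os.environ.get("MODEL_FILE", "model.pkl")
    print(f"Loading {model_file}; reading {args.input}; writing {args.output}")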

    Cost management for outbound data transfer is usually a post-deployment concern that you monitor with OCI's cost management tools. You can, however, architect your system to minimize transfer costs, for example by keeping your input/output data and processing jobs in the same region, or by factoring data transfer pricing into your choice of services and regions. One way to keep job I/O within OCI is sketched below.
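    For instance, you could stage the job's input and output in an OCI Object Storage bucket in the same compartment and region as the job, so reads and writes stay on OCI's network instead of crossing the internet. This is a minimal sketch under that assumption; the bucket name is hypothetical:

    import pulumi_oci as oci

    # Minimal sketch: co-locate job data with the job by creating an Object
    # Storage bucket in the same compartment (and region), so the job's reads
    # and writes stay within OCI rather than incurring outbound transfer fees.
    namespace = oci.objectstorage.get_namespace(compartment_id="<your-compartment-id>")
    data_bucket = oci.objectstorage.Bucket("mlDataBucket",
        compartment_id="<your-compartment-id>",  # Replace with your Compartment ID
        namespace=namespace.namespace,
        name="ml-job-data",  # Hypothetical bucket name
    )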