Automated ML Workflow Orchestration with CRDs

Question

Pulumi · Accepted Answer

In Kubernetes, Custom Resource Definitions (CRDs) extend the API with custom object types. One use is representing ML workflows as cluster resources, as part of machine learning operations (MLOps).

We can use Pulumi to define these CRDs and create instances of these custom resources, thereby enabling automated ML workflow orchestration. In this example, we'll define a CRD for a simple ML workflow and create an instance of this workflow as a custom resource in our Kubernetes cluster. We will use Pulumi's Kubernetes provider to manage the Kubernetes resources.

CRDs are powerful because they allow you to define your own "Kinds" of resources that are as fully featured as native Kubernetes kinds like Pods or Services. This means they can have their own schema, validation, and lifecycle.

This Pulumi program will:
1. Define a CRD for an ML workflow.
2. Create an instance of that CRD to instantiate a workflow.

Please note that for a real-world use case, you would need to implement the actual logic for the ML workflow either within a Kubernetes operator or within the application code running in the pods referenced by the CRD.

Let's dive into the Pulumi program:

```python
import pulumi
import pulumi_kubernetes as kubernetes

# Define the CustomResourceDefinition (CRD) for our ML workflow.
ml_workflow_crd = kubernetes.apiextensions.v1.CustomResourceDefinition(
    "mlWorkflow",
    metadata=kubernetes.meta.v1.ObjectMetaArgs(
        name="mlworkflows.sample.pulumi.com",
    ),
    spec=kubernetes.apiextensions.v1.CustomResourceDefinitionSpecArgs(
        group="sample.pulumi.com",
        versions=[kubernetes.apiextensions.v1.CustomResourceDefinitionVersionArgs(
            name="v1",
            served=True,
            storage=True,
            schema=kubernetes.apiextensions.v1.CustomResourceValidationArgs(
                # Define the openAPIV3Schema for the Custom Resources that will be using this CRD.
                open_apiv3_schema=kubernetes.apiextensions.v1.JSONSchemaPropsArgs(
                    type="object",
                    properties={
                        "spec": kubernetes.apiextensions.v1.JSONSchemaPropsArgs(
                            type="object",
                            properties={
                                "modelType": kubernetes.apiextensions.v1.JSONSchemaPropsArgs(type="string"),
                                "trainingData": kubernetes.apiextensions.v1.JSONSchemaPropsArgs(type="string"),
                            },
                            required=["modelType", "trainingData"],
                        ),
                    },
                ),
            ),
        )],
        scope="Namespaced",
        names=kubernetes.apiextensions.v1.CustomResourceDefinitionNamesArgs(
            plural="mlworkflows",
            singular="mlworkflow",
            kind="MLWorkflow",
            short_names=["mlwf"],
        ),
    )
)

# Define the instance of the Custom Resource (CR) using the newly created CRD.
ml_workflow_instance = kubernetes.apiextensions.CustomResource(
    "mlWorkflowInstance",
    api_version="sample.pulumi.com/v1",
    kind="MLWorkflow",
    metadata=kubernetes.meta.v1.ObjectMetaArgs(
        name="example-mlworkflow",
    ),
    # The Python CustomResource class takes the resource body via `spec`.
    spec={
        # These values would be set by the user to configure the ML workflow.
        "modelType": "RandomForest",
        "trainingData": "s3://my-bucket/my-training-data",
    },
    opts=pulumi.ResourceOptions(depends_on=[ml_workflow_crd])
)

# Export the name of the ML workflow instance.
pulumi.export('ml_workflow_instance_name', ml_workflow_instance.metadata["name"])
```

This program starts by importing Pulumi and Pulumi's Kubernetes SDK. We then define the CRD for our `MLWorkflow` kind with the group `sample.pulumi.com` and version `v1`. Under that version's `schema`, we specify the OpenAPI v3 schema for the custom resource: its `spec` is an object with two required string fields, `modelType` and `trainingData`.
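To make the effect of that schema concrete, here is a minimal sketch of the checks it implies, written as plain Python. It is an illustration only: the real validation is done by the Kubernetes API server, and the function name `validate_ml_workflow` and the error strings are hypothetical.

```python
from typing import Any, Dict, List

# Mirrors the `required` list in the CRD schema above (illustrative only).
REQUIRED_SPEC_FIELDS = ("modelType", "trainingData")


def validate_ml_workflow(resource: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the object passes."""
    spec = resource.get("spec")
    if spec is None:
        # The schema above does not list `spec` as required at the top level,
        # so an object without `spec` is not rejected by these rules.
        return []
    if not isinstance(spec, dict):
        return ["spec: must be an object"]

    errors: List[str] = []
    for field in REQUIRED_SPEC_FIELDS:
        if field not in spec:
            errors.append(f"spec.{field}: required field is missing")
        elif not isinstance(spec[field], str):
            errors.append(f"spec.{field}: must be a string")
    return errors


# Hypothetical objects used to exercise the checks.
valid = {"spec": {"modelType": "RandomForest",
                  "trainingData": "s3://my-bucket/my-training-data"}}
missing_data = {"spec": {"modelType": "RandomForest"}}

print(validate_ml_workflow(valid))
print(validate_ml_workflow(missing_data))
```

Note that an object with no `spec` at all passes these rules. If that is not intended, adding `required=["spec"]` to the top-level schema object would be one way to tighten it.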

After defining the CRD, we create an instance of it with `kind: MLWorkflow`, providing a specific `modelType` and a reference to `trainingData`. We ensure this custom resource is created after the CRD is applied to the cluster by using `opts=pulumi.ResourceOptions(depends_on=[ml_workflow_crd])`.

The `pulumi.export` statement at the end is used to output the name of the ML workflow instance once it's deployed. This is useful when you want to use this information programmatically or reference it in other parts of your infrastructure setup.

Keep in mind that this is a foundational setup. On its own, the custom resource is only stored by the API server; nothing acts on it. For a full implementation, you would need to write a controller (operator) that handles the logic for these custom resources, or manage the workflows they represent manually.
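As a rough idea of what the core of such a controller might do, the sketch below turns one `MLWorkflow` object into a `batch/v1` Job manifest. It is a pure function over plain dictionaries, not a working operator: watching the API and applying the manifest are left out. The function name `build_training_job`, the container image, and the environment variable names are all assumptions made for illustration.

```python
from typing import Any, Dict


def build_training_job(workflow: Dict[str, Any],
                       image: str = "example.com/ml-trainer:latest") -> Dict[str, Any]:
    """Translate one MLWorkflow object into a batch/v1 Job manifest (plain dict).

    `image` is a hypothetical trainer image; replace it with your own.
    """
    meta = workflow["metadata"]
    spec = workflow["spec"]
    name = meta["name"]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"{name}-train",
            "namespace": meta.get("namespace", "default"),
            "labels": {"mlworkflow": name},
        },
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        # Pass the workflow settings to the (hypothetical) trainer.
                        "env": [
                            {"name": "MODEL_TYPE", "value": spec["modelType"]},
                            {"name": "TRAINING_DATA", "value": spec["trainingData"]},
                        ],
                    }],
                },
            },
        },
    }


# The custom resource from the program above, as the API server would return it.
example = {
    "apiVersion": "sample.pulumi.com/v1",
    "kind": "MLWorkflow",
    "metadata": {"name": "example-mlworkflow"},
    "spec": {"modelType": "RandomForest",
             "trainingData": "s3://my-bucket/my-training-data"},
}
job = build_training_job(example)
print(job["metadata"]["name"])
```

A real controller would run this kind of translation inside a reconcile loop, creating the Job when a new `MLWorkflow` appears and reporting progress back on the resource's status.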