1. Storing Training Datasets for Azure Machine Learning


    To store training datasets for Azure Machine Learning, you can use Azure Machine Learning Datasets to manage and version your training data. Datasets provide a way to create, register, and retrieve the data used in machine learning experiments, ensuring data versioning and reproducibility.

    Here is a step-by-step program that demonstrates how to create and register a dataset:

    1. Create a Machine Learning Workspace: This is the foundational service for all machine learning operations. You'll need this before you can work with datasets.

    2. Create a Datastore: Datastores store connection information to your storage so that you don't need to hardcode this sensitive information into your script. A datastore can point to a blob container, an ADLS Gen2 filesystem, or another supported storage service where your data is located.

    3. Create and Register a Machine Learning Dataset: Once we have a workspace and a datastore, we can define a dataset that points to data within the datastore and register it with the workspace so it can be referenced in machine learning jobs.
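    For blob-backed data, the dataset path in step 3 is simply the HTTPS URL of the blob. A minimal sketch of how such a URL is composed (the account, container, and file names below are illustrative placeholders):

```python
# Sketch: compose the HTTPS URL of a blob to use as a dataset path.
# The account, container, and blob names are illustrative placeholders.
def blob_url(account: str, container: str, blob_path: str) -> str:
    return f"https://{account}.blob.core.windows.net/{container}/{blob_path}"

url = blob_url("your_storage_account", "your_container", "data/train.csv")
print(url)  # https://your_storage_account.blob.core.windows.net/your_container/data/train.csv
```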

    Below is the Python program that sets up these resources using Pulumi with the azure-native provider.

    import pulumi
    import pulumi_azure_native.machinelearningservices as ml

    # Replace these variables with your own desired names and Azure resource values
    resource_group_name = 'your_resource_group'
    workspace_name = 'your_workspace'
    datastore_name = 'your_datastore'
    dataset_name = 'your_dataset'

    # 1. Create a Machine Learning Workspace
    workspace = ml.Workspace("workspace",
        resource_group_name=resource_group_name,
        workspace_name=workspace_name,
        location="eastus",  # choose the appropriate region
        sku=ml.SkuArgs(
            name="Basic",  # use "Enterprise" for more features at additional cost
        ),
    )

    # 2. Create a Datastore
    # Assumes the storage account and container already exist and that you
    # have their names as well as the access key.
    storage_account_name = "your_storage_account"
    storage_container_name = "your_container"
    storage_account_key = "your_storage_account_key"

    datastore = ml.Datastore("datastore",
        resource_group_name=resource_group_name,
        workspace_name=workspace.name,  # reference the workspace created above
        datastore_name=datastore_name,
        datastore_properties=ml.DatastorePropertiesResourceArgs(
            data_store_type="AzureBlob",
            account_name=storage_account_name,
            container_name=storage_container_name,
            account_key=storage_account_key,
        ),
    )

    # 3. Create and Register a Machine Learning Dataset
    dataset = ml.MachineLearningDataset("dataset",
        resource_group_name=resource_group_name,
        workspace_name=workspace.name,
        dataset_name=dataset_name,
        parameters=ml.DatasetArgs(
            path=ml.DatasetPathArgs(
                http_url="http://example.com/path/to/your/data.csv",  # update with the path to your data
            ),
        ),
    )

    # Export the IDs of the created resources
    pulumi.export('workspace_id', workspace.id)
    pulumi.export('datastore_id', datastore.id)
    pulumi.export('dataset_id', dataset.id)

    Remember to replace placeholder values like 'your_resource_group', 'your_storage_account', and 'your_storage_account_key' with actual values you have configured on your Azure account.
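    One simple way to avoid hardcoding those placeholders is to read them from environment variables. A sketch under that assumption (the variable names are arbitrary choices, not anything Azure or Pulumi requires):

```python
import os

# Read a setting from an environment variable, falling back to a default.
# The variable names used here are arbitrary; pick whatever suits your setup.
def setting(name: str, default: str) -> str:
    return os.environ.get(name, default)

resource_group_name = setting("AZURE_RESOURCE_GROUP", "your_resource_group")
storage_account_key = setting("AZURE_STORAGE_KEY", "your_storage_account_key")
```

    For secrets such as the storage account key, Pulumi's own configuration system (pulumi config set --secret) is the more idiomatic choice, since it stores the value encrypted in the stack configuration.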

    This program sets up a machine learning workspace, a datastore, and a dataset within Azure Machine Learning. You can then use this dataset in your machine learning experiments and models. This setup ensures that you have version control over your datasets, which is beneficial for reproducibility and auditing purposes.

    After running this Pulumi program, the created resources can be referenced by their IDs in your machine learning workflows. Make sure you have installed the Pulumi CLI and configured it to use your Azure credentials. You can then run this program as a Python script after installing the Pulumi Azure Native SDK via pip (pip install pulumi-azure-native).
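    Assuming a fresh project and a stack named "dev" (both names are arbitrary), the end-to-end setup might look like this sketch:

```shell
# Install the Python SDKs (the Pulumi CLI itself is installed separately,
# e.g. via the installer script or your package manager).
pip install pulumi pulumi-azure-native

pulumi login            # connect to a Pulumi state backend
pulumi stack init dev   # create a stack; "dev" is an arbitrary name
pulumi up               # preview and deploy the resources
```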