Cross-Account Machine Learning Data Sharing

Pulumi · Accepted Answer

To share machine learning (ML) data across accounts in a cloud environment, you configure permissions and resource sharing so that one account can access ML datasets and resources owned by another. The details depend on the cloud provider; this answer uses AWS, one of the most common providers for such use cases.

In AWS, you can use AWS Identity and Access Management (IAM) to control permissions and Amazon S3 to store the data. AWS Resource Access Manager (RAM) shares supported resource types between accounts, but standard S3 buckets are not among them; cross-account bucket access is granted through IAM roles and bucket policies instead. Additionally, if your ML workloads are orchestrated through Amazon SageMaker, you can take advantage of its specific cross-account capabilities.

Let's walk through a simple Pulumi program to set this up:

1. **IAM Roles and Policies**: Define IAM roles with policies that grant the necessary permissions for cross-account access.
2. **S3 Buckets**: Establish an S3 bucket where the ML data will be stored.
3. **Resource Sharing**: Grant the other AWS account access to the bucket. Standard S3 buckets cannot be shared through AWS RAM, so this is done with an S3 bucket policy.

This program assumes your Pulumi CLI is configured with credentials for the data provider account that have the necessary permissions to create and manage these resources.

```python
import json

import pulumi
import pulumi_aws as aws

# The account IDs for both the data provider and data consumer accounts.
provider_account_id = '123456789012'  # Replace with your provider account ID
consumer_account_id = '210987654321'  # Replace with your consumer account ID

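# Create an IAM role in the provider account that the consumer account can assume.
# The account IDs are plain strings, so the policy can be built directly with json.dumps.
cross_account_role = aws.iam.Role('crossAccountRole',
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    # Trusting the account root delegates the decision of who may
                    # assume the role to IAM policies in the consumer account.
                    "AWS": f"arn:aws:iam::{consumer_account_id}:root"
                },
                "Action": "sts:AssumeRole",
            }
        ]
    })
)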

# Attach an inline policy to the IAM role that allows read access to the S3 bucket
s3_access_policy = aws.iam.RolePolicy('s3AccessPolicy',
    role=cross_account_role.name,
    policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    # s3:GetObject applies to objects, s3:ListBucket to the bucket itself
                    f"arn:aws:s3:::ml-data-bucket-{provider_account_id}/*",
                    f"arn:aws:s3:::ml-data-bucket-{provider_account_id}"
                ]
            }
        ]
    })
)

# Create an S3 bucket to hold ML data
ml_data_bucket = aws.s3.Bucket('mlDataBucket',
    bucket=f"ml-data-bucket-{provider_account_id}"
)

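# Share the S3 bucket with the consumer account through a bucket policy.
# (Standard S3 buckets are not a resource type that AWS RAM can share.)
bucket_policy = aws.s3.BucketPolicy('mlDataBucketPolicy',
    bucket=ml_data_bucket.id,
    policy=ml_data_bucket.arn.apply(lambda arn: json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{consumer_account_id}:root"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [arn, f"{arn}/*"]
        }]
    }))
)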

# Outputs that provide necessary information and resource identifiers
pulumi.export('provider_account_id', provider_account_id)
pulumi.export('consumer_account_id', consumer_account_id)
pulumi.export('cross_account_role', cross_account_role.arn)
pulumi.export('ml_data_bucket', ml_data_bucket.bucket)
```

This Pulumi program takes the following steps:

- It sets up an IAM role with a trust policy that allows the consumer account to assume the role. This role will be used by entities in the consumer account to interact with resources in the provider account.
- It attaches an inline policy to the IAM role that allows reading objects from, and listing, the S3 bucket that contains the shared ML data.
- It provisions an S3 bucket for storing the ML data.
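- It shares the S3 bucket with the consumer account. Standard S3 buckets are not a resource type that AWS RAM supports, so cross-account access to the bucket is granted through a bucket policy or through the assumable role described above.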
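
The trust policy in the first step trusts the consumer account as a whole, so the consumer account still has to allow its own users or roles to call `sts:AssumeRole` on the role. A minimal sketch of that consumer-side identity policy follows; the function name and the example role ARN are hypothetical, and the real ARN would come from the `cross_account_role` stack output:

```python
import json


def consumer_assume_role_policy(role_arn: str) -> str:
    """Build an identity policy that lets a consumer-account principal assume the shared role."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": role_arn,
            }
        ],
    }, indent=2)


# Placeholder ARN for illustration; substitute the value exported by the provider stack.
example_policy = consumer_assume_role_policy(
    "arn:aws:iam::123456789012:role/crossAccountRole-0000000"
)
print(example_policy)
```

The resulting JSON would be attached to a user, group, or role in the consumer account.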

Make sure to replace the placeholder account IDs with actual AWS account IDs involved in the data sharing process. After deploying this Pulumi program, you will need to configure the consumer account to assume the IAM role and access the S3 bucket according to the permissions you've defined.
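
On the consumer side, assuming the role and reading from the bucket could look roughly like the sketch below. It uses `boto3` (not part of the Pulumi program above); the helper names, the session name, and the placeholder role ARN are assumptions made for illustration, and the clients are passed in so the helpers can be exercised without AWS access:

```python
def temporary_credentials(sts_client, role_arn: str, session_name: str = "ml-data-consumer") -> dict:
    """Assume the shared role and return keyword arguments for boto3.client(...)."""
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def list_keys(s3_client, bucket: str) -> list:
    """Return the object keys on the first page of a bucket listing."""
    response = s3_client.list_objects_v2(Bucket=bucket)
    return [obj["Key"] for obj in response.get("Contents", [])]


def main() -> None:
    """Run with consumer-account credentials available in the environment."""
    import boto3  # imported here so the helpers above stay usable without boto3 installed

    role_arn = "arn:aws:iam::123456789012:role/crossAccountRole-0000000"  # placeholder
    bucket = "ml-data-bucket-123456789012"  # bucket name from the provider stack
    s3 = boto3.client("s3", **temporary_credentials(boto3.client("sts"), role_arn))
    print(list_keys(s3, bucket))
```

Calling `main()` from a consumer-account session lists the shared objects using the temporary credentials returned by STS.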

Please note, this is a basic setup for demonstration purposes. Depending on your use case, you might need to add more configurations, like adjusting bucket policies for finer access control, setting up encryption for data at rest, or configuring additional AWS services.
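
As one illustration of finer access control, a bucket policy can limit the consumer account to a single key prefix and reject requests that do not use TLS. The sketch below only builds the policy document; the function name, the statement IDs, and the `datasets/` prefix are hypothetical, and the resulting string could be passed as the `policy` of an `aws.s3.BucketPolicy` resource:

```python
import json


def restricted_bucket_policy(bucket_name: str, consumer_account_id: str, prefix: str) -> str:
    """Build a bucket policy that shares only one key prefix and denies non-TLS requests."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    consumer_root = f"arn:aws:iam::{consumer_account_id}:root"
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                # Objects under the shared prefix may be read
                "Sid": "ReadSharedPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": consumer_root},
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/{prefix}*",
            },
            {
                # Listing is allowed only when the request is scoped to the prefix
                "Sid": "ListSharedPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": consumer_root},
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}*"}},
            },
            {
                # Any request made without TLS is rejected
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    })


policy_json = restricted_bucket_policy("ml-data-bucket-123456789012", "210987654321", "datasets/")
print(policy_json)
```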