1. Secure Model Training Data Ingestion with AWS Transfer

    To set up a secure model training data ingestion system using AWS Transfer Family, you would typically use AWS Transfer for SFTP, a fully managed service that provides secure file transfer capabilities. It lets you run an SFTP-enabled server to which your clients can connect and upload training data securely over an encrypted channel.

    Here are the steps we will follow in our Pulumi program to implement a secure data ingestion system:

    1. Set up an AWS Transfer for SFTP server.
    2. Create a user for the SFTP server with an associated IAM role for secure access.
    3. Configure a workflow to trigger processing when new data is uploaded (this can be expanded upon as needed; a hedged sketch appears at the end of this section).
    4. Ensure that the uploaded data is stored securely in an S3 bucket.

    The following Pulumi program, written in Python, creates these resources.

    import pulumi
    import pulumi_aws as aws

    # Create an S3 bucket to store ingested data files.
    data_bucket = aws.s3.Bucket("data-bucket")

    # Create an IAM role that the AWS Transfer server will assume.
    transfer_server_role = aws.iam.Role("TransferServerRole",
        assume_role_policy="""{
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {"Service": "transfer.amazonaws.com"},
                    "Action": "sts:AssumeRole"
                }
            ]
        }"""
    )

    # Attach a policy to the role to grant read/write access to the S3 bucket.
    bucket_access_policy = aws.iam.RolePolicy("BucketAccessPolicy",
        role=transfer_server_role.id,
        policy=data_bucket.arn.apply(lambda arn: f"""{{
            "Version": "2012-10-17",
            "Statement": [
                {{
                    "Effect": "Allow",
                    "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                    "Resource": "{arn}"
                }},
                {{
                    "Effect": "Allow",
                    "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                    "Resource": "{arn}/*"
                }}
            ]
        }}""")
    )

    # Create the AWS Transfer for SFTP server.
    sftp_server = aws.transfer.Server("SftpServer",
        protocols=["SFTP"],
        identity_provider_type="SERVICE_MANAGED",
        endpoint_type="PUBLIC",
        logging_role=transfer_server_role.arn,
        tags={
            "Name": "MyTransferServer",
        }
    )

    # Create an AWS Transfer user. The home directory is built with .apply()
    # because the bucket name is a Pulumi Output, not a plain string.
    transfer_user = aws.transfer.User("TransferUser",
        role=transfer_server_role.arn,
        server_id=sftp_server.id,
        user_name="data-uploader",
        home_directory=data_bucket.bucket.apply(lambda name: f"/{name}"),
    )

    # Register the user's SSH public key. With SERVICE_MANAGED identity,
    # keys are attached via a separate aws.transfer.SshKey resource.
    transfer_user_key = aws.transfer.SshKey("TransferUserKey",
        server_id=sftp_server.id,
        user_name=transfer_user.user_name,
        body="ssh-rsa AAAAB3N... user@example.com",
    )

    # Export the SFTP server endpoint and the bucket name where uploaded data will be stored.
    pulumi.export("sftp_server_endpoint", sftp_server.endpoint)
    pulumi.export("data_bucket_name", data_bucket.bucket)

    In this program:

    • We create an S3 bucket that will be used to store the training data (data_bucket).
    • We create an IAM role (transfer_server_role) for the AWS Transfer server to assume. An attached policy (bucket_access_policy) grants it the basic S3 operations it needs: list, get, put, and delete.
    • We set up an AWS Transfer for SFTP server (sftp_server) configured with SFTP as the protocol, service-managed authentication, and a public endpoint, and we associate the IAM role created earlier for logging.
    • We create a user (transfer_user) for the SFTP server with a specified username, register their public SSH key for authentication via a separate aws.transfer.SshKey resource, and set the user's home directory to the root of the S3 bucket created earlier for data storage. (If you would rather confine uploaders to a sub-prefix, see the sketch after this list.)
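
    If each uploader should see only a dedicated prefix rather than the whole bucket, aws.transfer.User also supports logical home directories. The following is a minimal sketch of that variant; the uploads/ prefix, username, and resource names are illustrative, not part of the program above.

    # Hypothetical variant: confine a user to an "uploads/" prefix using a
    # logical home directory. Names and the prefix are illustrative.
    restricted_user = aws.transfer.User("RestrictedTransferUser",
        role=transfer_server_role.arn,
        server_id=sftp_server.id,
        user_name="restricted-uploader",
        home_directory_type="LOGICAL",
        home_directory_mappings=[aws.transfer.UserHomeDirectoryMappingArgs(
            entry="/",  # the path the user sees as their root
            target=data_bucket.bucket.apply(lambda b: f"/{b}/uploads"),
        )],
    )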

    After you run the Pulumi program with pulumi up, you will get the SFTP server endpoint and the name of the S3 bucket as outputs. You can then share the endpoint with your clients, who can securely upload the model training data using an SFTP client such as FileZilla or WinSCP, with the username data-uploader and the private key corresponding to the public key you provided. Uploads can also be scripted, as sketched below.
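
    For clients who prefer scripting over a GUI, something like the following paramiko sketch could work. The endpoint hostname, key path, and file names are placeholders you would substitute with your own values (the real endpoint comes from the sftp_server_endpoint stack output).

    # A minimal client-side upload sketch using paramiko (placeholder values).
    import paramiko

    host = "s-1234567890abcdef0.server.transfer.us-east-1.amazonaws.com"  # from the Pulumi output
    key = paramiko.RSAKey.from_private_key_file("/path/to/private_key")   # key matching the registered public key

    transport = paramiko.Transport((host, 22))
    transport.connect(username="data-uploader", pkey=key)
    sftp = paramiko.SFTPClient.from_transport(transport)

    # Upload a local training-data file into the user's home directory.
    sftp.put("training_data.csv", "training_data.csv")

    sftp.close()
    transport.close()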

    Remember to replace ssh-rsa AAAAB3N... user@example.com with the actual SSH public key of the user who will be uploading the data. If you would rather not hardcode the key in source, you could read it from Pulumi configuration instead, as sketched below.
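
    One way to keep the key out of the program itself is Pulumi's configuration system. This is a minimal sketch; the sshPublicKey config key name is an assumption, not something the program above defines.

    # Hypothetical alternative: read the public key from Pulumi config instead
    # of hardcoding it. Set it first with:
    #   pulumi config set sshPublicKey "ssh-rsa AAAA... user@example.com"
    config = pulumi.Config()
    ssh_public_key = config.require("sshPublicKey")  # "sshPublicKey" is an illustrative name

    transfer_user_key = aws.transfer.SshKey("TransferUserKey",
        server_id=sftp_server.id,
        user_name=transfer_user.user_name,
        body=ssh_public_key,
    )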

    This setup ensures that your model training data is ingested into your AWS environment over an encrypted channel and stored in S3. From there, you can set up further AWS services to process the data and train your models as needed; one possible starting point is sketched below.
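
    As one possible expansion of step 3, the following hedged sketch sends a message to an SQS queue whenever a new object lands in the bucket, so a downstream training pipeline can pick it up. The queue, its policy, and all resource names here are assumptions layered on top of the program above, not part of it.

    # Hypothetical extension: notify an SQS queue on every new upload so a
    # downstream process can kick off training-data preparation.
    ingest_queue = aws.sqs.Queue("ingest-queue")

    # Allow S3 (acting for our bucket) to send messages to the queue.
    queue_policy = aws.sqs.QueuePolicy("ingest-queue-policy",
        queue_url=ingest_queue.id,
        policy=pulumi.Output.all(ingest_queue.arn, data_bucket.arn).apply(
            lambda args: f"""{{
                "Version": "2012-10-17",
                "Statement": [{{
                    "Effect": "Allow",
                    "Principal": {{"Service": "s3.amazonaws.com"}},
                    "Action": "sqs:SendMessage",
                    "Resource": "{args[0]}",
                    "Condition": {{"ArnEquals": {{"aws:SourceArn": "{args[1]}"}}}}
                }}]
            }}"""
        ),
    )

    # Fire a notification for every object created in the bucket.
    upload_notification = aws.s3.BucketNotification("upload-notification",
        bucket=data_bucket.id,
        queues=[aws.s3.BucketNotificationQueueArgs(
            queue_arn=ingest_queue.arn,
            events=["s3:ObjectCreated:*"],
        )],
        opts=pulumi.ResourceOptions(depends_on=[queue_policy]),
    )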