1. Distributed File Storage for AI Research Workloads using AWS FSx

    Creating distributed file storage for AI research workloads on AWS FSx provides the high-performance computing environment these intensive tasks require. AWS FSx for Lustre is an optimal choice for this purpose: it is a fully managed file system optimized for compute-intensive workloads such as high-performance computing (HPC), machine learning, and data analytics.

    AWS FSx for Lustre integrates with Amazon S3, allowing you to access and process S3 data with high throughput and low latency. It is designed for workloads that need fast storage: a file system can hold millions of files and deliver up to hundreds of gigabytes per second of aggregate throughput.

    Below you'll find a Pulumi Python program that sets up an AWS FSx for Lustre file system. This program performs the following actions:

    1. Creates an AWS FSx for Lustre file system that's linked to an S3 bucket for data import/export.
    2. Sets up a security group to control access to the file system.
    3. Outputs the DNS name of the file system, which can be used to mount the file system on your compute instances.

    Please ensure you have the AWS CLI configured with the required permissions and Pulumi CLI installed to run this program.

    Let's begin with the Pulumi program:

    ```python
    import pulumi
    import pulumi_aws as aws

    # Create an S3 bucket to store AI workload data. This bucket is linked to
    # FSx as its data repository for imports and exports.
    ai_data_bucket = aws.s3.Bucket("aiDataBucket")

    # A security group to control access to the FSx file system.
    # By default, this example allows all outbound traffic and should be
    # configured to allow traffic from specific sources only.
    fsx_security_group = aws.ec2.SecurityGroup("fsxSecurityGroup",
        description="Allow access to FSx",
        egress=[{
            "cidr_blocks": ["0.0.0.0/0"],
            "from_port": 0,
            "to_port": 0,
            "protocol": "-1",
        }],
    )

    # Create an FSx for Lustre file system linked to the S3 bucket.
    fsx_file_system = aws.fsx.LustreFileSystem("fsxLustreAiFileSystem",
        storage_capacity=1200,  # Minimum storage capacity for SSD is 1.2 TiB (1200 GiB)
        subnet_ids=[aws_subnet_id],  # Replace aws_subnet_id with the ID of your chosen VPC subnet
        security_group_ids=[fsx_security_group.id],
        # The bucket name is a Pulumi Output, so build the S3 URIs with Output.concat.
        import_path=pulumi.Output.concat("s3://", ai_data_bucket.bucket),  # S3 path to import data from
        export_path=pulumi.Output.concat("s3://", ai_data_bucket.bucket),  # S3 path FSx exports data to
        deployment_type="PERSISTENT_1",  # PERSISTENT_1 is used for longer-term data storage
        per_unit_storage_throughput=50,  # Required for PERSISTENT_1: throughput in MB/s per TiB of storage
    )

    # Output the DNS name of the file system, which can be used to mount it.
    pulumi.export("fsx_dns_name", fsx_file_system.dns_name)
    ```

    The aws.fsx.LustreFileSystem resource creates a new Lustre file system in FSx. The storage_capacity parameter specifies the size of the file system; here it is set to 1200 GiB, the minimum for the SSD storage options. Because the deployment type is PERSISTENT_1, per_unit_storage_throughput is also required; it sets how much throughput (in MB/s) is provisioned per TiB of storage. Ensure that the subnet ID provided in subnet_ids is a subnet in the same VPC where your AI compute instances are launched.
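
    Rather than hardcoding the subnet ID, you can read it from your stack's configuration. A minimal sketch, assuming a hypothetical config key named fsxSubnetId:

    ```python
    import pulumi
    import pulumi_aws as aws

    # Read the target subnet ID from stack configuration instead of hardcoding it.
    # "fsxSubnetId" is a hypothetical key; set it with:
    #   pulumi config set fsxSubnetId subnet-0123456789abcdef0
    config = pulumi.Config()
    aws_subnet_id = config.require("fsxSubnetId")

    # Optionally look the subnet up to confirm it exists and record its VPC.
    subnet = aws.ec2.get_subnet(id=aws_subnet_id)
    pulumi.export("fsx_vpc_id", subnet.vpc_id)
    ```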

    The security group fsx_security_group is very basic: it allows all outbound traffic and defines no inbound rules, so Lustre clients will not be able to reach the file system until you add ingress rules. You should tighten it to match your security requirements, as in the sketch below.
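
    For a tighter but functional setup, you can allow inbound Lustre traffic (TCP port 988, plus 1018-1023) only from within your VPC. A minimal sketch, assuming a VPC CIDR of 10.0.0.0/16 (substitute your own):

    ```python
    import pulumi_aws as aws

    # A more restrictive security group: only Lustre traffic (TCP 988 and
    # 1018-1023) from within the VPC is allowed in. The 10.0.0.0/16 CIDR is an
    # assumption; substitute your VPC's actual CIDR block.
    fsx_security_group = aws.ec2.SecurityGroup("fsxSecurityGroup",
        description="Allow Lustre traffic from the VPC only",
        ingress=[
            {
                "cidr_blocks": ["10.0.0.0/16"],
                "from_port": 988,
                "to_port": 988,
                "protocol": "tcp",
            },
            {
                "cidr_blocks": ["10.0.0.0/16"],
                "from_port": 1018,
                "to_port": 1023,
                "protocol": "tcp",
            },
        ],
        egress=[{
            "cidr_blocks": ["0.0.0.0/0"],
            "from_port": 0,
            "to_port": 0,
            "protocol": "-1",
        }],
    )
    ```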

    The import_path and export_path parameters link your FSx file system to the specified S3 bucket, enabling seamless data import from and export to that bucket. Because the bucket name is a Pulumi Output rather than a plain string, the S3 URIs are assembled with pulumi.Output.concat.
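
    Both paths also accept a key prefix if you only want part of the bucket attached. A short sketch, using a hypothetical training-data prefix:

    ```python
    import pulumi

    # Scope the data repository to a key prefix rather than the whole bucket.
    # "training-data" is a hypothetical prefix; Output.concat is needed because
    # the bucket name is a Pulumi Output, not a plain string. The export path
    # must point into the same bucket as the import path.
    import_path = pulumi.Output.concat("s3://", ai_data_bucket.bucket, "/training-data")
    export_path = pulumi.Output.concat("s3://", ai_data_bucket.bucket, "/training-data")
    ```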

    After deploying this program with Pulumi, you will get the DNS name of the FSx file system as an output, which can then be used to mount the file system on EC2 instances or your on-premises servers.
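
    Mounting a Lustre file system also requires its mount name, which the resource exposes as an output. A small sketch that assembles the full client mount command for convenience (the /fsx mount point is an assumption):

    ```python
    import pulumi

    # Build the command a Lustre client would run to mount the file system.
    # Both dns_name and mount_name are outputs of the LustreFileSystem resource;
    # /fsx is a hypothetical mount point on the client instance.
    mount_command = pulumi.Output.concat(
        "sudo mount -t lustre ",
        fsx_file_system.dns_name, "@tcp:/", fsx_file_system.mount_name,
        " /fsx",
    )
    pulumi.export("fsx_mount_command", mount_command)
    ```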

    Please replace aws_subnet_id with the actual ID of your preferred subnet where the AI workloads will run. Ensure this subnet is in the appropriate VPC and has the necessary route table and network ACL configurations.