Automated Dataset Versioning with AWS S3 Inventory

Pulumi · Accepted Answer

To version datasets automatically and keep track of them with Amazon S3 Inventory, create an S3 bucket with versioning enabled and attach an inventory configuration to it. Versioning is what preserves every version of an object. S3 Inventory reports on those versions: on a schedule, it delivers a CSV, ORC, or Parquet listing of the objects in the bucket, and it can be configured to list every version of each object rather than only the current one.

Here's how you can set this up using Pulumi with Python:

1. **Create an S3 bucket**: This bucket will store your datasets.
2. **Enable versioning**: This allows you to keep multiple versions of an object in one bucket.
3. **Configure an S3 Inventory**: This tells AWS to generate a report with all the objects and their versions in the specified bucket.
4. **Define the destination for Inventory reports**: Specifies where inventory reports are delivered.

Let's walk through the Pulumi code to set this up:

```python
import pulumi
import pulumi_aws as aws

# Steps 1 and 2: Create a versioned S3 bucket to store your datasets.
dataset_bucket = aws.s3.Bucket("datasetBucket",
    versioning=aws.s3.BucketVersioningArgs(
        enabled=True  # Keep every version of each object in the bucket.
    )
)

# Step 4: Create the destination bucket for the inventory reports.
# A separate bucket keeps the reports out of the dataset bucket.
inventory_bucket = aws.s3.Bucket("inventoryBucket")

# Step 3: Configure the inventory. See the Pulumi Registry for all options:
# https://www.pulumi.com/registry/packages/aws/api-docs/s3/inventory/
inventory_configuration = aws.s3.Inventory("inventoryConfiguration",
    bucket=dataset_bucket.id,  # Bucket to be inventoried.
    destination=aws.s3.InventoryDestinationArgs(
        bucket=aws.s3.InventoryDestinationBucketArgs(
            format="CSV",  # Report format: "CSV", "ORC", or "Parquet".
            bucket_arn=inventory_bucket.arn,  # ARN of the destination bucket.
            # Report encryption (SSE-S3 or SSE-KMS) can be set here via `encryption`.
        ),
    ),
    schedule=aws.s3.InventoryScheduleArgs(
        frequency="Daily"  # How frequently the inventory should be generated.
    ),
    included_object_versions="All",  # To inventory all versions of each object.
    enabled=True,  # Set this to true to enable the inventory configuration.
)

# Export the bucket names for easy access. Neither bucket is configured as a
# website, so there is no website endpoint to export.
pulumi.export("dataset_bucket_name", dataset_bucket.id)
pulumi.export("inventory_bucket_name", inventory_bucket.id)
```
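
One thing to check before relying on the reports: S3 can deliver inventory files only if the destination bucket's policy allows the `s3.amazonaws.com` service principal to write to it. The sketch below builds such a policy document with the standard library only. The function name, the bucket ARNs, and the account ID are hypothetical placeholders, and the statement follows the general shape AWS documents for inventory destinations, so treat it as a starting point rather than a finished policy.

```python
import json


def inventory_destination_policy(source_bucket_arn, destination_bucket_arn, account_id):
    """Return a JSON bucket policy letting S3 Inventory write reports.

    All three arguments are supplied by the caller; nothing is looked up in AWS.
    """
    statement = {
        "Sid": "AllowInventoryReportDelivery",
        "Effect": "Allow",
        # S3 itself writes the reports, so the principal is the S3 service.
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "s3:PutObject",
        "Resource": f"{destination_bucket_arn}/*",
        "Condition": {
            # Only accept deliveries that originate from the inventoried bucket.
            "ArnLike": {"aws:SourceArn": source_bucket_arn},
            "StringEquals": {
                "aws:SourceAccount": account_id,
                "s3:x-amz-acl": "bucket-owner-full-control",
            },
        },
    }
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2)


# Placeholder values for illustration only.
example_policy = inventory_destination_policy(
    "arn:aws:s3:::example-datasets",
    "arn:aws:s3:::example-inventory",
    "123456789012",
)
print(example_policy)
```

In a Pulumi program the bucket ARNs are outputs, so a function like this would typically be called inside `pulumi.Output.all(dataset_bucket.arn, inventory_bucket.arn).apply(...)`, with the resulting string passed as the `policy` of an `aws.s3.BucketPolicy` attached to the inventory bucket.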

This code creates both S3 buckets and an inventory configuration that reports daily on the dataset bucket. Bucket versioning is what keeps every version of each dataset file; the inventory does not create versions, it lists them. Because `included_object_versions` is set to "All", each report covers every version of each object rather than only the current one, and the reports are delivered to the destination bucket.
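
To show how a delivered report might be used, here is a minimal sketch that groups the rows of a CSV inventory file by object key. CSV inventory files have no header row; the column order comes from the `fileSchema` field of the accompanying `manifest.json`. The schema string and rows below are made-up sample data, the helper name `versions_by_key` is hypothetical, and downloading and decompressing the report (and URL-decoding key names) is left out.

```python
import csv
import io
from collections import defaultdict


def versions_by_key(file_schema, csv_text):
    """Group inventory rows by object key.

    file_schema: the comma-separated column list from manifest.json.
    csv_text: the decompressed contents of one inventory CSV file.
    """
    columns = [name.strip() for name in file_schema.split(",")]
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=columns)
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["Key"]].append({
            "version_id": row["VersionId"],
            "is_latest": row["IsLatest"].lower() == "true",
            "is_delete_marker": row["IsDeleteMarker"].lower() == "true",
        })
    return dict(grouped)


# Made-up sample data in the shape of a versioned-bucket inventory report.
SAMPLE_SCHEMA = "Bucket, Key, VersionId, IsLatest, IsDeleteMarker"
SAMPLE_ROWS = (
    '"example-datasets","train.csv","v2","true","false"\n'
    '"example-datasets","train.csv","v1","false","false"\n'
    '"example-datasets","labels.csv","v1","true","false"\n'
)

snapshot = versions_by_key(SAMPLE_SCHEMA, SAMPLE_ROWS)

# Map each key to the version ID that was current when the report ran.
latest = {
    key: next(v["version_id"] for v in versions if v["is_latest"])
    for key, versions in snapshot.items()
}
print(latest)
```

Storing a mapping like `latest` alongside an experiment is one way to pin the exact dataset versions that were current on a given day.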

Make sure AWS credentials are configured on the machine where Pulumi runs. After you deploy this program with `pulumi up`, your AWS account will have a dataset bucket with versioning enabled, a destination bucket for reports, and an inventory configuration that lists every object version.