1. Sensitive Data Identification with Amazon Macie (macie2) for AI Compliance


    Amazon Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect sensitive data in AWS. As an AI-driven service, Macie automates the discovery of sensitive data at scale and lowers the cost of protecting your data.

    To achieve sensitive data identification with AWS Macie for AI compliance using Pulumi, you'll need to set up several resources:

    1. Macie Account – A resource that represents your Amazon Macie account.
    2. Custom Data Identifier – Defines the criteria (using regex patterns, keywords, etc.) that Macie uses to identify sensitive data.
    3. Classification Job – A one-time or scheduled task that Macie performs to scan for sensitive data in specified S3 buckets.

    The following Pulumi Python program sets up Macie for sensitive data identification:

    import pulumi
    import pulumi_aws as aws

    # Enable Macie for this account and publish findings every 15 minutes.
    macie_account = aws.macie2.Account(
        "macie-account",
        status="ENABLED",
        finding_publishing_frequency="FIFTEEN_MINUTES",
    )

    # Create a custom data identifier that detects sensitive data via a regex
    # pattern, with supporting keywords as additional criteria.
    custom_data_identifier = aws.macie2.CustomDataIdentifier(
        "custom-data-identifier",
        name="SensitiveDataIdentifier",
        description="Detect sensitive data like credit card numbers",
        regex="(\\d{4}-){3}\\d{4}",  # matches a credit-card-like format, e.g. 1234-5678-9012-3456
        keywords=["confidential", "SSN"],
        maximum_match_distance=50,  # max characters between a keyword and a regex match
    )

    # Look up an existing S3 bucket to scan. If you don't have one, create it with Pulumi instead.
    s3_bucket = aws.s3.Bucket.get("example-bucket", "example-bucket-name")

    # Create a one-time classification job that scans the bucket using the custom identifier.
    classification_job = aws.macie2.ClassificationJob(
        "classification-job",
        job_type="ONE_TIME",  # can be "ONE_TIME" or "SCHEDULED"
        custom_data_identifier_ids=[custom_data_identifier.id],
        s3_job_definition=aws.macie2.ClassificationJobS3JobDefinitionArgs(
            bucket_definitions=[
                aws.macie2.ClassificationJobS3JobDefinitionBucketDefinitionArgs(
                    account_id=macie_account.id,  # the Macie account ID is the AWS account ID
                    buckets=[s3_bucket.id],
                )
            ]
        ),
        initial_run=True,
        job_status="RUNNING",
        # The job can only be created once Macie is enabled for the account.
        opts=pulumi.ResourceOptions(depends_on=[macie_account]),
    )

    pulumi.export("macie_account_id", macie_account.id)
    pulumi.export("custom_data_identifier_id", custom_data_identifier.id)
    pulumi.export("classification_job_id", classification_job.id)

    In this program, we start by enabling Amazon Macie for the account using the aws.macie2.Account resource from Pulumi's AWS provider, and set the frequency at which findings are published.

    Then, we create a CustomDataIdentifier resource that uses a regex pattern to identify what looks like a credit card number and includes certain keywords as additional criteria for detecting sensitive data.
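    To see what this identifier's criteria mean in practice, here is a small local sketch that mirrors them with Python's re module. The looks_sensitive helper is hypothetical (it is not part of Macie or Pulumi) and it simplifies maximum_match_distance to a plain character-distance check between a keyword and a regex match:

```python
import re

# Same regex and keywords as the Macie custom data identifier above.
PATTERN = re.compile(r"(\d{4}-){3}\d{4}")
KEYWORDS = ["confidential", "SSN"]
MAX_DISTANCE = 50  # simplified stand-in for maximum_match_distance

def looks_sensitive(text: str) -> bool:
    """Flag text containing a credit-card-like number near a keyword."""
    match = PATTERN.search(text)
    if not match:
        return False
    lowered = text.lower()
    for keyword in KEYWORDS:
        idx = lowered.find(keyword.lower())
        if idx != -1 and abs(idx - match.start()) <= MAX_DISTANCE:
            return True
    return False

print(looks_sensitive("confidential: card 1234-5678-9012-3456"))  # True
print(looks_sensitive("order id 1234-5678-9012-3456"))            # False (no nearby keyword)
```

    Macie's actual matching is richer than this (it also supports ignore words and scans many file formats), but the sketch shows why the keywords and maximum_match_distance arguments matter: the regex alone would flag any dash-separated 16-digit number.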

    After that, we retrieve an existing S3 bucket (assuming one has already been provisioned) and create a ClassificationJob. The job uses the custom data identifier to scan the S3 bucket(s) for sensitive data. In this example, the job is configured to run once (ONE_TIME); if you want a recurring job, change job_type to SCHEDULED and configure a schedule frequency.
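    As a sketch of the scheduled variant (assuming the same account, identifier, and bucket definition as the one-time job; the daily frequency is chosen only for illustration):

```python
# Sketch only: a daily scheduled variant of the classification job.
scheduled_job = aws.macie2.ClassificationJob(
    "scheduled-classification-job",
    job_type="SCHEDULED",
    schedule_frequency=aws.macie2.ClassificationJobScheduleFrequencyArgs(
        daily_schedule=True,  # alternatively weekly_schedule or monthly_schedule
    ),
    custom_data_identifier_ids=[custom_data_identifier.id],
    s3_job_definition=...,  # same bucket definition as the one-time job above
    job_status="RUNNING",
)
```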

    Finally, we export the IDs of the created resources so that their status can be tracked and referenced outside of Pulumi, if necessary.

    This program must be run in an environment with the Pulumi CLI installed and AWS credentials configured with the permissions needed to manage Macie and S3 resources. Once executed, the Pulumi program automates the setup of Amazon Macie for sensitive data identification, helping to maintain AI compliance.