Incremental Data Ingestion for ML with AWS Glue Crawler

Pulumi · Accepted Answer

To set up incremental data ingestion for a machine learning (ML) application, we can use AWS Glue, a fully managed extract, transform, and load (ETL) service. The component that matters most here is the AWS Glue Crawler. A crawler scans a data store such as an S3 prefix, infers the schema, and creates or updates tables in the AWS Glue Data Catalog. The crawler does not copy any data itself; it keeps the catalog metadata (tables, schemas, partitions) in step with the files that have landed, so that downstream ETL jobs and ML pipelines can find the new data. This is helpful for ML applications whose datasets grow continually over time.

To accomplish this with Pulumi in Python, we will define a Pulumi program that:

1. Sets up an AWS Glue Crawler to scan an S3 data source.
2. Optionally, creates a schedule for the crawler to run periodically.
3. Creates a Glue Database to organize the metadata of the detected schemas.
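4. Uses an AWS Glue Trigger to start a follow-up action when the crawler finishes. Triggers fire on a schedule, on demand, or on crawler and job state; they do not watch the data source directly.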

Below is the Python program that uses Pulumi to set up incremental data ingestion using AWS Glue for machine learning purposes.

```python
import pulumi
import pulumi_aws as aws

# Create an AWS Glue Catalog Database where the metadata of crawled data will be stored.
glue_database = aws.glue.CatalogDatabase("ml_data_catalog_database",
    name="ml_data_database")

# The role that will be used by AWS Glue Crawler needs appropriate permissions to access
# the S3 data source and update the AWS Glue Data Catalog.
glue_role = aws.iam.Role("glue_crawler_role",
    assume_role_policy="""{
        "Version": "2012-10-17",
        "Statement": [{
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            }
        }]
    }""")

# Attach the AWS managed Glue service policy to the role.
# Read access to the data bucket itself typically has to be granted separately.
glue_policy_attachment = aws.iam.RolePolicyAttachment("glue_crawler_role_policy_attachment",
    role=glue_role.name,
    policy_arn=aws.iam.ManagedPolicy.AWS_GLUE_SERVICE_ROLE)

# Define the S3 target where the source data for ingestion is located.
s3_target = aws.glue.CrawlerS3TargetArgs(
    path="s3://my-ml-data-bucket/raw-data/"
)

# Create the Glue Crawler that will get metadata from the S3 data source.
glue_crawler = aws.glue.Crawler("ml_data_crawler",
    database_name=glue_database.name,
    role=glue_role.arn,
    s3_targets=[s3_target])

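# Optionally, define a crawler that runs on a schedule. Note that this is a second,
# separate crawler on the same S3 target; to schedule the crawler above instead,
# pass the `schedule` argument to it and drop this resource.
# The example below runs once a day at 12:00 UTC. The schedule uses Glue's cron format.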
# Refer to the AWS docs to customize this schedule https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
glue_crawler_schedule = aws.glue.Crawler("scheduled_ml_data_crawler",
    database_name=glue_database.name,
    role=glue_role.arn,
    s3_targets=[s3_target],
    schedule="cron(0 12 * * ? *)")

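# Create a conditional trigger that fires when the crawler finishes successfully.
# Glue triggers react to crawler or job state, not to changes in S3 itself.
# The action starts a downstream ETL job. "ml-data-etl-job" is a placeholder name for
# a Glue job that is assumed to exist already. Pointing the action back at the same
# crawler would rerun it after every successful crawl.
glue_trigger = aws.glue.Trigger("ml_data_crawler_trigger",
    type="CONDITIONAL",
    actions=[aws.glue.TriggerActionArgs(
        job_name="ml-data-etl-job"
    )],
    predicate=aws.glue.TriggerPredicateArgs(
        conditions=[aws.glue.TriggerPredicateConditionArgs(
            crawl_state="SUCCEEDED",
            crawler_name=glue_crawler.name
        )]
    ))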

# Export the Glue Crawler name and Glue Database name for easy reference.
pulumi.export("glue_crawler_name", glue_crawler.name)
pulumi.export("glue_database_name", glue_database.name)
```
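The `schedule` string in the program above uses Glue's six-field cron syntax (minutes, hours, day-of-month, month, day-of-week, year, evaluated in UTC). As a small illustration, the helper below builds a once-a-day expression. `glue_daily_cron` is a hypothetical name introduced here, not part of Pulumi or AWS.

```python
def glue_daily_cron(hour: int, minute: int = 0) -> str:
    """Build a Glue schedule expression that fires once a day at hour:minute UTC."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be in 0..59")
    # Field order: minutes, hours, day-of-month, month, day-of-week, year.
    # Day-of-month and day-of-week cannot both be "*", so day-of-week is "?".
    return f"cron({minute} {hour} * * ? *)"


# Reproduces the schedule used in the program above.
print(glue_daily_cron(12))
```

The result can be passed directly as the `schedule` argument of `aws.glue.Crawler`.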

The Pulumi program above does the following:

- Creates an AWS Glue Catalog Database to hold the tables that the crawler creates.
- Creates an IAM role that the Glue service can assume and attaches the AWS managed `AWSGlueServiceRole` policy, which covers Glue and Data Catalog operations. Read access to your own data bucket typically has to be granted in addition.
- Specifies the S3 target that contains the raw data files.
- Defines a Glue Crawler that crawls the S3 target and stores the discovered metadata in the Glue Catalog Database.
- Optionally defines a second crawler on the same target with a cron schedule; in this example it runs once per day at 12:00 UTC.
- Sets up a conditional Glue Trigger that starts an ETL job or another action when its condition is met, for example when the crawler completes successfully.
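As written, the crawler uses Glue's default recrawl behavior, which rescans the entire S3 prefix on every run. The sketch below shows the settings that should make crawls incremental, built as plain Python data so they can be inspected without deploying anything. The helper name `incremental_crawler_args` is an assumption made for illustration. The `CRAWL_NEW_FOLDERS_ONLY` and `LOG` values are Glue API values, and to our knowledge Glue rejects a new-folders-only crawler unless both schema change behaviors are set to `LOG`.

```python
from typing import Any, Dict


def incremental_crawler_args(s3_path: str) -> Dict[str, Any]:
    """Return keyword arguments for a crawler that only visits new S3 folders."""
    return {
        "s3_targets": [{"path": s3_path}],
        # Skip folders that were already crawled in earlier runs.
        "recrawl_policy": {"recrawl_behavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # With new-folders-only crawling, schema changes are logged rather
        # than applied to existing tables.
        "schema_change_policy": {
            "update_behavior": "LOG",
            "delete_behavior": "LOG",
        },
    }


args = incremental_crawler_args("s3://my-ml-data-bucket/raw-data/")
print(args["recrawl_policy"])
```

These arguments could then be passed along as `aws.glue.Crawler("ml_data_crawler", database_name=glue_database.name, role=glue_role.arn, **args)`, assuming your `pulumi_aws` version accepts plain dictionaries for nested inputs. If it does not, wrap the values in `aws.glue.CrawlerS3TargetArgs`, `aws.glue.CrawlerRecrawlPolicyArgs`, and `aws.glue.CrawlerSchemaChangePolicyArgs` instead.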

Note that this requires AWS credentials with sufficient permissions and an S3 bucket that already exists. Make sure the path in `s3_target` matches where your ML data is stored: replace `"s3://my-ml-data-bucket/raw-data/"` with your actual bucket and prefix.
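On the permissions point: as far as we know, the `AWSGlueServiceRole` managed policy only allows object reads on buckets and prefixes whose names contain `aws-glue-`, so the crawler role usually needs an extra policy for your own data bucket. Below is a minimal sketch of such a policy document. The function name `crawler_s3_read_policy` is hypothetical, and the set of actions is an assumption that you may need to widen (for example for KMS-encrypted buckets).

```python
import json


def crawler_s3_read_policy(bucket: str, prefix: str) -> str:
    """Return an IAM policy document (JSON) granting read access to one S3 prefix."""
    prefix = prefix.strip("/")
    # Object-level ARN: the whole bucket if no prefix is given.
    object_arn = (
        f"arn:aws:s3:::{bucket}/{prefix}/*" if prefix else f"arn:aws:s3:::{bucket}/*"
    )
    statements = [
        # Read the data files under the crawled prefix.
        {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": object_arn},
        # List the bucket so the crawler can discover folders and files.
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{bucket}",
        },
    ]
    return json.dumps({"Version": "2012-10-17", "Statement": statements}, indent=2)


print(crawler_s3_read_policy("my-ml-data-bucket", "raw-data/"))
```

The returned string can be attached to the role from the program with something like `aws.iam.RolePolicy("glue_crawler_s3_read", role=glue_role.id, policy=crawler_s3_read_policy("my-ml-data-bucket", "raw-data/"))`, where the resource name is again a placeholder.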