1. Preparing Datasets for LLMs with AWS Glue Crawler


    AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. When training or fine-tuning large language models (LLMs), it's common to have an extensive dataset that must be processed and made available to the model. One of the tools AWS Glue provides for this process is the AWS Glue Crawler.

    The AWS Glue Crawler scans your data sources and creates metadata tables in the AWS Glue Data Catalog, which is a centralized metadata repository. These metadata tables can then be used for data discovery, ETL, and querying transformed data through Amazon Athena, Amazon Redshift Spectrum, and other services.
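    As an illustration of that last point, once the crawler has populated the Data Catalog, the resulting tables can be queried with Amazon Athena. The sketch below only builds the keyword arguments for boto3's Athena `start_query_execution` call; the database, table, and results-bucket names are hypothetical placeholders, and no AWS call is made.

```python
# Sketch: querying a crawled Data Catalog table through Amazon Athena.
# The database, table, and results-bucket names below are placeholders.

def build_athena_query_params(database: str, table: str, output_bucket: str) -> dict:
    """Build the keyword arguments for boto3's Athena start_query_execution call."""
    return {
        "QueryString": f'SELECT * FROM "{database}"."{table}" LIMIT 10;',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            "OutputLocation": f"s3://{output_bucket}/athena-results/",
        },
    }

params = build_athena_query_params(
    "my_glue_database", "my_dataset_table", "my-results-bucket"
)
print(params["QueryString"])

# With AWS credentials configured, the parameters would be passed along as:
#   boto3.client("athena").start_query_execution(**params)
```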

    Below is the Pulumi Python program that will create an AWS Glue Crawler resource. This program assumes that you have already set up the necessary AWS credentials and Pulumi configurations for deploying AWS resources. The program sets up a Crawler for a hypothetical S3 data source (where your datasets may be stored) and configures it with the necessary AWS IAM role to allow Glue to access the S3 resources.

    Remember to replace the placeholder values such as s3_path_to_your_data, your_glue_service_role, and the account ID with your actual S3 dataset path, IAM role, and AWS account ID, respectively.
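    Whatever role you supply, it must trust the AWS Glue service so the crawler can assume it. Below is a minimal sketch of that trust policy, expressed as a Python dict; the permissions policies you attach to the role (for example, S3 read access and the AWS-managed Glue service policy) are separate and not shown here.

```python
import json

# Trust policy allowing the AWS Glue service to assume the crawler's IAM role.
glue_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(glue_trust_policy, indent=2))
```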

    Let's look at the code to create an AWS Glue Crawler:

```python
import pulumi
import pulumi_aws as aws

# Create an AWS Glue Crawler to scan your S3 data source and populate the
# Data Catalog. You'll need an IAM role with the necessary permissions
# for the Crawler.
glue_crawler = aws.glue.Crawler(
    "my-dataset-crawler",
    # Assign the role that has the necessary permissions for accessing
    # the S3 data source
    role="arn:aws:iam::<account-id>:role/your_glue_service_role",
    # Specify your S3 data source path
    s3_targets=[{
        "path": "s3://bucket-name/s3_path_to_your_data",
    }],
    # Set the database within which the metadata tables will be created
    database_name="my_glue_database",
    # Optional configurations can be added below, such as classifiers,
    # a schedule, a schema change policy, etc.
    # Schedule to run the Crawler every day at 12:00 UTC (cron format)
    schedule="cron(0 12 * * ? *)",
    # On schema changes, log deleted objects rather than dropping their
    # tables, and update existing tables in the Data Catalog in place
    schema_change_policy={
        "delete_behavior": "LOG",
        "update_behavior": "UPDATE_IN_DATABASE",
    },
)

# The crawler can now populate the AWS Glue Data Catalog with tables based
# on your S3 datasets. These tables can be used by ETL jobs, DataBrew
# recipes, and queried directly with Amazon Athena.

# Export the name of the crawler
pulumi.export("crawler_name", glue_crawler.name)
```

    This program will deploy an AWS Glue Crawler named my-dataset-crawler with the specified configurations. You can further customize the crawler with various options like classifiers, different types of data targets (including JDBC targets, DynamoDB targets, and more), and a schema change policy based on your requirements.
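    To make the alternative data targets concrete, the sketch below shows what JDBC and DynamoDB target configurations could look like in the same Pulumi program, as plain dicts that would be passed to `aws.glue.Crawler` via `jdbc_targets` and `dynamodb_targets`. The connection and table names are hypothetical placeholders.

```python
# Sketch: alternative crawler target types (all names are placeholders).

# A JDBC target references an AWS Glue Connection and a path within the
# remote database; "%" crawls every table under that schema.
jdbc_targets = [{
    "connection_name": "my-jdbc-connection",
    "path": "my_database/%",
}]

# A DynamoDB target simply names the table to crawl.
dynamodb_targets = [{
    "path": "my-dynamodb-table",
}]

print(jdbc_targets[0]["path"], dynamodb_targets[0]["path"])
```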

    Once the crawler runs (on the schedule you define, or when triggered manually), the metadata tables it creates are stored in the specified Glue Data Catalog database, my_glue_database. These tables contain the schema information of your datasets and can be used for querying and further processing the data.
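    A note on the schedule format: Glue cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year). A small helper, shown here purely as an illustrative sketch, builds the daily expression used in the program above:

```python
def daily_glue_cron(hour: int, minute: int = 0) -> str:
    """Return a Glue cron expression that fires once a day at the given UTC time."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Glue cron fields: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour} * * ? *)"

print(daily_glue_cron(12))  # → cron(0 12 * * ? *), the schedule used above
```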

    For more information about the Glue Crawler resource and its properties, you can refer to the AWS Glue Crawler documentation.