How do I implement AWS Glue Crawlers with Pulumi?
To implement AWS Glue Crawlers with Pulumi, we will create a Glue Crawler that can catalog data stored in an S3 bucket. AWS Glue Crawlers help automate the process of populating the AWS Glue Data Catalog with metadata tables.
Here’s the detailed Pulumi program written in TypeScript:
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
// Create an S3 bucket to store data
const dataBucket = new aws.s3.Bucket("dataBucket");
// Create an IAM role for the Glue Crawler
const glueRole = new aws.iam.Role("glueRole", {
    assumeRolePolicy: {
        Version: "2012-10-17",
        Statement: [{
            Action: "sts:AssumeRole",
            Effect: "Allow",
            Principal: {
                Service: "glue.amazonaws.com",
            },
        }],
    },
});
// Attach the AWS Glue service policy to the role
const gluePolicyAttachment = new aws.iam.RolePolicyAttachment("gluePolicyAttachment", {
    role: glueRole.name,
    policyArn: "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
});
// Create a Glue Database
const glueDatabase = new aws.glue.CatalogDatabase("glueDatabase", {
    name: "example_database",
});
// Create a Glue Crawler
const glueCrawler = new aws.glue.Crawler("glueCrawler", {
    role: glueRole.arn,
    databaseName: glueDatabase.name,
    s3Targets: [{
        // Glue expects an s3:// URI here, not the bucket ARN
        path: pulumi.interpolate`s3://${dataBucket.bucket}`,
    }],
    schedule: "cron(0 12 * * ? *)", // Run daily at 12:00 UTC
    classifiers: [],
    configuration: JSON.stringify({
        Version: 1.0,
        CrawlerOutput: {
            Partitions: {
                AddOrUpdateBehavior: "InheritFromTable",
            },
        },
    }),
    schemaChangePolicy: {
        deleteBehavior: "LOG",
        updateBehavior: "UPDATE_IN_DATABASE",
    },
});
// Export the name of the S3 bucket and Glue Crawler
export const bucketName = dataBucket.bucket;
export const crawlerName = glueCrawler.name;
In this program, we:
- Created an S3 bucket to store the data that the Glue Crawler will catalog.
- Created an IAM role that the Glue service can assume and attached the AWS managed AWSGlueServiceRole policy to it.
- Created a Glue Database to store the metadata.
- Created a Glue Crawler with a schedule to run daily at 12 PM UTC, targeting the S3 bucket created earlier.
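The schedule string in the last step follows Glue's six-field cron format (minutes, hours, day-of-month, month, day-of-week, year). As a minimal sketch, a daily schedule could be generated rather than written by hand. The helper name `dailyGlueSchedule` is hypothetical and not part of Pulumi or the AWS provider:

```typescript
// Hypothetical helper (not a Pulumi or AWS API): builds a Glue schedule
// expression that fires once a day at the given UTC time.
// Glue cron fields: minutes hours day-of-month month day-of-week year.
function dailyGlueSchedule(hourUtc: number, minuteUtc: number = 0): string {
    if (!Number.isInteger(hourUtc) || hourUtc < 0 || hourUtc > 23) {
        throw new RangeError(`hourUtc must be an integer in 0..23, got ${hourUtc}`);
    }
    if (!Number.isInteger(minuteUtc) || minuteUtc < 0 || minuteUtc > 59) {
        throw new RangeError(`minuteUtc must be an integer in 0..59, got ${minuteUtc}`);
    }
    // "?" in the day-of-week field means "no specific value",
    // which is required because day-of-month is already "*".
    return `cron(${minuteUtc} ${hourUtc} * * ? *)`;
}

console.log(dailyGlueSchedule(12));    // prints "cron(0 12 * * ? *)"
console.log(dailyGlueSchedule(3, 30)); // prints "cron(30 3 * * ? *)"
```

The first call reproduces the value used for the crawler's `schedule` property in the program.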
With this setup, the Glue Crawler runs once a day and updates the Glue Data Catalog with metadata for the data stored in the S3 bucket.
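One caveat about the IAM step: as far as I know, the managed AWSGlueServiceRole policy only allows S3 object reads in buckets whose names start with `aws-glue-`, so the crawler role may also need read access to the data bucket itself. The sketch below builds such a policy document as a plain object. The helper name `crawlerReadPolicy`, the local interfaces, and the example bucket ARN are assumptions for illustration:

```typescript
// Minimal local shape of an IAM policy document, declared here so the
// sketch has no dependencies.
interface PolicyStatement {
    Effect: "Allow" | "Deny";
    Action: string[];
    Resource: string[];
}

interface PolicyDocument {
    Version: "2012-10-17";
    Statement: PolicyStatement[];
}

// Hypothetical helper: read-only access to a single bucket for a crawler role.
function crawlerReadPolicy(bucketArn: string): PolicyDocument {
    return {
        Version: "2012-10-17",
        Statement: [
            {
                // Listing is granted on the bucket itself
                Effect: "Allow",
                Action: ["s3:ListBucket"],
                Resource: [bucketArn],
            },
            {
                // Reading is granted on the objects inside the bucket
                Effect: "Allow",
                Action: ["s3:GetObject"],
                Resource: [`${bucketArn}/*`],
            },
        ],
    };
}

// Example ARN is a placeholder, not a real bucket.
console.log(JSON.stringify(crawlerReadPolicy("arn:aws:s3:::example-data-bucket"), null, 2));
```

In the program above, this document could be attached with an `aws.iam.RolePolicy` whose `role` is `glueRole.id` and whose `policy` is `dataBucket.arn.apply(arn => JSON.stringify(crawlerReadPolicy(arn)))`.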