1. Scheduled Machine Learning Data Pipeline Runs Using AWS Glue Triggers


    When dealing with large machine learning workflows, data pipelines are crucial for processing and moving data between different storage and computation services. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. You can create, schedule, and run your ETL jobs with AWS Glue. A key feature is the AWS Glue Trigger, which allows you to schedule jobs or invoke them based on certain conditions.

    The following Pulumi Python program will demonstrate how to schedule a machine learning data pipeline using AWS Glue. The main components used will include:

    • aws.glue.Trigger: To manage the conditions under which your ETL job runs. You can create triggers that fire based on a schedule or when certain events occur.
    • aws.glue.Crawler: To populate the AWS Glue Data Catalog with tables inferred from your data source's schema.
    • aws.glue.Job: Defined by a script that processes your data and can be initiated by an event trigger or on a schedule.

    Here is a program that sets up a scheduled AWS Glue Trigger that runs an ETL job tailored for a machine learning data pipeline:

```python
import pulumi
import pulumi_aws as aws

# First, let's define the IAM role that AWS Glue will assume to perform
# tasks on your behalf.
glue_role = aws.iam.Role("glue_role",
    assume_role_policy="""{
        "Version": "2012-10-17",
        "Statement": [{
            "Action": "sts:AssumeRole",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Effect": "Allow"
        }]
    }"""
)

# Attach the managed policy that allows Glue to interact with other AWS services.
policy_attachment = aws.iam.RolePolicyAttachment("role_policy_attachment",
    role=glue_role.name,
    policy_arn=aws.iam.ManagedPolicy.AWS_GLUE_SERVICE_ROLE.value
)

# Create the Glue Data Catalog database the crawler writes into;
# it must exist before the crawler runs.
catalog_database = aws.glue.CatalogDatabase("ml_data", name="ml_data")

# Define a crawler to populate data schemas into the AWS Glue Data Catalog.
# Note that the crawler resource takes the role via the `role` argument.
crawler = aws.glue.Crawler("crawler",
    role=glue_role.arn,
    database_name=catalog_database.name,
    s3_targets=[{
        "path": "s3://my-bucket/raw-data",
    }]
)

# Define the AWS Glue job. The script location and additional job parameters
# would be configured here based on your ML use case and data formats.
job = aws.glue.Job("job",
    role_arn=glue_role.arn,
    command={
        # Script in the S3 bucket that Glue will execute; it should contain
        # the ML pipeline processing logic.
        "scriptLocation": "s3://my-bucket/scripts/my-glue-job.py",
    },
    default_arguments={
        "--TempDir": "s3://my-bucket/temp-dir",
        # Assuming the script is written in Python.
        "--job-language": "python",
    }
)

# Schedule the trigger to run the job every day at midnight UTC,
# using 'cron' syntax to specify the schedule.
trigger = aws.glue.Trigger("trigger",
    type="SCHEDULED",
    schedule="cron(0 0 * * ? *)",
    actions=[{"jobName": job.name}]
)

# Optionally, export the Glue job name for reference.
pulumi.export("glue_job_name", job.name)
```

    In the program above, we created an IAM role that AWS Glue can assume to perform operations, and attached the AWSGlueServiceRole managed policy that allows Glue to interact with other AWS services. Replace my-bucket and the paths with your actual S3 bucket and file locations.
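As a quick sanity check before deploying, the trust policy is ordinary JSON and can be validated standalone, for example to confirm that only the Glue service can assume the role:

```python
import json

# The same trust policy string used for the Glue role above.
trust_policy = """{
    "Version": "2012-10-17",
    "Statement": [{
        "Action": "sts:AssumeRole",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Effect": "Allow"
    }]
}"""

# Parse it and confirm the statement grants sts:AssumeRole to the Glue service only.
policy = json.loads(trust_policy)
statement = policy["Statement"][0]
assert statement["Principal"]["Service"] == "glue.amazonaws.com"
assert statement["Action"] == "sts:AssumeRole"
print("Trust policy OK")
```

Catching a malformed policy locally is cheaper than a failed pulumi up.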

    The crawler scans the specified S3 path, infers schemas from the data it finds, and stores them in the Glue Data Catalog, which acts as a metadata repository.
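To build intuition for what the crawler produces, here is a toy, purely illustrative sketch of schema inference over sample records. This is not how Glue works internally, and the column names are made up for the example:

```python
def infer_schema(records):
    """Infer a column -> type-name mapping from sample records (toy version)."""
    schema = {}
    for record in records:
        for column, value in record.items():
            # Map Python types to simple catalog-style type names.
            inferred = {int: "bigint", float: "double",
                        str: "string", bool: "boolean"}.get(type(value), "string")
            # Widen bigint to double if a column mixes ints and floats.
            if schema.get(column) == "bigint" and inferred == "double":
                schema[column] = "double"
            elif column not in schema:
                schema[column] = inferred
    return schema

sample = [
    {"user_id": 1, "score": 0.87, "label": "spam"},
    {"user_id": 2, "score": 0.12, "label": "ham"},
]
print(infer_schema(sample))
# → {'user_id': 'bigint', 'score': 'double', 'label': 'string'}
```

The real crawler does this at scale across files and formats, and the resulting table definitions are what your Glue job later reads from the catalog.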

    The job represents your ETL logic, written in a Python script stored on S3 and referenced by the scriptLocation property.
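The actual script would typically use Glue's PySpark APIs, but the row-level cleaning at the heart of an ML pipeline can be sketched independently of Glue. The field names here are assumptions for illustration only:

```python
def clean_record(record):
    """Normalize one raw record for ML feature extraction (illustrative only)."""
    cleaned = {
        # Lowercase and strip free-text fields so downstream encoding is consistent.
        "label": record.get("label", "").strip().lower(),
        # Numeric fields default to None when missing or unparseable.
        "score": None,
    }
    try:
        cleaned["score"] = float(record.get("score", ""))
    except ValueError:
        pass
    return cleaned

print(clean_record({"label": "  Spam ", "score": "0.9"}))
# → {'label': 'spam', 'score': 0.9}
```

In the real job, logic like this would be applied to each row of the DataFrame built from the catalog tables the crawler produced.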

    Finally, we created a trigger which runs the job according to a specified schedule. The cron(0 0 * * ? *) expression we provided schedules the job to run every day at midnight UTC. You should modify this schedule based on your processing needs.
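AWS cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year) rather than standard cron's five. A small helper makes the schedule above easier to read and adapt:

```python
def describe_aws_cron(expression):
    """Split an AWS six-field cron(...) expression into named fields."""
    fields = expression.removeprefix("cron(").removesuffix(")").split()
    names = ["minutes", "hours", "day_of_month", "month", "day_of_week", "year"]
    if len(fields) != len(names):
        raise ValueError("AWS cron expressions must have exactly six fields")
    return dict(zip(names, fields))

# The schedule used by the trigger above: every day at midnight UTC.
print(describe_aws_cron("cron(0 0 * * ? *)"))
# → {'minutes': '0', 'hours': '0', 'day_of_month': '*',
#    'month': '*', 'day_of_week': '?', 'year': '*'}
```

For example, changing the first two fields to 30 and 6 would run the job daily at 06:30 UTC instead.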

    To set up this infrastructure, ensure you have Pulumi installed and configured to access your AWS account. After you paste the code into a __main__.py file in a new Pulumi project, run pulumi up to preview and deploy the resources.