1. Airflow Pipelines for Preprocessing and Feature Engineering on AWS MWAA


    To set up Airflow pipelines for preprocessing and feature engineering, you would use AWS Managed Workflows for Apache Airflow (MWAA). This service simplifies running Apache Airflow on AWS without needing to manage the underlying infrastructure. With AWS MWAA, you can focus on developing your DAGs (Directed Acyclic Graphs), the sets of tasks you want Airflow to execute.
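
    For context, a DAG is just a Python file that declares tasks and their dependencies. Below is a minimal sketch of a preprocessing and feature-engineering DAG that such an environment could run; the DAG id, schedule, and task bodies are placeholders you would replace with your own logic:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def preprocess():
        # Placeholder: load raw data, clean it, and write it back (e.g. to S3).
        print("preprocessing raw data")


    def engineer_features():
        # Placeholder: derive model features from the cleaned data.
        print("building features")


    with DAG(
        dag_id="preprocess_and_feature_engineering",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        preprocess_task = PythonOperator(task_id="preprocess", python_callable=preprocess)
        features_task = PythonOperator(task_id="engineer_features", python_callable=engineer_features)

        # Run feature engineering only after preprocessing succeeds.
        preprocess_task >> features_task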

    The following steps and the Pulumi program written in Python will guide you through the creation of an AWS MWAA environment.

    1. Define an IAM Role: AWS MWAA requires an execution role that the Airflow service can assume, with policies that allow it to access AWS resources such as the S3 bucket holding your DAGs and related artifacts.

    2. Create an S3 Bucket: You need an S3 bucket, with versioning enabled, to store your DAGs and related artifacts such as plugins and requirements files. The bucket must be accessible by the MWAA environment; Airflow logs themselves are delivered to CloudWatch Logs.

    3. Create an MWAA Environment: Once you have the IAM role and the S3 bucket, you create an AWS MWAA environment that references them, along with other configuration such as the Airflow version, environment class, max workers, and the VPC networking (private subnets and a security group) that MWAA requires.

    Below is a Pulumi program that accomplishes these tasks:

    import pulumi
    import pulumi_aws as aws

    # Trust policy that lets the MWAA service assume the execution role.
    mwaa_assume_role_policy = aws.iam.get_policy_document(statements=[
        aws.iam.GetPolicyDocumentStatementArgs(
            actions=["sts:AssumeRole"],
            principals=[aws.iam.GetPolicyDocumentStatementPrincipalArgs(
                type="Service",
                identifiers=["airflow.amazonaws.com", "airflow-env.amazonaws.com"],
            )],
        )
    ])

    # Create the IAM execution role for MWAA.
    mwaa_role = aws.iam.Role("mwaa-role",
        assume_role_policy=mwaa_assume_role_policy.json,
    )

    # Create an S3 bucket to store DAGs and other related artifacts.
    # MWAA requires versioning to be enabled on its source bucket.
    pulumi_mwaa_bucket = aws.s3.Bucket("pulumi-mwaa-bucket",
        acl="private",
        versioning=aws.s3.BucketVersioningArgs(enabled=True),
    )

    # Policy that allows MWAA to read DAGs and related artifacts from the bucket.
    # A complete execution role also needs CloudWatch Logs, SQS, and KMS
    # permissions; see the AWS MWAA documentation for the full policy.
    mwaa_s3_policy_document = aws.iam.get_policy_document_output(statements=[
        aws.iam.GetPolicyDocumentStatementArgs(
            actions=["s3:GetObject", "s3:ListBucket"],
            resources=[
                pulumi_mwaa_bucket.arn,
                pulumi.Output.concat(pulumi_mwaa_bucket.arn, "/*"),
            ],
        )
    ])

    # Attach the S3 access policy to the execution role.
    mwaa_role_policy = aws.iam.RolePolicy("mwaa-s3-policy",
        role=mwaa_role.id,
        policy=mwaa_s3_policy_document.json,
    )

    # MWAA must run inside a VPC: two private subnets in different AZs and a
    # security group. Their IDs are read from stack configuration here.
    config = pulumi.Config()

    # Create the MWAA environment.
    pulumi_mwaa_environment = aws.mwaa.Environment("pulumi-mwaa-environment",
        execution_role_arn=mwaa_role.arn,
        airflow_version="2.0.2",  # Specify the Airflow version, or a default is chosen.
        # The bucket created earlier holds the DAGs under the dags/ prefix.
        source_bucket_arn=pulumi_mwaa_bucket.arn,
        dag_s3_path="dags",
        network_configuration=aws.mwaa.EnvironmentNetworkConfigurationArgs(
            security_group_ids=[config.require("mwaaSecurityGroupId")],
            subnet_ids=config.require_object("mwaaPrivateSubnetIds"),
        ),
        # Send logs for each Airflow component to CloudWatch Logs.
        logging_configuration=aws.mwaa.EnvironmentLoggingConfigurationArgs(
            dag_processing_logs=aws.mwaa.EnvironmentLoggingConfigurationDagProcessingLogsArgs(
                enabled=True,
                log_level="WARNING",
            ),
            scheduler_logs=aws.mwaa.EnvironmentLoggingConfigurationSchedulerLogsArgs(
                enabled=True,
                log_level="WARNING",
            ),
            task_logs=aws.mwaa.EnvironmentLoggingConfigurationTaskLogsArgs(
                enabled=True,
                log_level="WARNING",
            ),
            webserver_logs=aws.mwaa.EnvironmentLoggingConfigurationWebserverLogsArgs(
                enabled=True,
                log_level="WARNING",
            ),
            worker_logs=aws.mwaa.EnvironmentLoggingConfigurationWorkerLogsArgs(
                enabled=True,
                log_level="WARNING",
            ),
        ),
        # More options (environment_class, max_workers, requirements, plugins)
        # can be provided based on requirements. For detailed resource
        # specifications visit:
        # https://www.pulumi.com/registry/packages/aws/api-docs/mwaa/environment/
    )

    # Output the Airflow UI URL.
    pulumi.export("airflow_ui_url", pulumi_mwaa_environment.webserver_url)

    Here's what this program does:

    • IAM Role: We define a role that the MWAA service principals (airflow.amazonaws.com and airflow-env.amazonaws.com) are allowed to assume, then attach a policy granting s3:GetObject and s3:ListBucket on the bucket where the DAGs are stored. A production execution role would also need CloudWatch Logs, SQS, and KMS permissions.

    • S3 Bucket: We create a private, versioned S3 bucket where Airflow's DAGs and other necessary artifacts will reside (a sketch for uploading a DAG file into it follows this list).

    • MWAA Environment: Finally, we set up the MWAA environment. We specify the execution role, the Airflow version, the source bucket and the dags/ path within it, the VPC networking read from stack configuration, and CloudWatch logging levels for the different Airflow components.
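
    The program above provisions the environment but does not upload any DAGs. One way to close that gap, sketched below, is to add an aws.s3.BucketObject to the same Pulumi program so that a local DAG file lands under the bucket's dags/ prefix; the local file name here is only an example:

    # Addition to the program above: upload a local DAG file into the dags/
    # prefix that the MWAA environment reads from. The local path
    # "dags/preprocess_features.py" is an example; `pulumi_mwaa_bucket` is
    # the bucket created earlier.
    dag_file = aws.s3.BucketObject("preprocess-features-dag",
        bucket=pulumi_mwaa_bucket.id,
        key="dags/preprocess_features.py",
        source=pulumi.FileAsset("dags/preprocess_features.py"),
    )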

    Once this code is run using the Pulumi CLI, it provisions the necessary resources in your AWS account, and you can then access the Airflow UI through the URL that is provided as an output of the script.

    Before running Pulumi programs, make sure that you have the Pulumi CLI installed and AWS credentials configured (for example via the AWS CLI). To run the program, save it as the project's entry point (__main__.py in a Python Pulumi project), ensure you have a Pulumi.yaml defining the project, and then execute pulumi up from the project directory. This initiates the deployment process.
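
    As a concrete sketch of that workflow (the region, security group ID, and subnet IDs below are example values to replace with your own; the two config keys match the placeholders used in the program above):

    # Scaffold a Python Pulumi project (creates Pulumi.yaml and __main__.py),
    # then paste the program above into __main__.py.
    pulumi new aws-python

    # Provide the values the program reads from stack configuration.
    pulumi config set aws:region us-east-1
    pulumi config set mwaaSecurityGroupId sg-0123456789abcdef0
    pulumi config set mwaaPrivateSubnetIds '["subnet-aaaa1111","subnet-bbbb2222"]'

    # Preview and deploy. MWAA environments can take 20-30 minutes to create.
    pulumi up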