1. Scheduled Data Ingestion for S3 Data Lakes

    Scheduled data ingestion for an S3 data lake automates transferring and managing data from various sources into an Amazon S3 bucket, which serves as the data lake's storage layer. To run this on a schedule, you can combine AWS services such as AWS Lambda for the compute work, Amazon EventBridge (formerly CloudWatch Events) for scheduling, and, for more complex pipelines, AWS Step Functions for workflow orchestration.

    Below is a Python program using Pulumi that sets up a simple scheduled ingestion pipeline for an S3-based data lake. It provisions four pieces:

    1. AWS Lambda Function: This function will handle the task of moving data into our S3 bucket. It could execute any data ingestion logic you have in mind (such as pulling data from an API or processing files).

    2. Amazon EventBridge: This is used to trigger the Lambda function on a schedule. You can set it up to execute at a specific time interval (e.g., every hour or once a day).

    3. IAM Role and Policy: The Lambda function needs permission to access the S3 bucket. The role's trust policy lets the Lambda service assume it, and an attached policy grants the necessary S3 permissions.

    4. S3 Bucket: The destination for the ingested data, which forms the core of your data lake.

    import pulumi
    import pulumi_aws as aws

    # Create an S3 bucket that will act as our Data Lake
    data_lake_bucket = aws.s3.Bucket("dataLakeBucket",
        acl="private",
        versioning=aws.s3.BucketVersioningArgs(
            enabled=True,
        ))

    # IAM role which the Lambda function will assume
    lambda_role = aws.iam.Role("lambdaRole",
        assume_role_policy="""{
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {
                    "Service": "lambda.amazonaws.com"
                }
            }]
        }""")

    # Policy allowing our Lambda to read and write objects in the S3 bucket
    lambda_policy = aws.iam.RolePolicy("lambdaPolicy",
        role=lambda_role.name,
        policy=data_lake_bucket.arn.apply(lambda arn: f"""{{
            "Version": "2012-10-17",
            "Statement": [{{
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": "{arn}/*"
            }}]
        }}"""))

    # Lambda function doing the data ingestion
    data_ingestion_lambda = aws.lambda_.Function("dataIngestionLambda",
        code=pulumi.FileArchive("./lambda.zip"),
        role=lambda_role.arn,
        handler="index.handler",
        runtime="python3.12")

    # EventBridge rule to trigger our Lambda on a schedule: 'rate(1 day)' means once a day
    lambda_schedule = aws.cloudwatch.EventRule("lambdaSchedule",
        schedule_expression="rate(1 day)")

    # Target that links the rule to our Lambda function
    lambda_target = aws.cloudwatch.EventTarget("lambdaTarget",
        rule=lambda_schedule.name,
        arn=data_ingestion_lambda.arn)

    # Permission for EventBridge to invoke the Lambda function
    invoke_permission = aws.lambda_.Permission("invokePermission",
        action="lambda:InvokeFunction",
        function=data_ingestion_lambda.name,
        principal="events.amazonaws.com",
        source_arn=lambda_schedule.arn)

    # Export the bucket name so the ingested data can be located
    pulumi.export("data_lake_bucket_name", data_lake_bucket.bucket)

    Explanation:

    • aws.s3.Bucket creates a new S3 bucket with versioning enabled for data storage.
    • aws.iam.Role sets up a new IAM role that the AWS Lambda function will assume.
    • aws.iam.RolePolicy attaches a policy to the IAM role that grants the Lambda function permission to put and get objects in the S3 bucket.
    • aws.lambda_.Function creates a new Lambda function to handle the data ingestion logic, where code points to a zipped archive of your Python script.
    • aws.cloudwatch.EventRule sets up a new EventBridge rule to invoke our Lambda function on the defined schedule; a cron-based alternative is sketched after this list.
    • aws.cloudwatch.EventTarget attaches the Lambda function as a target to the scheduled EventRule.
    • aws.lambda_.Permission grants EventBridge the necessary permission to invoke the Lambda function.
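
    If a fixed rate doesn't fit, EventBridge schedule expressions can also use cron syntax. As a sketch (reusing the resource name from the program above), this variant would fire every day at 12:00 UTC rather than on a rolling 24-hour rate:

    # Alternative: a cron-based schedule that fires daily at 12:00 UTC.
    # EventBridge cron has six fields: minute hour day-of-month month day-of-week year.
    lambda_schedule = aws.cloudwatch.EventRule("lambdaSchedule",
        schedule_expression="cron(0 12 * * ? *)")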

    This Pulumi program provisions everything needed for a scheduled ingestion process that feeds your S3 data lake. Make sure to replace "./lambda.zip" with the path to your zipped Lambda function code and set the handler property to match your script.
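
    If you would rather not build lambda.zip by hand, Pulumi can assemble the deployment package from local files. A minimal sketch, assuming your handler lives in a local index.py:

    # Sketch: have Pulumi build the archive from a local file instead of
    # shipping a pre-built ./lambda.zip (assumes ./index.py exists).
    data_ingestion_lambda = aws.lambda_.Function("dataIngestionLambda",
        code=pulumi.AssetArchive({
            "index.py": pulumi.FileAsset("./index.py"),
        }),
        role=lambda_role.arn,
        handler="index.handler",
        runtime="python3.12")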

    Note: The actual data ingestion logic must be defined within your Lambda function's code, which the Pulumi program does not cover; a minimal example handler is sketched below.
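
    For illustration only, here is one possible index.py. It pulls a document from a hypothetical HTTPS endpoint and writes it under a date-partitioned key; the source URL, the key prefix, and the BUCKET_NAME environment variable are all assumptions, and passing the bucket name would additionally require an environment block (aws.lambda_.FunctionEnvironmentArgs) on the Function resource above.

    import json
    import os
    import urllib.request
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Hypothetical source endpoint -- replace with your real data source.
        source_url = "https://example.com/data.json"
        with urllib.request.urlopen(source_url) as response:
            payload = response.read()

        # The bucket name is assumed to arrive via an environment variable.
        bucket = os.environ["BUCKET_NAME"]
        # Partition objects by ingestion time, a common data lake layout.
        key = datetime.now(timezone.utc).strftime("raw/%Y/%m/%d/ingest-%H%M%S.json")
        s3.put_object(Bucket=bucket, Key=key, Body=payload)

        return {"statusCode": 200, "body": json.dumps({"written": key})}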