1. Serverless Data Ingestion into S3 for AI

    To set up serverless data ingestion into an Amazon S3 bucket for AI purposes, we will use AWS Lambda together with Amazon S3. AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you. Amazon S3 is an object storage service that offers scalable, secure storage for the ingested data.

    Below is a Pulumi program written in Python that demonstrates how to create the necessary infrastructure:

    1. An S3 bucket is created to store the ingested data.
    2. An AWS Lambda function is set up to process data as it arrives.
    3. Bucket notifications are configured to trigger the Lambda function when new data arrives.
    4. IAM roles and policies are created to give the necessary permissions to the Lambda function for accessing S3.

    import pulumi
    import pulumi_aws as aws

    # Create an S3 bucket to store the ingested data.
    data_bucket = aws.s3.Bucket("dataBucket",
        acl="private",  # Keep the bucket private.
        versioning=aws.s3.BucketVersioningArgs(
            enabled=True  # Enable versioning for data version control.
        ),
    )

    # IAM role that the Lambda function assumes; the basic execution policy
    # attached below grants it permission to write logs to CloudWatch.
    lambda_role = aws.iam.Role("lambdaRole",
        assume_role_policy="""{
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {
                    "Service": "lambda.amazonaws.com"
                }
            }]
        }"""
    )

    policy_attachment = aws.iam.RolePolicyAttachment("lambdaRoleAttachment",
        policy_arn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
        role=lambda_role.name,
    )

    # Create the AWS Lambda function that will process incoming data in S3.
    data_processor_lambda = aws.lambda_.Function("dataProcessorLambda",
        runtime="python3.12",  # A currently supported Python runtime for the function.
        code=pulumi.AssetArchive({  # The code uploaded to Lambda, packaged as an archive.
            '.': pulumi.FileArchive('./lambda')
        }),
        timeout=180,  # Maximum execution time for the function, in seconds.
        handler="lambda_function.handler",  # The function entrypoint in your Python code.
        role=lambda_role.arn,  # Attach the IAM role created above to the Lambda function.
    )

    # Allow the Lambda function to be invoked by S3.
    lambda_permission = aws.lambda_.Permission("lambdaPermission",
        action="lambda:InvokeFunction",
        function=data_processor_lambda.name,
        principal="s3.amazonaws.com",
        source_arn=data_bucket.arn,
    )

    # Configure the S3 bucket to send an event to the Lambda function when a new object is created.
    bucket_notification = aws.s3.BucketNotification("bucketNotification",
        bucket=data_bucket.id,
        lambda_functions=[aws.s3.BucketNotificationLambdaFunctionArgs(
            lambda_function_arn=data_processor_lambda.arn,
            events=["s3:ObjectCreated:*"],
        )],
        opts=pulumi.ResourceOptions(depends_on=[lambda_permission]),
    )

    # Export the S3 bucket name and the Lambda function ARN.
    pulumi.export('bucket_name', data_bucket.id)
    pulumi.export('lambda_function_arn', data_processor_lambda.arn)

    Here's a brief overview of this Pulumi program:

    • We create an S3 bucket where the ingested data will be stored.
    • We define an IAM role for our AWS Lambda function and attach the basic execution policy, which grants the function permission to run and to write logs to CloudWatch for monitoring.
    • We then create an AWS Lambda function, specifying the runtime environment, the code location, the maximum execution timeout, and the function handler, along with attaching the IAM role.
    • Lastly, we give the necessary permissions for our S3 bucket to invoke the Lambda function and configure an S3 bucket event notification to fire whenever new objects are created in the bucket.

    Remember, the folder referred to as ./lambda in the FileArchive of the Lambda resource is expected to exist in your working directory and should contain the code for your Lambda function. The handler setting corresponds to the file and method name that AWS Lambda will execute within the provided code package. The bucket name and the ARN of the Lambda function are exported as stack outputs, which you can reference in other parts of your Pulumi code or in your application.
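    As an illustration, the sketch below shows what a minimal ./lambda/lambda_function.py could look like. It is a hypothetical example that simply logs each newly created object; the real processing logic for your AI pipeline would go where indicated.

    # ./lambda/lambda_function.py -- hypothetical minimal handler for the S3 trigger.
    import json
    import logging

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    def handler(event, context):
        # An S3 notification event can contain one or more records.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            logger.info("New object ingested: s3://%s/%s", bucket, key)
            # Placeholder: add your own processing here, e.g. validation,
            # transformation, or handing the object off to an AI/ML pipeline.
        return {"statusCode": 200, "body": json.dumps("ok")}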

    Make sure to replace the placeholder paths and names with your actual file paths and desired resource names as needed. The policies attached to the Lambda function in this example are quite basic, so you may need to adapt them according to the level of access your function requires.
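    For example, if the function needs to read the objects it is notified about, you could attach an additional, narrowly scoped policy to the same role. The sketch below is one possible approach (the resource name lambdaS3ReadPolicy is illustrative); it grants read-only access to the ingestion bucket and would be appended to the program above, where lambda_role and data_bucket are already defined.

    import json

    # Hypothetical inline policy: allow the Lambda role to read objects from the data bucket only.
    s3_read_policy = aws.iam.RolePolicy("lambdaS3ReadPolicy",
        role=lambda_role.id,
        policy=data_bucket.arn.apply(lambda arn: json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"{arn}/*",
            }],
        })),
    )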

    To run this program:

    1. Ensure you have the Pulumi CLI and AWS CLI installed and configured with the appropriate access keys.
    2. Save this code as __main__.py inside a Pulumi Python project (for example, one created with pulumi new aws-python).
    3. Run pulumi up from the project directory to deploy your infrastructure.
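
    After the deployment finishes, you can smoke-test the pipeline by uploading an object to the bucket, for example with boto3. The snippet below is a hypothetical test script; it assumes boto3 is installed, your AWS credentials are configured, and you substitute the bucket name reported by pulumi stack output bucket_name.

    # test_ingest.py -- hypothetical smoke test for the ingestion pipeline.
    import boto3

    # Replace with the value printed by `pulumi stack output bucket_name`.
    BUCKET_NAME = "your-bucket-name-from-stack-output"

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=BUCKET_NAME,
        Key="incoming/sample.json",
        Body=b'{"example": "payload"}',
    )
    print("Uploaded test object; check the Lambda function's CloudWatch logs for the processed event.")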