1. Serverless Data Preprocessing for Machine Learning using AWS Lambda


    To implement a serverless data preprocessing solution for machine learning on AWS using AWS Lambda, you need to set up the following:

    1. AWS Lambda Function: A serverless compute service that lets you run code without provisioning or managing servers, which is ideal for data preprocessing tasks.
    2. IAM Role and Policy: Permissions for the Lambda function to access other AWS services such as S3 buckets or AWS Glue if needed for the preprocessing task.
    3. Input Trigger: A mechanism to invoke the Lambda function, such as Amazon S3 bucket events, Amazon API Gateway for HTTP requests, or manual invocation.
    4. Output: Store the preprocessed data in a destination, such as an S3 bucket or a database, which can be used for machine learning.

    Below is a basic Pulumi program that will:

    • Create an IAM role and policy giving the necessary permissions to the Lambda function.
    • Set up a Lambda function with a simple data preprocessing handler.
    • Configure an S3 bucket to trigger the Lambda function on new object creation, which can be used to upload raw data to be processed.
    • Assume you already have a preprocessing.py script containing the data preprocessing logic to be deployed in the Lambda function.

    Let's walk through the code:

    import pulumi
    import pulumi_aws as aws

    # Create an IAM role that the Lambda function assumes at runtime.
    lambda_role = aws.iam.Role("lambdaRole",
        assume_role_policy="""{
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "sts:AssumeRole",
                "Principal": { "Service": "lambda.amazonaws.com" },
                "Effect": "Allow",
                "Sid": ""
            }]
        }"""
    )

    # Allow the function to write CloudWatch logs, read raw objects from S3,
    # and write preprocessed objects back to S3.
    lambda_policy = aws.iam.RolePolicy("lambdaPolicy",
        role=lambda_role.id,
        policy="""{
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": "*"
            }]
        }"""
    )

    # S3 buckets for the raw input data and the preprocessed output data.
    raw_data_bucket = aws.s3.Bucket("rawDataBucket")
    processed_data_bucket = aws.s3.Bucket("processedDataBucket")

    # Define the data preprocessing Lambda function.
    # 'preprocessing.py' is assumed to contain a 'lambda_handler' function, and
    # 'aws.lambda_.Function' expects the source code as a ZIP archive, hence
    # 'pulumi.FileArchive' pointing at 'preprocessing.zip'.
    preprocessing_lambda = aws.lambda_.Function("preprocessingLambda",
        runtime="python3.12",  # the python3.8 runtime is deprecated; use a currently supported runtime
        code=pulumi.FileArchive("./preprocessing.zip"),
        handler="preprocessing.lambda_handler",  # format: <FILENAME>.<HANDLER_FUNCTION_NAME>
        role=lambda_role.arn,
        environment=aws.lambda_.FunctionEnvironmentArgs(variables={
            # Tell the handler where to write its output.
            "PROCESSED_BUCKET": processed_data_bucket.bucket,
        }),
    )

    # Grant the raw-data bucket permission to invoke the Lambda function.
    s3_invoke_permission = aws.lambda_.Permission("s3InvokePermission",
        action="lambda:InvokeFunction",
        function=preprocessing_lambda.name,
        principal="s3.amazonaws.com",
        source_arn=raw_data_bucket.arn,
    )

    # Trigger the Lambda function whenever a new .csv object is created under
    # the 'data/' prefix of the raw-data bucket.
    s3_bucket_event = aws.s3.BucketNotification("s3BucketEvent",
        bucket=raw_data_bucket.id,
        lambda_functions=[aws.s3.BucketNotificationLambdaFunctionArgs(
            lambda_function_arn=preprocessing_lambda.arn,
            events=["s3:ObjectCreated:*"],
            filter_prefix="data/",   # raw data is expected under the 'data/' prefix
            filter_suffix=".csv",    # only react to .csv files
        )],
        # Make sure the invoke permission exists before the notification is created.
        opts=pulumi.ResourceOptions(depends_on=[s3_invoke_permission]),
    )

    # Export the bucket names and the Lambda function name.
    pulumi.export("raw_data_bucket_name", raw_data_bucket.bucket)
    pulumi.export("processed_data_bucket_name", processed_data_bucket.bucket)
    pulumi.export("preprocessing_lambda_name", preprocessing_lambda.name)

    Here's a breakdown of what the program does:

    • It sets up an IAM role and policy for the Lambda function, allowing it to write logs and to read objects from and write objects to S3.
    • It defines a Lambda function with Python runtime, pointing to a ZIP file (preprocessing.zip) which should contain your preprocessing script and dependencies.
    • It creates two S3 buckets: one for raw data input and another for preprocessed data output.
    • It grants the raw data bucket permission to invoke the Lambda function and sets up an event notification so the function runs whenever a new CSV file is uploaded under the data/ prefix.

    Make sure to replace the placeholder preprocessing.zip with the actual path to your zip archive containing the preprocessing.py script and its dependencies.
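    As a starting point, the handler below is a minimal sketch of what preprocessing.py might contain: it reads the uploaded CSV from the raw bucket, applies a trivial placeholder transformation (dropping empty rows), and writes the result to the bucket named by the PROCESSED_BUCKET environment variable set in the Pulumi program above. The transformation itself is purely illustrative; replace it with your actual cleaning and feature engineering.

    import csv
    import io
    import os
    from urllib.parse import unquote_plus

    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # Bucket for preprocessed output, passed in via the Lambda environment.
        processed_bucket = os.environ["PROCESSED_BUCKET"]

        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # S3 event keys are URL-encoded, so decode them first.
            key = unquote_plus(record["s3"]["object"]["key"])

            # Read the raw CSV object that triggered the event.
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            rows = list(csv.reader(io.StringIO(body)))

            # Placeholder preprocessing: drop completely empty rows.
            # Replace this with your real cleaning / feature engineering.
            cleaned = [row for row in rows if any(cell.strip() for cell in row)]

            # Write the preprocessed CSV to the processed-data bucket.
            out = io.StringIO()
            csv.writer(out).writerows(cleaned)
            s3.put_object(
                Bucket=processed_bucket,
                Key=key.replace("data/", "processed/", 1),
                Body=out.getvalue().encode("utf-8"),
            )

        return {"status": "ok"}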

    To use this program:

    1. Install the Pulumi CLI and configure the AWS provider.
    2. Write your data preprocessing logic in preprocessing.py.
    3. Archive your script as preprocessing.zip, ensuring preprocessing.py sits at the root of the archive and its handler function is named lambda_handler to match the handler setting above.
    4. Run the Pulumi program with pulumi up, which deploys your infrastructure to AWS.
    5. Upload your raw CSV data (with a .csv suffix) under the data/ prefix of the raw data bucket to trigger preprocessing, for example using the small boto3 sketch shown after these steps.
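    If you prefer to smoke-test the pipeline from Python rather than the AWS console, a short boto3 script such as the sketch below can upload a sample file under the data/ prefix. The bucket name and sample.csv path are placeholders; use the actual bucket name reported by pulumi stack output raw_data_bucket_name.

    import boto3

    s3 = boto3.client("s3")

    # Placeholder: substitute the bucket name exported by the Pulumi program,
    # e.g. the output of `pulumi stack output raw_data_bucket_name`.
    raw_bucket = "rawdatabucket-0123456"

    # Uploading a .csv object under the data/ prefix fires the S3 notification
    # and invokes the preprocessing Lambda.
    s3.upload_file("sample.csv", raw_bucket, "data/sample.csv")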