1. Decoupling Data Ingestion from Model Scoring


    To decouple data ingestion from model scoring, you can compose cloud-based components into a system where the ingestion process and the scoring process run independently, which gives you a more scalable and maintainable architecture.

    In the context of a Pulumi program, we might use several cloud services in combination:

    1. Data ingestion: managed streaming services such as AWS Kinesis, Google Pub/Sub, or Azure Event Hubs can handle high-volume data streams efficiently (a small producer sketch follows this list).
    2. Data storage: the ingested data can be staged in a cloud storage service such as AWS S3, Google Cloud Storage, or Azure Blob Storage so it is safely persisted before processing.
    3. Data processing and model scoring: serverless compute services such as AWS Lambda, Google Cloud Functions, or Azure Functions can be triggered by the arrival of new data in storage and run the model scoring code without you managing servers.
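
    On the ingestion side, producers only write to the stream; they never call the scoring code directly. Here is a minimal producer sketch, assuming boto3 is available and the stream name is supplied by the caller (ingest_event is a hypothetical helper, not part of the Pulumi program below):

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # Hypothetical helper: push one raw event onto the ingestion stream.
    # Scoring happens elsewhere, later, from whatever lands downstream.
    def ingest_event(stream_name: str, event: dict) -> None:
        kinesis.put_record(
            StreamName=stream_name,  # e.g. the kinesis_stream_name exported by the Pulumi program below
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=str(event.get("id", "default")),
        )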

    Here's a high-level Pulumi program that uses AWS services to create a simple architecture for decoupling data ingestion from model scoring:

    1. An Amazon Kinesis stream captures real-time data.
    2. An AWS Lambda function is triggered by the Kinesis stream to perform data preprocessing.
    3. The preprocessed data is stored in an Amazon S3 bucket, where it awaits scoring.
    4. A separate AWS Lambda function is set up for model scoring and can be triggered as needed, possibly on a schedule or via an API call.
    import pulumi
    import pulumi_aws as aws

    # Create a Kinesis stream for data ingestion.
    kinesis_stream = aws.kinesis.Stream("data-ingestion-stream")

    # IAM role that the Lambda functions will assume.
    lambda_role = aws.iam.Role("lambdaRole",
        assume_role_policy="""{
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": { "Service": "lambda.amazonaws.com" }
            }]
        }"""
    )

    # Policy granting the Lambda execution role read access to the Kinesis stream.
    # In practice you would also add CloudWatch Logs and S3 write permissions here.
    lambda_policy = aws.iam.RolePolicy("lambdaPolicy",
        role=lambda_role.id,
        policy=kinesis_stream.arn.apply(lambda arn: """{
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": [
                    "kinesis:GetRecords",
                    "kinesis:GetShardIterator",
                    "kinesis:DescribeStream",
                    "kinesis:ListStreams"
                ],
                "Resource": "%s"
            }]
        }""" % arn)
    )

    # S3 bucket where the preprocessed data awaits scoring.
    preprocessed_data_bucket = aws.s3.Bucket("preprocessed-data-bucket")

    # AWS Lambda function for data preprocessing, triggered by the Kinesis stream.
    data_preprocessing_lambda = aws.lambda_.Function("dataPreprocessing",
        role=lambda_role.arn,
        handler="index.handler",
        runtime="python3.8",
        code=pulumi.AssetArchive({
            ".": pulumi.FileArchive("./preprocessing_lambda_code")
        }),
        environment={
            "variables": {
                # Tell the preprocessing code which bucket to write into.
                "TARGET_BUCKET_NAME": preprocessed_data_bucket.bucket
            }
        },
        # Ensure the Lambda function is created only after the Kinesis stream exists.
        opts=pulumi.ResourceOptions(depends_on=[kinesis_stream])
    )

    # Wire the Kinesis stream to the preprocessing Lambda so new records invoke it.
    aws.lambda_.EventSourceMapping("kinesisToPreprocessing",
        event_source_arn=kinesis_stream.arn,
        function_name=data_preprocessing_lambda.arn,
        starting_position="LATEST"
    )

    # AWS Lambda function for model scoring.
    model_scoring_lambda = aws.lambda_.Function("modelScoring",
        role=lambda_role.arn,
        handler="index.handler",
        runtime="python3.8",
        code=pulumi.AssetArchive({
            ".": pulumi.FileArchive("./scoring_lambda_code")
        })
        # The model scoring Lambda can be triggered by an event, run on a schedule, or invoked on demand.
    )

    # Exports
    pulumi.export("kinesis_stream_name", kinesis_stream.name)
    pulumi.export("preprocessed_data_bucket_name", preprocessed_data_bucket.id)
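
    The Pulumi program only provisions infrastructure; the handler code itself lives in the referenced directories. Below is a minimal sketch of what index.handler inside ./preprocessing_lambda_code might look like, assuming boto3 is available in the runtime, the function receives standard Kinesis event records, and the actual preprocessing step is a placeholder you would replace:

    # ./preprocessing_lambda_code/index.py -- a sketch, not a definitive implementation.
    import base64
    import json
    import os
    import uuid

    import boto3

    s3 = boto3.client("s3")
    TARGET_BUCKET = os.environ["TARGET_BUCKET_NAME"]  # set by the Pulumi program above

    def handler(event, context):
        """Decode Kinesis records, apply lightweight preprocessing, and persist them to S3."""
        for record in event.get("Records", []):
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            # Placeholder preprocessing: drop empty fields and normalize keys.
            cleaned = {k.lower(): v for k, v in payload.items() if v is not None}
            s3.put_object(
                Bucket=TARGET_BUCKET,
                Key=f"preprocessed/{uuid.uuid4()}.json",
                Body=json.dumps(cleaned).encode("utf-8"),
            )

    Each decoded record is written to the bucket named by TARGET_BUCKET_NAME, which the Pulumi program points at the preprocessed-data bucket.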

    Before you run this code:

    • Replace ./preprocessing_lambda_code and ./scoring_lambda_code with the paths to your actual Lambda function code for data preprocessing and model scoring, respectively.
    • Customize the Lambda handler and runtime as per your implementation.
    • The IAM role and policy grant the Lambda functions' execution role read access to the Kinesis stream. Adjust the permissions according to your specific security and functionality requirements (for example, add CloudWatch Logs and S3 write access).
    • The model scoring Lambda function is deployed but not automatically invoked; it can be triggered by another event, run on a schedule, or called via an API. A handler sketch follows this list.
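
    For reference, one possible shape for index.handler in ./scoring_lambda_code is sketched below. The model loading, the PREPROCESSED_BUCKET_NAME environment variable, and the event format (a list of S3 keys) are all illustrative assumptions; substitute your own model and invocation contract:

    # ./scoring_lambda_code/index.py -- illustrative only; swap in your real model loading and scoring.
    import json
    import os

    import boto3

    s3 = boto3.client("s3")
    BUCKET = os.environ["PREPROCESSED_BUCKET_NAME"]  # hypothetical variable; wire it up in the Pulumi program

    def load_model():
        # Placeholder: load a model bundled with the deployment package or fetched from S3.
        return lambda features: sum(features.values())

    MODEL = load_model()

    def handler(event, context):
        """Score every preprocessed object referenced in the invocation event."""
        results = []
        for key in event.get("keys", []):
            obj = s3.get_object(Bucket=BUCKET, Key=key)
            features = json.loads(obj["Body"].read())
            results.append({"key": key, "score": MODEL(features)})
        return {"results": results}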

    Make sure that you have the AWS CLI configured with the necessary permissions. Install the Pulumi CLI and authenticate it with your cloud provider. Once that's set up, you can run pulumi up to create these resources in your AWS account.
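
    If you want the scoring function to run on a schedule rather than on demand, one option is an EventBridge (CloudWatch Events) rule. The sketch below assumes it is appended to the Pulumi program above, so the aws import and model_scoring_lambda are already in scope; the hourly cadence is only an example:

    # A scheduled trigger for the scoring Lambda, extending the program above.
    scoring_schedule = aws.cloudwatch.EventRule("scoringSchedule",
        schedule_expression="rate(1 hour)"  # example cadence; adjust to your needs
    )

    # Point the rule at the scoring Lambda.
    aws.cloudwatch.EventTarget("scoringScheduleTarget",
        rule=scoring_schedule.name,
        arn=model_scoring_lambda.arn
    )

    # Allow EventBridge to invoke the scoring Lambda.
    aws.lambda_.Permission("allowEventBridgeInvoke",
        action="lambda:InvokeFunction",
        function=model_scoring_lambda.name,
        principal="events.amazonaws.com",
        source_arn=scoring_schedule.arn
    )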