1. Real-time Inference Data Quality Analysis with Evidently on AWS

    Real-time inference data quality analysis is an essential part of any machine learning or data-driven application in which you must continuously evaluate a model's behavior on incoming data. AWS offers several services that can support this workflow, and Amazon CloudWatch Evidently is one of the key services for this kind of analysis.

    Amazon CloudWatch Evidently is a service that lets you run experiments to optimize your application and understand the causal impact of changes. It provides tools for feature evaluation, A/B testing, and analysis of the operational performance of your features.

    Here's how to set up a basic Evidently project using Pulumi, an infrastructure-as-code tool, with AWS. The goal is to create a simple project, feature, and launch configuration that you can expand for more elaborate real-time data analysis:

    1. Project: The container for everything you do in Evidently. You will create projects to organize your features.
    2. Feature: Represents a specific aspect or component of your application that you want to evaluate or run experiments on.
    3. Launch: Allows you to safely introduce new features and monitor their performance against a control group.

    Below is the Pulumi Python program that creates these AWS Evidently resources:

```python
import pulumi
import pulumi_aws as aws

# Create an Evidently project to organize the features and launches.
evidently_project = aws.evidently.Project(
    "dataQualityAnalysisProject",
    description="Project for monitoring real-time inference data quality",
)

# Define a feature within the Evidently project for evaluation.
evidently_feature = aws.evidently.Feature(
    "inferenceQualityFeature",
    project=evidently_project.name,
    variations=[
        aws.evidently.FeatureVariationArgs(
            # Variation A: the existing (control) configuration.
            name="A",
            value=aws.evidently.FeatureVariationValueArgs(
                bool_value="false",  # boolean variation values are passed as strings
            ),
        ),
        aws.evidently.FeatureVariationArgs(
            # Variation B: the candidate configuration under test.
            name="B",
            value=aws.evidently.FeatureVariationValueArgs(
                bool_value="true",
            ),
        ),
    ],
    default_variation="A",  # variation A is served unless a launch overrides it
    description="Feature to analyze data quality of real-time inference",
)

# Set up a launch to analyze the feature's performance in a real-world setting.
evidently_launch = aws.evidently.Launch(
    "inferenceQualityLaunch",
    project=evidently_project.name,
    groups=[
        aws.evidently.LaunchGroupArgs(
            name="TestGroup",
            feature=evidently_feature.name,
            variation="B",  # this group receives feature configuration B
            description="Group to test the new configuration",
        ),
    ],
    scheduled_splits_config=aws.evidently.LaunchScheduledSplitsConfigArgs(
        steps=[
            aws.evidently.LaunchScheduledSplitsConfigStepArgs(
                start_time="2024-01-01T00:00:00Z",  # when this traffic split takes effect
                group_weights={
                    # Weights are in thousandths of a percent:
                    # 100000 routes 100% of launch traffic to TestGroup.
                    "TestGroup": 100000,
                },
            ),
        ],
    ),
    description="Launch to measure the effectiveness of inference data quality configuration B",
)

# Export the project, feature, and launch names for easy reference.
pulumi.export("project_name", evidently_project.name)
pulumi.export("feature_name", evidently_feature.name)
pulumi.export("launch_name", evidently_launch.name)
```

    In this program, you create an Evidently project and, inside it, define a feature that represents a particular aspect of your application, with multiple variations of that feature. This is useful for real-time analysis because it lets you test different configurations of your data quality checks.

    Next, you create a launch configuration to release this feature in a controlled manner to a subset of your audience and monitor its performance. Note that Evidently expresses group weights in thousandths of a percent, so a weight of 100000 sends 100% of the launch traffic to TestGroup. In a real-world scenario, you would typically direct only a fraction of the traffic to the test group and compare it against a control group.
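    Because scheduled-split weights are integers in thousandths of a percent (100000 = 100%), it is easy to get the unit wrong. A small, hypothetical stdlib helper converts ordinary percentages into that unit for a fractional rollout:

```python
def to_split_weight(percent: float) -> int:
    """Convert a percentage (0-100) to Evidently's group-weight unit.

    Scheduled-split weights are thousandths of a percent, so 100000
    represents 100% of the launch traffic.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(percent * 1000)

# A 10% canary for TestGroup, with the remaining 90% on a control group;
# these values would go into the group_weights map of a scheduled split.
group_weights = {
    "TestGroup": to_split_weight(10),     # 10000
    "ControlGroup": to_split_weight(90),  # 90000
}
```

    Any launch traffic not covered by the group weights is served the feature's default variation.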

    The exported values give you the names of the created project, feature, and launch, which you can read back with the Pulumi CLI (pulumi stack output) or use in the AWS Console to locate your resources.

    Remember that while this example sets up the necessary Evidently resources, integrating them with your application and building real-time data pipelines will require additional AWS services, such as Amazon Kinesis for streaming data and AWS Lambda for compute.
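    For example, a Lambda function in such a pipeline could report the outcome of each data-quality check back to Evidently as a custom event through the PutProjectEvents API. The field names inside data.details below are a hypothetical schema for illustration, and the actual API call is commented out because it needs AWS credentials:

```python
import json
from datetime import datetime, timezone

# Build a custom Evidently event describing one inference's data-quality check.
# "aws.evidently.custom" is the event type for user-defined events, and
# "data" must be a JSON-encoded string.
event = {
    "timestamp": datetime.now(timezone.utc),
    "type": "aws.evidently.custom",
    "data": json.dumps({
        "details": {
            "entityId": "request-42",  # hypothetical request identifier
            "nullFraction": 0.02,      # hypothetical data-quality metric
            "schemaValid": True,
        }
    }),
}

# With credentials configured, the event would be sent like this:
# import boto3
# boto3.client("evidently").put_project_events(
#     project="dataQualityAnalysisProject",
#     events=[event],
# )
```

    Events reported this way can feed the metrics that Evidently uses to compare the test and control groups of a launch.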

    Pulumi's strength lies in its ability to manage and orchestrate such resources as code, leading to a more reliable, repeatable, and transparent deployment process.