1. Live Sentiment Analysis Data Pipeline using AWS Kinesis Firehose


    To create a live sentiment analysis data pipeline using AWS Kinesis Firehose, several AWS services will be brought together to collect, process, and analyze streaming data in real time. AWS Kinesis services are designed to handle large-scale streaming data use cases, such as real-time analytics and live sentiment analysis.

    Here's a high-level view of what we will create:

    1. Kinesis Data Firehose Delivery Stream: This will be the entry point for data. It will capture streaming data and automatically load it into AWS services such as S3, Redshift, OpenSearch (formerly Elasticsearch), or Splunk.

    2. AWS Lambda Function: We will attach a Lambda function to the Firehose delivery stream. This function will process the incoming streaming data and carry out sentiment analysis in real time before the data reaches its final destination.

    3. Amazon S3 Bucket: Data needs to be stored somewhere after being processed. An S3 bucket will serve as the default destination for processed data.

    4. IAM Roles and Policies: Proper IAM roles and policies are necessary for allowing Kinesis Firehose to invoke the Lambda function and for the latter to write the results to S3.

    Here is the detailed Pulumi program that sets up each of these components in AWS using Python:

    import pulumi
    import pulumi_aws as aws

    # IAM role that Kinesis Firehose assumes to write to S3 and invoke Lambda
    firehose_role = aws.iam.Role("firehose_role",
        assume_role_policy="""{
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "firehose.amazonaws.com"},
                "Action": "sts:AssumeRole"
            }]
        }""")

    # Separate IAM role for the Lambda function. Lambda cannot reuse the
    # Firehose role above, because its trust policy only allows
    # firehose.amazonaws.com to assume it.
    lambda_role = aws.iam.Role("lambda_role",
        assume_role_policy="""{
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole"
            }]
        }""")

    # Allow Firehose to write to S3, and Lambda to write CloudWatch logs
    s3_policy = aws.iam.RolePolicyAttachment("firehose_s3_policy",
        role=firehose_role.name,
        policy_arn="arn:aws:iam::aws:policy/AmazonS3FullAccess")
    lambda_policy = aws.iam.RolePolicyAttachment("lambda_basic_execution",
        role=lambda_role.name,
        policy_arn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole")

    # Create an S3 bucket to store the processed data
    s3_bucket = aws.s3.Bucket("s3_bucket")

    # Create a Lambda function that will process the streaming data
    sentiment_analysis_lambda = aws.lambda_.Function("sentiment_analysis",
        role=lambda_role.arn,
        runtime="python3.8",
        handler="index.handler",
        code=pulumi.AssetArchive({
            ".": pulumi.FileArchive("./lambda")  # Lambda code lives in the ./lambda folder
        }))

    # Create a Firehose Delivery Stream, with S3 as the destination.
    # Note: extended_s3_configuration requires destination="extended_s3".
    firehose_stream = aws.kinesis.FirehoseDeliveryStream("firehose_stream",
        destination="extended_s3",
        extended_s3_configuration={
            "role_arn": firehose_role.arn,
            "bucket_arn": s3_bucket.arn,
            "buffering_size": 10,       # Buffering size in MBs (1-128)
            "buffering_interval": 900,  # Buffering interval in seconds (60-900)
            "data_format_conversion_configuration": {
                "enabled": True,
                "input_format_configuration": {
                    "deserializer": {"open_x_json_ser_de": {}}
                },
                "output_format_configuration": {
                    "serializer": {"parquet_ser_de": {}}
                },
                "schema_configuration": {
                    "database_name": "mydatabase",  # Replace with a Glue database you have set up
                    "table_name": "mytable",        # and a table in that Glue database
                    "role_arn": firehose_role.arn,
                    "region": "us-west-2",          # Region should match your Glue database
                },
            },
            "processing_configuration": {
                "enabled": True,
                "processors": [{
                    "type": "Lambda",
                    "parameters": [{
                        "parameter_name": "LambdaArn",
                        # Firehose requires a qualified (versioned) Lambda ARN
                        "parameter_value": sentiment_analysis_lambda.arn.apply(
                            lambda arn: f"{arn}:$LATEST"),
                    }],
                }],
            },
        })

    # Output the name of the bucket and the ARN of the stream
    pulumi.export('bucket_name', s3_bucket.bucket)
    pulumi.export('firehose_stream_arn', firehose_stream.arn)

    This program integrates AWS Lambda with Kinesis Firehose for real-time data processing. The Lambda function specified here (sentiment_analysis) will be invoked by Kinesis Data Firehose, process incoming records via the handler index.handler, apply sentiment analysis, and return the transformed records, which Firehose then stores in an Amazon S3 bucket.

    To use this Pulumi program, replace the mydatabase and mytable placeholders with your AWS Glue catalog database and corresponding table names, and ensure that the AWS region matches the one in which your Glue database is located. Also, create a ./lambda directory containing your Lambda code for sentiment analysis that reads the incoming records from Kinesis, analyzes the sentiment, and returns the transformed records with sentiment information added.
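    The ./lambda/index.py file could look like the following minimal sketch. Firehose transformation Lambdas receive base64-encoded records and must return them with a recordId, a result status, and re-encoded data; the keyword-based scoring here is a stand-in for a real sentiment model (for example, Amazon Comprehend), and the word lists are illustrative assumptions.

    ```python
    # Hypothetical ./lambda/index.py - a minimal Firehose transformation handler.
    # The keyword scoring below is a placeholder for a real sentiment model.
    import base64
    import json

    POSITIVE = {"good", "great", "love", "excellent"}
    NEGATIVE = {"bad", "terrible", "hate", "awful"}

    def score(text):
        # Toy sentiment score: count positive words minus negative words
        words = text.lower().split()
        s = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        if s > 0:
            return "POSITIVE"
        if s < 0:
            return "NEGATIVE"
        return "NEUTRAL"

    def handler(event, context):
        output = []
        for record in event["records"]:
            # Firehose delivers each record's payload base64-encoded
            payload = json.loads(base64.b64decode(record["data"]))
            payload["sentiment"] = score(payload.get("text", ""))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",  # "Dropped" or "ProcessingFailed" are also valid
                "data": base64.b64encode(
                    (json.dumps(payload) + "\n").encode("utf-8")
                ).decode("utf-8"),
            })
        return {"records": output}
    ```

    Every recordId from the input must appear exactly once in the response, or Firehose treats the invocation as failed and retries.
    
    
    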

    After you've set up the code as above, running pulumi up with the Pulumi CLI will create the resources specified in AWS. You can then start streaming your data into the Kinesis Data Firehose delivery stream to perform live sentiment analysis.
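    To start streaming, a producer can push records with the Firehose put_record API. The sketch below assumes boto3 is available and that you look up the generated delivery stream name yourself (for example, in the AWS console or via pulumi stack output); the stream name shown is a placeholder.

    ```python
    # A minimal producer sketch. The boto3 call is commented out because it
    # requires AWS credentials and a real delivery stream name.
    import json

    def build_record(text):
        # Firehose delivers raw bytes; newline-delimited JSON keeps the
        # records separable for downstream consumers such as Glue or Athena.
        return (json.dumps({"text": text}) + "\n").encode("utf-8")

    # Hypothetical usage against AWS:
    # import boto3
    # firehose = boto3.client("firehose", region_name="us-west-2")
    # firehose.put_record(
    #     DeliveryStreamName="firehose_stream-abc123",  # placeholder name
    #     Record={"Data": build_record("I love this product")},
    # )
    ```
    
    
    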