Real-Time Stream Analytics with GCP Pub/Sub & Dataflow

Question

Pulumi · Accepted Answer

In this Pulumi program, we'll set up the resources required to create a real-time stream analytics pipeline using Google Cloud Platform (GCP). The two main components we will use are Google Cloud Pub/Sub and Google Cloud Dataflow.

1. Google Cloud Pub/Sub: This is a messaging service that allows you to send and receive messages between independent applications. We use Pub/Sub to create a topic where messages (which can be thought of as events or data points) are sent. Subscribers to this topic can then process or analyze these messages in real time.

2. Google Cloud Dataflow: This is a fully managed service for transforming and enriching data in stream (real time) and batch (historical) modes with equal reliability and expressiveness. We use Dataflow to create a job that reads from our Pub/Sub topic, performs some kind of analysis or processing, and then outputs the results.

Let's go through the steps in the program that we're going to write:

- **Setting up a Pub/Sub topic**: This topic will be the entry point for our data stream. Any application or sensor that wants to send data for real-time processing will publish messages to this topic.
- **Creating a Pub/Sub subscription**: A subscription is needed to pull messages from the topic. Although we don't consume messages within this program, the subscription is necessary to hook up Dataflow for processing.
- **Creating a Dataflow job**: The job will be configured to read from the Pub/Sub subscription, perform some transformations or aggregations, and output the results. For the sake of this example, we'll assume a simple pass-through or log processing to avoid complicating the Dataflow job. Typically, a Dataflow job might integrate with BigQuery or another database to store the processed results.

Now, let's begin with the Pulumi program, written in Python:

```python
import pulumi
import pulumi_gcp as gcp

# Create a Pub/Sub topic to receive streaming data.
pubsub_topic = gcp.pubsub.Topic("analytics-topic")

# Create a Pub/Sub subscription to the topic.
pubsub_subscription = gcp.pubsub.Subscription("analytics-subscription",
    topic=pubsub_topic.name)

# Define the Dataflow job that will consume messages from the Pub/Sub subscription.
# Note that you need to provide a template for the job, which is beyond the scope
# of this example. The template defines the actual stream processing logic.
#
# We're using a pass-through example from GCP's public templates for illustration purposes.
dataflow_job = gcp.dataflow.FlexTemplateJob("analytics-dataflow-job",
    container_spec_gcs_path="gs://dataflow-templates/latest/Stream_Deduplicate",
    parameters={
        "inputSubscription": pubsub_subscription.id.apply(lambda id: f"projects/{gcp.config.project}/subscriptions/{id}"),
        "outputTable": "PROJECT_ID:DATASET.TABLE_NAME",
        "windowDuration": "5",
    },
    on_delete="cancel")

# Output the important resources for the user to know about.
pulumi.export("pubsub_topic", pubsub_topic.name)
pulumi.export("pubsub_subscription", pubsub_subscription.name)
pulumi.export("dataflow_job", dataflow_job.name)
```

This program uses Pulumi's `pulumi-gcp` package to create the necessary resources in GCP.

- We create a Pub/Sub topic named `"analytics-topic"`, which will be used to publish the streaming data.
- A subscription named `"analytics-subscription"` is created so that our Dataflow job can pull messages from the Pub/Sub topic.
- The `gcp.dataflow.FlexTemplateJob` resource is used to set up a Dataflow pipeline. The `container_spec_gcs_path` points to a GCS path where the job's container spec can be found. This spec contains the details about the Dataflow streaming job. The `parameters` field provides the necessary parameters for the Dataflow job. They include the subscription to read from, the BigQuery table to write to, and other job-specific parameters.

The `on_delete="cancel"` parameter ensures that the Dataflow job is canceled when the Pulumi resource is deleted, which helps clean up resources to prevent continued billing.

After this program is run, the exported values will give you the names of the Pub/Sub topic, the subscription, and the Dataflow job. These can all be accessed and managed using the Google Cloud console.

Remember to replace `"PROJECT_ID:DATASET.TABLE_NAME"` with your actual project ID, dataset, and table name where you want the processed data to be stored. This Dataflow job is configured to deduplicate the stream data using a given window duration, defined by `"windowDuration"`. This can be adjusted according to your specific use case.

Note that this code is a template for starting real-time analytics and should be customized to fit the exact needs of your stream processing, including the job template and any additional resources or configurations your particular scenario requires.