Spanner for ML Feature Store at Scale

Pulumi · Accepted Answer

Google Cloud Spanner is a fully managed, horizontally scalable, globally distributed relational database service designed for large-scale transactional workloads. Its high availability and strong consistency make it a good fit for an ML feature store.

In the context of machine learning (ML), a feature store is a centralized repository for storing and serving features used by machine learning models. A feature store helps in ensuring that the features used during training are consistent with those used during inference, thus preventing training-serving skew.
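To make the training-serving consistency point concrete, here is a minimal sketch of a point-in-time lookup: both training and serving ask "what was the latest value at or before this moment?" using the same function. The in-memory `history` list and the name `value_as_of` are hypothetical and only stand in for timestamped rows in a feature store.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def value_as_of(history, as_of):
    """Return the latest value written at or before `as_of`, or None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical history of one feature for one entity.
history = [
    (datetime(2024, 1, 1, tzinfo=timezone.utc), 0.10),
    (datetime(2024, 1, 5, tzinfo=timezone.utc), 0.25),
    (datetime(2024, 1, 9, tzinfo=timezone.utc), 0.40),
]

# Training looks up the value that was known when the label was observed.
training_value = value_as_of(history, datetime(2024, 1, 6, tzinfo=timezone.utc))

# Serving looks up the value known right now (here: a later timestamp).
serving_value = value_as_of(history, datetime(2024, 1, 10, tzinfo=timezone.utc))
```

Because both paths share one lookup rule, a training example never sees a value that was written after its label time.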

To set up an ML feature store at scale using Cloud Spanner, we'll need to create a Spanner instance, configure it, and then create a database within that instance to hold our feature data.

The Pulumi program below, written in Python, does the following:

1. Creates a new Cloud Spanner instance for the ML feature store.
2. Selects a predefined instance configuration, which determines the region and replication of the instance.
3. Creates a Spanner database within this instance, with tables for the feature data.

```python
import pulumi
import pulumi_gcp as gcp

# Create a Spanner instance
spanner_instance = gcp.spanner.Instance("ml-feature-store-instance",
    config="regional-us-central1", # Choose your region as per your requirement
    display_name="ML Feature Store Instance",
    labels={
        "env": "production",
    },
    num_nodes=1 # You can adjust the number of nodes based on your workload
)

# Create a Spanner Database
spanner_database = gcp.spanner.Database("ml-feature-store-database",
    instance=spanner_instance.name,
    database_dialect="GOOGLE_STANDARD_SQL", # The DDL below uses GoogleSQL syntax; the POSTGRESQL dialect would need different DDL.
    ddls=[ # DDL statements applied to the database; pulumi_gcp names this argument `ddls`
        """
        CREATE TABLE Features (
            FeatureId STRING(36) NOT NULL,
            FeatureName STRING(255) NOT NULL,
            ValueType STRING(36) NOT NULL,
            Metadata STRING(MAX) NOT NULL,
        ) PRIMARY KEY (FeatureId)
        """,
        """
        CREATE TABLE FeatureValues (
            FeatureId STRING(36) NOT NULL,
            EntityId STRING(255) NOT NULL,
            Timestamp TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
            Value BYTES(MAX) NOT NULL,
        ) PRIMARY KEY (FeatureId, EntityId, Timestamp),
        INTERLEAVE IN PARENT Features ON DELETE CASCADE
        """
    ]
)

# Export the Spanner instance and database names
pulumi.export('spanner_instance_name', spanner_instance.name)
pulumi.export('spanner_database_name', spanner_database.name)
```

This program does the following:
- Creates a Cloud Spanner instance with the `regional-us-central1` configuration, which replicates data across zones within a single region.
- Labels the Spanner instance with `env` set to `production`; change this to match your deployment environment.
- Allocates one node to the Spanner instance, which you can increase as demand grows.
- Creates the database schema with two tables, `Features` and `FeatureValues`. The `Features` table stores metadata about each feature, while the `FeatureValues` table stores the actual feature data. `FeatureValues` is interleaved in `Features` with `ON DELETE CASCADE`, so deleting a feature also deletes its values.
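Since `FeatureValues.Value` is a `BYTES(MAX)` column and `Features.ValueType` is a free-form string, the application has to decide how values are serialized. The following is a minimal sketch of one possible convention; the type labels `FLOAT64`, `INT64`, and `STRING` and the helper names are assumptions, not something the schema above defines.

```python
import struct


def encode_value(value_type, value):
    """Serialize a Python value to bytes according to a hypothetical ValueType label."""
    if value_type == "FLOAT64":
        return struct.pack(">d", value)   # 8-byte big-endian double
    if value_type == "INT64":
        return struct.pack(">q", value)   # 8-byte big-endian signed integer
    if value_type == "STRING":
        return value.encode("utf-8")
    raise ValueError(f"unsupported ValueType: {value_type}")


def decode_value(value_type, raw):
    """Inverse of encode_value for the same hypothetical ValueType labels."""
    if value_type == "FLOAT64":
        return struct.unpack(">d", raw)[0]
    if value_type == "INT64":
        return struct.unpack(">q", raw)[0]
    if value_type == "STRING":
        return raw.decode("utf-8")
    raise ValueError(f"unsupported ValueType: {value_type}")


# Round trip: the bytes would be written to Value, the label to ValueType.
raw = encode_value("FLOAT64", 0.75)
restored = decode_value("FLOAT64", raw)
```

Keeping the encoding keyed by `ValueType` lets readers decode a row using only the metadata stored in `Features`.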

Replace the `config` value with the instance configuration that fits your application's requirements, and adjust `num_nodes` to match the capacity and performance you need.
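One way to avoid hard-coding these two values is to merge per-environment overrides over defaults and validate them before creating the instance. This is a sketch only: `DEFAULTS` and `resolve_instance_settings` are hypothetical names, and a plain dict stands in for whatever configuration source you use (for example, Pulumi stack configuration).

```python
# Defaults mirror the values used in the program above.
DEFAULTS = {"config": "regional-us-central1", "num_nodes": 1}


def resolve_instance_settings(overrides):
    """Merge per-environment overrides over the defaults and validate them."""
    settings = {**DEFAULTS, **overrides}
    num_nodes = int(settings["num_nodes"])  # config sources often yield strings
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return {"config": str(settings["config"]), "num_nodes": num_nodes}


# No overrides: fall back to the defaults.
staging = resolve_instance_settings({})

# Overrides for a larger, multi-region deployment.
production = resolve_instance_settings({"config": "nam3", "num_nodes": "3"})
```

The resulting dict could then be passed as keyword arguments to `gcp.spanner.Instance` alongside `display_name` and `labels`.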

Note that this is a simple setup, and for a production-ready ML feature store, more considerations around security, backup, and finer schema design are needed, which Pulumi can also help orchestrate.