Scalable Serving of Model Inferences with DigitalOcean Load Balancer

Question

Pulumi · Accepted Answer

To set up scalable infrastructure for serving model inferences on DigitalOcean, you can use Droplets as the compute instances that host your model serving application and a Load Balancer to distribute traffic across them. Reserved IPs (formerly called Floating IPs) are an option if individual Droplets need static addresses, although the Load Balancer's own IP address already gives clients a stable entry point.

A DigitalOcean Load Balancer spreads incoming traffic across your Droplets, which improves availability and fault tolerance. If a Droplet fails its health checks, the Load Balancer stops sending it traffic and routes requests to the remaining healthy Droplets. Creating Droplets with the monitoring option enabled lets you collect metrics and set alerts, which you can then use to decide when to add or remove Droplets, either manually or through your own automation.
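
To make the health-check behaviour concrete, here is a minimal sketch of how consecutive-result thresholds (like the `unhealthy_threshold` and `healthy_threshold` values used in the program's `healthcheck` block) can flip a backend between healthy and unhealthy. The `BackendHealth` class is hypothetical and only illustrates the idea; it is not DigitalOcean's implementation.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Tracks one backend's status from a stream of health-check results."""
    unhealthy_threshold: int = 3  # consecutive failures before marking unhealthy
    healthy_threshold: int = 3    # consecutive successes before marking healthy again
    healthy: bool = True
    _streak: int = 0              # consecutive results that contradict the current status

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return the backend's resulting status."""
        if check_passed == self.healthy:
            # The result agrees with the current status, so any streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
            if self._streak >= limit:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy


backend = BackendHealth()
results = [True, False, False, False, True, True, True]
history = [backend.record(ok) for ok in results]
print(history)  # [True, True, True, False, False, False, True]
```

With thresholds of 3, a single failed check does not remove a Droplet from rotation, and a single success does not bring it back.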

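Each Droplet needs to run your model serving application, typically a web service that accepts HTTP requests and responds with model inferences. The application must listen on the port that the Load Balancer forwards to and must return a successful response to the Load Balancer's health checks.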
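
As an illustration of what such a service might look like, here is a minimal sketch using only the Python standard library. The `predict` function, the `/predict` and `/healthz` paths, and the JSON shape are all assumptions made for this example; a real deployment would load an actual model and would probably use a production web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def predict(features):
    """Hypothetical stand-in for a real model: returns the mean of the inputs."""
    return sum(features) / len(features) if features else 0.0


class InferenceHandler(BaseHTTPRequestHandler):
    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Answer health checks with 200 on "/" and on the assumed "/healthz" path.
        if self.path in ("/", "/healthz"):
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/predict":
            self._send_json(404, {"error": "not found"})
            return
        try:
            length = int(self.headers.get("Content-Length", 0))
            features = json.loads(self.rfile.read(length))["features"]
            self._send_json(200, {"prediction": predict(features)})
        except (ValueError, KeyError, TypeError):
            self._send_json(400, {"error": "expected a JSON object with a 'features' list"})


def make_server(host="0.0.0.0", port=80):
    """Build the server; port 80 matches the forwarding rule's target port."""
    return ThreadingHTTPServer((host, port), InferenceHandler)


# To start serving: make_server().serve_forever()
```

Binding to port 80 usually requires root privileges, so in practice you might run the service in a container and publish the port, or listen on a higher port and adjust the Load Balancer's `target_port` to match.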

Here's a Pulumi Python program that creates this entire setup:

1. A set of Droplets running a model serving application.
2. A Load Balancer configured to distribute incoming traffic to these Droplets.
3. (Optional) A domain with an SSL certificate to secure connections to your Load Balancer.
4. Health checks to ensure traffic is only sent to healthy Droplets.

```python
import pulumi
import pulumi_digitalocean as digitalocean

# Placeholder so the program parses; replace with your SSH key ID or fingerprint.
YOUR_SSH_KEY_ID = "replace-with-your-ssh-key-id"

REGION = "nyc3"  # You can choose a different region depending on your requirements.
DOMAIN_NAME = "modelserving.example.com"  # Replace with your domain.

# Define the number and configuration of Droplets to provision.
droplet_count = 3  # Start with 3 Droplets for redundancy; scale later based on load.
droplets = []  # We'll fill this list with the Droplets we create.

# Create the Droplets first so their IDs can be passed to the Load Balancer.
for i in range(droplet_count):
    droplet = digitalocean.Droplet(f"model-serving-droplet-{i}",
        image="docker-18-04",  # Docker image slug; check that it is still offered, or change as needed.
        region=REGION,  # Keep consistent with the Load Balancer's region.
        size="s-1vcpu-1gb",  # Size of the Droplet, choose based on your model requirements.
        # private_networking is deprecated; Droplets join the region's default VPC unless vpc_uuid is set.
        monitoring=True,  # Enable monitoring to gather Droplet metrics.
        ssh_keys=[YOUR_SSH_KEY_ID],  # Replace with your actual SSH key ID for secure access.
        tags=["model-serving"],  # A tag to identify all Droplets under the same workload.
    )
    droplets.append(droplet)

# Optional: register the domain with DigitalOcean DNS. A Let's Encrypt certificate
# can only be issued for a domain that DigitalOcean manages.
domain = digitalocean.Domain("model-serving-domain", name=DOMAIN_NAME)

# Optional: create a DigitalOcean-managed certificate for HTTPS traffic.
certificate = digitalocean.Certificate("model-serving-cert",
    name="model-serving-certificate",
    type="lets_encrypt",
    domains=[domain.name],  # Referencing the Domain makes Pulumi create it first.
)

# Create a DigitalOcean Load Balancer to distribute incoming traffic across the Droplets.
# Pulumi resources have no update() method, so every forwarding rule is declared here.
load_balancer = digitalocean.LoadBalancer("model-serving-lb",
    region=REGION,
    algorithm="least_connections",  # Check the provider docs: DigitalOcean has deprecated algorithm selection.
    forwarding_rules=[
        # Forward HTTP traffic from port 80 on the Load Balancer to port 80 on the Droplets.
        # The actual ports depend on the configuration of your model serving application.
        digitalocean.LoadBalancerForwardingRuleArgs(
            entry_protocol="http",
            entry_port=80,
            target_protocol="http",
            target_port=80,
        ),
        # Forward HTTPS traffic from port 443 on the Load Balancer to port 443 on the Droplets.
        # This requires the Droplets to serve HTTPS; to terminate TLS at the Load Balancer
        # instead, set target_protocol="http" and target_port=80. Remove this rule if you
        # are not using a certificate.
        digitalocean.LoadBalancerForwardingRuleArgs(
            entry_protocol="https",
            entry_port=443,
            target_protocol="https",
            target_port=443,
            certificate_name=certificate.name,  # Preferred over the deprecated certificate_id.
        ),
    ],
    healthcheck=digitalocean.LoadBalancerHealthcheckArgs(
        port=80,
        protocol="http",
        check_interval_seconds=10,
        response_timeout_seconds=5,
        unhealthy_threshold=3,
        healthy_threshold=3,
    ),
    # Attach the Droplets to the Load Balancer.
    droplet_ids=[droplet.id for droplet in droplets],
)

# Point the domain at the Load Balancer once its IP address is known.
dns_record = digitalocean.DnsRecord("model-serving-a-record",
    domain=domain.id,
    type="A",
    name="@",
    value=load_balancer.ip,
)

# Export the Load Balancer IP and the domain name.
pulumi.export("load_balancer_ip", load_balancer.ip)
pulumi.export("domain", domain.name)
```

Replace `YOUR_SSH_KEY_ID` with your actual SSH key ID or fingerprint, which you can find in the DigitalOcean control panel.

This code sets up the infrastructure for a scalable serving system. The Droplets are created from a Docker image with monitoring enabled, so you can collect metrics that inform scaling decisions; you still need to deploy your model serving application onto them, for example through the Droplet's `user_data` script or your own provisioning tooling. The Load Balancer routes and balances the incoming traffic. Optionally, with a domain and an SSL certificate, you get a user-friendly URL with secure connections.

As users send requests to your system, the Load Balancer will distribute them across the healthy Droplets, ensuring even load distribution and high availability. If any Droplet fails its health checks, it will no longer receive traffic until it's healthy again. By monitoring and responding to load, you can scale your Droplets up or down manually or even automate this process based on the metrics you collect.
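
One simple way to turn collected metrics into a scaling decision is proportional scaling: size the pool so that average CPU utilisation lands near a target. The sketch below is a hypothetical helper, not part of Pulumi or the DigitalOcean API; the 60% target and the minimum and maximum counts are assumptions you would tune. Its result could be used as the `droplet_count` value in the program above before running `pulumi up` again.

```python
import math


def desired_droplet_count(current_count, avg_cpu_percent,
                          target_cpu_percent=60.0, min_count=2, max_count=10):
    """Return a Droplet count that would bring average CPU close to the target."""
    if current_count < 1:
        raise ValueError("current_count must be at least 1")
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    # Total load is roughly current_count * avg_cpu; divide by the per-Droplet target.
    raw = math.ceil(current_count * avg_cpu_percent / target_cpu_percent)
    # Clamp so the pool keeps some redundancy and never grows without bound.
    return max(min_count, min(max_count, raw))


print(desired_droplet_count(3, 90.0))  # 5: 3 * 90 / 60 = 4.5, rounded up
print(desired_droplet_count(3, 20.0))  # 2: the raw value 1 is raised to min_count
print(desired_droplet_count(8, 95.0))  # 10: the raw value 13 is capped at max_count
```

Rounding up favours spare capacity over cost, and the lower bound keeps at least two Droplets behind the Load Balancer so that one failure does not take the service down.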