
Auto-Scaling Inference Endpoints for LLMs

Introduction

In this guide, we will set up auto-scaling inference endpoints for Large Language Models (LLMs) using Pulumi and AWS. The key services involved are Amazon EC2 Auto Scaling, an Application Load Balancer, and Amazon API Gateway.

Step-by-Step Explanation

Step 1: Set up the VPC

First, we need to create a Virtual Private Cloud (VPC) to host our resources. This includes two public subnets in separate Availability Zones, a route table, an internet gateway, and a security group.
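
From the Full Code Example below: the VPC and two public subnets. Two subnets in different Availability Zones are needed because the Application Load Balancer in Step 3 requires at least two; the route table, internet gateway, and security group follow in the full program.

const vpc = new aws.ec2.Vpc("vpc", {
    cidrBlock: "10.0.0.0/16",
});

const azs = ["us-west-2a", "us-west-2b"];
const subnets = azs.map((az, i) => new aws.ec2.Subnet(`subnet-${i}`, {
    vpcId: vpc.id,
    cidrBlock: `10.0.${i + 1}.0/24`,
    availabilityZone: az,
    mapPublicIpOnLaunch: true, // instances need outbound access to pull the container image
}));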

Step 2: Create an Auto Scaling Group

Next, we will create an Auto Scaling Group to manage the EC2 instances that will host our inference endpoints. We will configure the Auto Scaling Group to scale based on CPU utilization.
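
The key piece is a target tracking policy attached to the group; it adds or removes instances to keep average CPU utilization near the target (excerpted from the Full Code Example below):

const scalingPolicy = new aws.autoscaling.Policy("scalingPolicy", {
    autoscalingGroupName: autoScalingGroup.name,
    policyType: "TargetTrackingScaling",
    targetTrackingConfiguration: {
        predefinedMetricSpecification: {
            predefinedMetricType: "ASGAverageCPUUtilization",
        },
        targetValue: 50.0, // keep average CPU across the group around 50%
    },
});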

Step 3: Set up an Application Load Balancer

We will create an Application Load Balancer (ALB) to distribute traffic to the EC2 instances in our Auto Scaling Group. The ALB will be configured to listen on port 80 and forward requests to the instances.
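
From the Full Code Example below: the ALB spans both public subnets and its port 80 listener forwards traffic to the target group that the Auto Scaling Group registers its instances with.

const loadBalancer = new aws.lb.LoadBalancer("loadBalancer", {
    internal: false,
    securityGroups: [securityGroup.id],
    subnets: subnets.map(s => s.id),
});

const listener = new aws.lb.Listener("listener", {
    loadBalancerArn: loadBalancer.arn,
    port: 80,
    defaultActions: [{
        type: "forward",
        targetGroupArn: targetGroup.arn,
    }],
});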

Step 4: Deploy the Inference Service

We will deploy our LLM inference service on the EC2 instances. This is handled by the EC2 launch template defined alongside the Auto Scaling Group: its user data installs Docker and starts the inference container on boot.
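
The bootstrap script below is embedded (base64-encoded) in the launch template's user data in the full program; the const name here is only for illustration, and my-llm-inference-service is a placeholder image name to replace with your own model server image.

const bootstrapScript = `#!/bin/bash
sudo yum update -y
sudo yum install -y docker
sudo service docker start
sudo usermod -a -G docker ec2-user
docker run -d -p 80:80 my-llm-inference-service`;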

Step 5: Set up API Gateway

Finally, we will create an API Gateway HTTP API to expose our inference endpoints. The API will proxy requests through to the Application Load Balancer.
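
From the Full Code Example below: an HTTP API with a catch-all route that proxies every method and path to the load balancer's DNS name.

const api = new aws.apigatewayv2.Api("api", {
    protocolType: "HTTP",
});

const integration = new aws.apigatewayv2.Integration("integration", {
    apiId: api.id,
    integrationType: "HTTP_PROXY",
    integrationMethod: "ANY",
    integrationUri: pulumi.interpolate`http://${loadBalancer.dnsName}/{proxy}`,
});

const apiRoute = new aws.apigatewayv2.Route("apiRoute", {
    apiId: api.id,
    routeKey: "ANY /{proxy+}",
    target: pulumi.interpolate`integrations/${integration.id}`,
});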

Conclusion

By following these steps, we have set up auto-scaling inference endpoints for LLMs using Pulumi and AWS. This setup ensures that our inference service can handle varying levels of traffic efficiently and cost-effectively.

Full Code Example

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Step 1: Set up the VPC
const vpc = new aws.ec2.Vpc("vpc", {
    cidrBlock: "10.0.0.0/16",
});

// Two public subnets in different Availability Zones (the ALB below requires at least two),
// with public IPs so instances can pull the container image.
const azs = ["us-west-2a", "us-west-2b"];
const subnets = azs.map((az, i) => new aws.ec2.Subnet(`subnet-${i}`, {
    vpcId: vpc.id,
    cidrBlock: `10.0.${i + 1}.0/24`,
    availabilityZone: az,
    mapPublicIpOnLaunch: true,
}));

const internetGateway = new aws.ec2.InternetGateway("internetGateway", {
    vpcId: vpc.id,
});

const routeTable = new aws.ec2.RouteTable("routeTable", {
    vpcId: vpc.id,
});

const routeToInternet = new aws.ec2.Route("routeToInternet", {
    routeTableId: routeTable.id,
    destinationCidrBlock: "0.0.0.0/0",
    gatewayId: internetGateway.id,
});

const routeTableAssociations = subnets.map((s, i) => new aws.ec2.RouteTableAssociation(`routeTableAssociation-${i}`, {
    subnetId: s.id,
    routeTableId: routeTable.id,
}));

const securityGroup = new aws.ec2.SecurityGroup("securityGroup", {
    vpcId: vpc.id,
    ingress: [{
        protocol: "tcp",
        fromPort: 80,
        toPort: 80,
        cidrBlocks: ["0.0.0.0/0"],
    }],
    egress: [{
        protocol: "-1",
        fromPort: 0,
        toPort: 0,
        cidrBlocks: ["0.0.0.0/0"],
    }],
});

// Step 2: Create an Auto Scaling Group
// Look up a current Amazon Linux 2 AMI in the deployment region rather than hard-coding a region-specific ID
const ami = aws.ec2.getAmiOutput({
    mostRecent: true,
    owners: ["amazon"],
    filters: [{ name: "name", values: ["amzn2-ami-hvm-*-x86_64-gp2"] }],
});

const launchTemplate = new aws.ec2.LaunchTemplate("launchTemplate", {
    imageId: ami.id,
    instanceType: "t2.micro", // example size; real LLM inference typically needs a GPU instance type
    vpcSecurityGroupIds: [securityGroup.id],
    // Launch template user data must be base64-encoded
    userData: Buffer.from(`#!/bin/bash
sudo yum update -y
sudo yum install -y docker
sudo service docker start
sudo usermod -a -G docker ec2-user
docker run -d -p 80:80 my-llm-inference-service`).toString("base64"),
});

const targetGroup = new aws.lb.TargetGroup("targetGroup", {
    port: 80,
    protocol: "HTTP",
    vpcId: vpc.id,
    targetType: "instance",
});

const autoScalingGroup = new aws.autoscaling.Group("autoScalingGroup", {
    vpcZoneIdentifiers: subnets.map(s => s.id),
    launchTemplate: {
        id: launchTemplate.id,
        version: "$Latest",
    },
    minSize: 1,
    maxSize: 5,
    desiredCapacity: 2,
    targetGroupArns: [targetGroup.arn],
});

const scalingPolicy = new aws.autoscaling.Policy("scalingPolicy", {
    autoscalingGroupName: autoScalingGroup.name,
    policyType: "TargetTrackingScaling",
    targetTrackingConfiguration: {
        predefinedMetricSpecification: {
            predefinedMetricType: "ASGAverageCPUUtilization",
        },
        targetValue: 50.0,
    },
});

// Step 3: Set up an Application Load Balancer
const loadBalancer = new aws.lb.LoadBalancer("loadBalancer", {
    internal: false,
    securityGroups: [securityGroup.id],
    subnets: subnets.map(s => s.id),
});

const listener = new aws.lb.Listener("listener", {
    loadBalancerArn: loadBalancer.arn,
    port: 80,
    defaultActions: [{
        type: "forward",
        targetGroupArn: targetGroup.arn,
    }],
});

// Step 4: Deploy the Inference Service
// (Already included in the launch template user data)

// Step 5: Set up API Gateway
const api = new aws.apigatewayv2.Api("api", {
    protocolType: "HTTP",
});

const integration = new aws.apigatewayv2.Integration("integration", {
    apiId: api.id,
    integrationType: "HTTP_PROXY",
    integrationMethod: "ANY",
    // Forward the matched path straight through to the ALB
    integrationUri: pulumi.interpolate`http://${loadBalancer.dnsName}/{proxy}`,
});

const apiRoute = new aws.apigatewayv2.Route("apiRoute", {
    apiId: api.id,
    routeKey: "ANY /{proxy+}", // inference requests are typically POSTs, so match any method
    target: pulumi.interpolate`integrations/${integration.id}`,
});

const stage = new aws.apigatewayv2.Stage("stage", {
    apiId: api.id,
    autoDeploy: true,
    name: "$default",
});

export const vpcId = vpc.id;
export const loadBalancerDnsName = loadBalancer.dnsName;
export const apiEndpoint = stage.invokeUrl;
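
CPU utilization is a reasonable default scaling signal, but LLM inference is often better tracked by request volume per instance. Below is a minimal sketch of that variation, reusing the loadBalancer, targetGroup, and autoScalingGroup resources defined above; the resource name requestCountPolicy and the target of 100 requests per instance are illustrative choices, not part of the original program.

// Alternative: scale on ALB requests per target instead of CPU utilization (illustrative values).
const requestCountPolicy = new aws.autoscaling.Policy("requestCountPolicy", {
    autoscalingGroupName: autoScalingGroup.name,
    policyType: "TargetTrackingScaling",
    targetTrackingConfiguration: {
        predefinedMetricSpecification: {
            predefinedMetricType: "ALBRequestCountPerTarget",
            // CloudWatch resource label: "<load balancer ARN suffix>/<target group ARN suffix>"
            resourceLabel: pulumi.interpolate`${loadBalancer.arnSuffix}/${targetGroup.arnSuffix}`,
        },
        targetValue: 100,
    },
});

After running pulumi up, pulumi stack output apiEndpoint prints the HTTP API URL that clients call to reach the inference service.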
