Auto-Scaling Inference Endpoints for LLMs
Introduction
In this guide, we will set up auto-scaling inference endpoints for Large Language Models (LLMs) using Pulumi and AWS. The key services involved are EC2 Auto Scaling, an Application Load Balancer, and API Gateway.
Step-by-Step Explanation
Step 1: Set up the VPC
First, we need to create a Virtual Private Cloud (VPC) to host our resources. This includes creating public subnets in two Availability Zones, a route table, an internet gateway, and a security group.
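One detail worth calling out: an Application Load Balancer requires subnets in at least two Availability Zones, so the full example below creates two public subnets. The second one looks like this:

const subnetB = new aws.ec2.Subnet("subnetB", {
    vpcId: vpc.id,
    cidrBlock: "10.0.2.0/24",
    availabilityZone: "us-west-2b",
    mapPublicIpOnLaunch: true, // public IPs let instances pull the container image
});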
Step 2: Create an Auto Scaling Group
Next, we will create an Auto Scaling Group to manage the EC2 instances that host our inference endpoints. A target-tracking scaling policy will scale the group out and in based on average CPU utilization.
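CPU utilization is a reasonable default, but LLM inference is often bound by request concurrency rather than CPU. If that applies, a target-tracking policy on ALB request count per target can be a better signal. A sketch that builds on the resources defined in the full example below; the target value of 100 is an illustrative assumption, not a tuned number:

const requestCountPolicy = new aws.autoscaling.Policy("requestCountPolicy", {
    autoscalingGroupName: autoScalingGroup.name,
    policyType: "TargetTrackingScaling",
    targetTrackingConfiguration: {
        predefinedMetricSpecification: {
            predefinedMetricType: "ALBRequestCountPerTarget",
            // Resource label format: <alb-arn-suffix>/<target-group-arn-suffix>
            resourceLabel: pulumi.interpolate`${loadBalancer.arnSuffix}/${targetGroup.arnSuffix}`,
        },
        targetValue: 100, // illustrative: requests per instance before scaling out
    },
});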
Step 3: Set up an Application Load Balancer
We will create an Application Load Balancer (ALB) to distribute traffic to the EC2 instances in our Auto Scaling Group. The ALB will be configured to listen on port 80 and forward requests to the instances.
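It is also worth giving the target group a health check so the ALB only routes to instances whose inference server is actually up. A sketch of the target group from the full example, extended with a health check; the path "/" is an assumption, so point it at whatever health route your container exposes:

const targetGroup = new aws.lb.TargetGroup("targetGroup", {
    port: 80,
    protocol: "HTTP",
    vpcId: vpc.id,
    targetType: "instance",
    healthCheck: {
        path: "/", // assumption: replace with your service's health endpoint
        interval: 30,
        healthyThreshold: 2,
        unhealthyThreshold: 3,
    },
});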
Step 4: Deploy the Inference Service
We will deploy our LLM inference service on the EC2 instances. This is handled by an EC2 launch template whose user-data script installs Docker and starts the inference container on boot.
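The container image name used in the full example, my-llm-inference-service, is a placeholder for your own image. To avoid hard-coding it, the image can be read from Pulumi stack configuration; a minimal sketch, assuming a config key named llmImage:

const config = new pulumi.Config();
// Set with: pulumi config set llmImage my-registry/my-image:tag
const llmImage = config.get("llmImage") ?? "my-llm-inference-service";
// ...then interpolate llmImage into the launch template's user data.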
Step 5: Set up API Gateway
Finally, we will create an API Gateway HTTP API to expose our inference endpoints. An HTTP proxy integration routes incoming requests through to the Application Load Balancer.
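Since each LLM inference request is comparatively expensive, basic throttling on the stage is a cheap safeguard. The stage from the full example could be extended with default route settings; the limits below are illustrative assumptions, not recommendations:

const stage = new aws.apigatewayv2.Stage("stage", {
    apiId: api.id,
    autoDeploy: true,
    name: "$default",
    defaultRouteSettings: {
        throttlingBurstLimit: 20, // illustrative
        throttlingRateLimit: 10,  // illustrative: requests per second
    },
});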
Conclusion
By following these steps, we have set up auto-scaling inference endpoints for LLMs using Pulumi and AWS. This setup ensures that our inference service can handle varying levels of traffic efficiently and cost-effectively.
Full Code Example
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
// Step 1: Set up the VPC
const vpc = new aws.ec2.Vpc("vpc", {
    cidrBlock: "10.0.0.0/16",
});
// Two public subnets in different Availability Zones: the ALB requires at least two.
const subnet = new aws.ec2.Subnet("subnet", {
    vpcId: vpc.id,
    cidrBlock: "10.0.1.0/24",
    availabilityZone: "us-west-2a",
    mapPublicIpOnLaunch: true, // instances need public IPs to pull the container image
});
const subnetB = new aws.ec2.Subnet("subnetB", {
    vpcId: vpc.id,
    cidrBlock: "10.0.2.0/24",
    availabilityZone: "us-west-2b",
    mapPublicIpOnLaunch: true,
});
const internetGateway = new aws.ec2.InternetGateway("internetGateway", {
    vpcId: vpc.id,
});
const routeTable = new aws.ec2.RouteTable("routeTable", {
    vpcId: vpc.id,
});
const routeToInternet = new aws.ec2.Route("routeToInternet", {
    routeTableId: routeTable.id,
    destinationCidrBlock: "0.0.0.0/0",
    gatewayId: internetGateway.id,
});
const routeTableAssociation = new aws.ec2.RouteTableAssociation("routeTableAssociation", {
    subnetId: subnet.id,
    routeTableId: routeTable.id,
});
const routeTableAssociationB = new aws.ec2.RouteTableAssociation("routeTableAssociationB", {
    subnetId: subnetB.id,
    routeTableId: routeTable.id,
});
// Allow HTTP in from anywhere and all traffic out (shared by the ALB and the instances).
const securityGroup = new aws.ec2.SecurityGroup("securityGroup", {
    vpcId: vpc.id,
    ingress: [{
        protocol: "tcp",
        fromPort: 80,
        toPort: 80,
        cidrBlocks: ["0.0.0.0/0"],
    }],
    egress: [{
        protocol: "-1",
        fromPort: 0,
        toPort: 0,
        cidrBlocks: ["0.0.0.0/0"],
    }],
});
// Step 2: Create an Auto Scaling Group
// Look up the latest Amazon Linux 2 AMI for the current region rather than
// hard-coding an AMI ID (AMI IDs differ between regions).
const ami = aws.ec2.getAmiOutput({
    mostRecent: true,
    owners: ["amazon"],
    filters: [{ name: "name", values: ["amzn2-ami-hvm-*-x86_64-gp2"] }],
});
const launchTemplate = new aws.ec2.LaunchTemplate("launchTemplate", {
    imageId: ami.id,
    instanceType: "t2.micro", // real LLM inference usually needs a GPU type such as g4dn.xlarge
    vpcSecurityGroupIds: [securityGroup.id], // security group names only apply outside a VPC
    // Launch template user data must be base64-encoded.
    userData: Buffer.from(`#!/bin/bash
yum update -y
yum install -y docker
service docker start
usermod -a -G docker ec2-user
# "my-llm-inference-service" is a placeholder for your own inference image
docker run -d -p 80:80 my-llm-inference-service`).toString("base64"),
});
const targetGroup = new aws.lb.TargetGroup("targetGroup", {
    port: 80,
    protocol: "HTTP",
    vpcId: vpc.id,
    targetType: "instance",
});
const autoScalingGroup = new aws.autoscaling.Group("autoScalingGroup", {
    vpcZoneIdentifiers: [subnet.id, subnetB.id],
    launchTemplate: {
        id: launchTemplate.id,
        version: "$Latest",
    },
    minSize: 1,
    maxSize: 5,
    desiredCapacity: 2,
    targetGroupArns: [targetGroup.arn],
});
const scalingPolicy = new aws.autoscaling.Policy("scalingPolicy", {
    autoscalingGroupName: autoScalingGroup.name,
    policyType: "TargetTrackingScaling",
    targetTrackingConfiguration: {
        predefinedMetricSpecification: {
            predefinedMetricType: "ASGAverageCPUUtilization",
        },
        targetValue: 50.0,
    },
});
// Step 3: Set up an Application Load Balancer
const loadBalancer = new aws.lb.LoadBalancer("loadBalancer", {
    internal: false,
    securityGroups: [securityGroup.id],
    subnets: [subnet.id, subnetB.id], // ALBs require subnets in at least two AZs
});
const listener = new aws.lb.Listener("listener", {
    loadBalancerArn: loadBalancer.arn,
    port: 80,
    protocol: "HTTP",
    defaultActions: [{
        type: "forward",
        targetGroupArn: targetGroup.arn,
    }],
});
// Step 4: Deploy the Inference Service
// (Already included in the launch template user data)
// Step 5: Set up API Gateway
const api = new aws.apigatewayv2.Api("api", {
    protocolType: "HTTP",
});
const integration = new aws.apigatewayv2.Integration("integration", {
    apiId: api.id,
    integrationType: "HTTP_PROXY",
    integrationMethod: "ANY", // required for HTTP_PROXY integrations
    // {proxy} forwards the greedy path segment from the route to the ALB.
    integrationUri: loadBalancer.dnsName.apply(dnsName => `http://${dnsName}/{proxy}`),
});
const apiRoute = new aws.apigatewayv2.Route("apiRoute", {
    apiId: api.id,
    routeKey: "ANY /{proxy+}", // inference calls are usually POSTs, so match any method
    target: pulumi.interpolate`integrations/${integration.id}`,
});
const stage = new aws.apigatewayv2.Stage("stage", {
    apiId: api.id,
    autoDeploy: true,
    name: "$default",
});
export const vpcId = vpc.id;
export const loadBalancerDnsName = loadBalancer.dnsName;
export const apiEndpoint = stage.invokeUrl;
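To deploy, run pulumi up and wait for the stack outputs. The API endpoint is then available via pulumi stack output apiEndpoint, and can be exercised with something like curl "$(pulumi stack output apiEndpoint)/" (the exact path depends on the routes your inference container serves). Expect a few minutes of delay while instances boot, pull the container, and register as healthy with the ALB.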