How do I checkpoint LLM training progress in S3?
This guide shows how to use Pulumi to provision the AWS infrastructure for checkpointing large language model (LLM) training: an S3 bucket that holds the checkpoints, plus the IAM permissions a training job needs to write them and read them back. Keeping checkpoints in S3 lets you store and retrieve training progress reliably.
We will:
- Create an S3 bucket to store the checkpoints.
- Configure the bucket with versioning to keep track of different checkpoint versions.
- Create an IAM policy and role that give the training job access to the bucket.
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Create an S3 bucket to store LLM training checkpoints
const bucket = new aws.s3.Bucket("llm-checkpoints", {
    versioning: {
        enabled: true, // Enable versioning to keep track of different checkpoint versions
    },
    tags: {
        Environment: "dev",
        Project: "LLM Training",
    },
});

// Define a policy document for read/write access to the S3 bucket
const bucketPolicy = new aws.iam.Policy("bucketPolicy", {
    policy: bucket.arn.apply(arn => JSON.stringify({
        Version: "2012-10-17",
        Statement: [
            {
                Effect: "Allow",
                Action: [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:ListBucket",
                ],
                Resource: [
                    `${arn}`,
                    `${arn}/*`,
                ],
            },
        ],
    })),
});

// Create an IAM role that can be assumed by the training job
const trainingJobRole = new aws.iam.Role("trainingJobRole", {
    assumeRolePolicy: {
        Version: "2012-10-17",
        Statement: [
            {
                Effect: "Allow",
                Principal: {
                    Service: "sagemaker.amazonaws.com",
                },
                Action: "sts:AssumeRole",
            },
        ],
    },
});

// Attach the policy to the role
new aws.iam.RolePolicyAttachment("trainingJobPolicyAttachment", {
    role: trainingJobRole.name,
    policyArn: bucketPolicy.arn,
});

// Export the bucket name and IAM role ARN for use in training jobs
export const bucketName = bucket.bucket;
export const trainingJobRoleArn = trainingJobRole.arn;
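The program above only provisions the bucket and permissions; how checkpoints are named inside the bucket is up to the training job. As a rough sketch of one possible layout, the helpers below build a zero-padded object key per training step and pick the most recent checkpoint from a list of keys (for example, keys returned by an S3 list call, which the `s3:ListBucket` permission allows). The key format and the names `checkpointKey`, `parseStep`, and `latestCheckpoint` are illustrative assumptions, not part of Pulumi or the AWS SDK.

```typescript
// Hypothetical key layout (an assumption for illustration):
//   <runId>/step-<zero-padded step>.ckpt
const STEP_WIDTH = 10;

// Build the object key for the checkpoint written at a given training step.
function checkpointKey(runId: string, step: number): string {
    if (!Number.isInteger(step) || step < 0) {
        throw new Error(`step must be a non-negative integer, got ${step}`);
    }
    return `${runId}/step-${String(step).padStart(STEP_WIDTH, "0")}.ckpt`;
}

// Extract the step number from a key, or undefined if the key
// does not follow the layout above.
function parseStep(key: string): number | undefined {
    const match = /\/step-(\d+)\.ckpt$/.exec(key);
    return match ? Number(match[1]) : undefined;
}

// Return the key with the highest step number, ignoring unrelated objects.
function latestCheckpoint(keys: string[]): string | undefined {
    let best: { key: string; step: number } | undefined;
    for (const key of keys) {
        const step = parseStep(key);
        if (step !== undefined && (best === undefined || step > best.step)) {
            best = { key, step };
        }
    }
    return best ? best.key : undefined;
}
```

Under this layout, `checkpointKey("run-42", 1500)` yields `run-42/step-0000001500.ckpt`. Zero-padding keeps keys in step order when listed lexicographically, and because each step gets its own key, bucket versioning acts as a safety net against accidental overwrites rather than as the main history mechanism.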
Key Points
- S3 Bucket: Created to store LLM training checkpoints with versioning enabled.
- IAM Policy: Configured to allow read/write access to the S3 bucket.
- IAM Role: Created for the training job to assume (trusted by the SageMaker service), with the policy attached to it.
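The IAM policy in this guide grants all three actions on both the bucket ARN and the object ARN pattern. `s3:ListBucket` is a bucket-level action, while `s3:GetObject` and `s3:PutObject` apply to objects, so a somewhat tighter variant could split them into two statements. The sketch below is one way to do that as a plain function; the name `checkpointBucketPolicy` and the two interfaces are hypothetical helpers introduced here for illustration.

```typescript
// Minimal local types for the policy document (illustrative, not from a library).
interface PolicyStatement {
    Effect: "Allow" | "Deny";
    Action: string[];
    Resource: string[];
}

interface PolicyDocument {
    Version: "2012-10-17";
    Statement: PolicyStatement[];
}

// Build a policy document that scopes each action to the resource type it applies to.
function checkpointBucketPolicy(bucketArn: string): PolicyDocument {
    return {
        Version: "2012-10-17",
        Statement: [
            {
                // Listing applies to the bucket itself.
                Effect: "Allow",
                Action: ["s3:ListBucket"],
                Resource: [bucketArn],
            },
            {
                // Reading and writing apply to objects inside the bucket.
                Effect: "Allow",
                Action: ["s3:GetObject", "s3:PutObject"],
                Resource: [`${bucketArn}/*`],
            },
        ],
    };
}
```

In the Pulumi program, this could be used in place of the inline object, for example `policy: bucket.arn.apply(arn => JSON.stringify(checkpointBucketPolicy(arn)))`.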
Summary
In this guide, we created a versioned S3 bucket to store LLM training checkpoints. We also created an IAM policy and a role that SageMaker can assume, so the training job can write checkpoints to the bucket and read them back. The bucket name and role ARN are exported for use when configuring the training job.