How do I checkpoint LLM training progress in S3?

This guide shows how to use Pulumi to provision the AWS infrastructure for checkpointing large language model (LLM) training: an S3 bucket that stores the checkpoints and an IAM role that lets the training job read and write them. Saving checkpoints to S3 lets an interrupted training run resume from its most recent checkpoint instead of starting over.

We will:

  1. Create an S3 bucket to store the checkpoints.
  2. Configure the bucket with versioning to keep track of different checkpoint versions.
  3. Create an IAM policy and role that grant the training job access to the bucket.

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Create an S3 bucket to store LLM training checkpoints
const bucket = new aws.s3.Bucket("llm-checkpoints", {
    versioning: {
        enabled: true, // Enable versioning to keep track of different checkpoint versions
    },
    tags: {
        Environment: "dev",
        Project: "LLM Training",
    },
});

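// Define a managed IAM policy granting read/write access to the checkpoint bucket
const bucketPolicy = new aws.iam.Policy("bucketPolicy", {
    policy: bucket.arn.apply(arn => JSON.stringify({
        Version: "2012-10-17",
        Statement: [
            {
                // Bucket-level action: list the checkpoints in the bucket
                Effect: "Allow",
                Action: ["s3:ListBucket"],
                Resource: [arn],
            },
            {
                // Object-level actions: upload and download checkpoint objects
                Effect: "Allow",
                Action: ["s3:PutObject", "s3:GetObject"],
                Resource: [`${arn}/*`],
            },
        ],
    })),
});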

// Create an IAM role that the SageMaker service can assume on behalf of the training job
const trainingJobRole = new aws.iam.Role("trainingJobRole", {
    assumeRolePolicy: {
        Version: "2012-10-17",
        Statement: [
            {
                Effect: "Allow",
                Principal: {
                    Service: "sagemaker.amazonaws.com",
                },
                Action: "sts:AssumeRole",
            },
        ],
    },
});

// Attach the policy to the role
new aws.iam.RolePolicyAttachment("trainingJobPolicyAttachment", {
    role: trainingJobRole.name,
    policyArn: bucketPolicy.arn,
});

// Export the bucket name and IAM role ARN for use in training jobs
export const bucketName = bucket.bucket;
export const trainingJobRoleArn = trainingJobRole.arn;
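
The two exported values are what a training job needs in order to write checkpoints. As a minimal sketch, assuming the job runs on SageMaker (the service trusted by the role above), the helper below assembles the checkpoint-related part of a CreateTrainingJob request. `RoleArn`, `CheckpointConfig`, `S3Uri`, and `LocalPath` are fields of that API; the function name, the `runId` parameter, and the `checkpoints/<runId>/` prefix are assumptions made for illustration.

```typescript
// Checkpoint-related subset of a SageMaker CreateTrainingJob request.
interface CheckpointSettings {
    RoleArn: string;
    CheckpointConfig: {
        S3Uri: string;     // where SageMaker stores the checkpoints
        LocalPath: string; // directory inside the training container
    };
}

// Hypothetical helper: combine the stack outputs with a run identifier.
function buildCheckpointSettings(
    bucketName: string,
    roleArn: string,
    runId: string,
): CheckpointSettings {
    // Keep the run ID to a single path segment so each run gets its own prefix.
    if (!/^[A-Za-z0-9._-]+$/.test(runId)) {
        throw new Error(`runId must be a single path segment: ${runId}`);
    }
    return {
        RoleArn: roleArn,
        CheckpointConfig: {
            // Assumed layout: one prefix per training run.
            S3Uri: `s3://${bucketName}/checkpoints/${runId}/`,
            // SageMaker's default checkpoint directory in the container.
            LocalPath: "/opt/ml/checkpoints",
        },
    };
}
```

The values for `bucketName` and `roleArn` would come from the stack outputs, for example via `pulumi stack output bucketName`. Files that the training script writes to `LocalPath` are then copied to `S3Uri` by SageMaker.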

Key Points

  • S3 Bucket: Stores the LLM training checkpoints, with versioning enabled so that overwritten checkpoints remain recoverable.
  • IAM Policy: Grants permission to list the bucket and to read and write its objects.
  • IAM Role: Can be assumed by SageMaker on behalf of the training job; the policy is attached to this role.
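
The guide does not prescribe how checkpoint objects are named inside the bucket. One possible convention, sketched below, writes each checkpoint under its own key with a zero-padded step number, so that listing order matches training order and the newest checkpoint is easy to find when resuming. The key layout, the `.pt` extension, and all function names are hypothetical.

```typescript
// Assumed key layout: checkpoints/<runId>/step-<zero-padded step>.pt
const STEP_WIDTH = 10;

// Build the object key for the checkpoint taken at a given training step.
function checkpointKey(runId: string, step: number): string {
    if (!Number.isInteger(step) || step < 0) {
        throw new Error(`step must be a non-negative integer, got ${step}`);
    }
    const padded = String(step).padStart(STEP_WIDTH, "0");
    return `checkpoints/${runId}/step-${padded}.pt`;
}

// Extract the step number from a key, or undefined if the key is not a checkpoint.
function parseStep(key: string): number | undefined {
    const match = /\/step-(\d+)\.pt$/.exec(key);
    return match ? Number(match[1]) : undefined;
}

// Given keys returned by a bucket listing, pick the checkpoint with the highest step.
function latestCheckpoint(keys: string[]): string | undefined {
    let best: { key: string; step: number } | undefined;
    for (const key of keys) {
        const step = parseStep(key);
        if (step !== undefined && (best === undefined || step > best.step)) {
            best = { key, step };
        }
    }
    return best === undefined ? undefined : best.key;
}
```

With one key per step, bucket versioning mainly guards against accidental overwrites. A job that instead rewrites a single key would rely on versioning to keep the earlier checkpoints.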

Summary

In this guide, we created a versioned S3 bucket to store LLM training checkpoints, along with an IAM policy and a role that SageMaker can assume to read and write them. A training job running under that role can save checkpoints to the bucket and retrieve them later to resume training.
