EC2 Spot Instance Fleet for Cost-Effective AI Experiments
Creating an EC2 Spot Instance Fleet is a practical way to reduce the cost of compute-intensive AI experiments. Spot Instances run on spare EC2 capacity at prices below standard On-Demand pricing, and you can set a maximum price you are willing to pay.
To achieve this in Pulumi using Python, we'll use the aws.ec2.SpotFleetRequest resource from the pulumi_aws package. A Spot Fleet Request lets you manage a fleet of Spot Instances across multiple instance types and Availability Zones, which helps keep your capacity needs met even when some instance types become unavailable because of price or capacity.
Here is a step-by-step outline of what the Pulumi Python program below does:
- Define the Spot Fleet configuration: You define the desired settings for the fleet, such as target capacity, instance type, and the maximum price you are willing to pay.
- Create the IAM role: An IAM role grants the Spot Fleet service permission to launch and manage EC2 instances on your behalf.
- Create the Spot Fleet Request: You submit the request to AWS, which launches instances according to your specifications and pricing strategy.
- Export outputs: Finally, you export useful outputs, such as the Spot Fleet request ID, which can be used to monitor or manage the fleet outside of Pulumi.
Below is the Pulumi Python program that implements an EC2 Spot Instance Fleet for AI Experiments:
import pulumi
import pulumi_aws as aws

# Define the IAM role that the Spot Fleet service assumes.
spot_fleet_role = aws.iam.Role("spotFleetRole",
    assume_role_policy="""{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "spotfleet.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }]
    }""")

# Attach the AWS managed policy for EC2 Spot Fleet to the role.
spot_fleet_attached_policy = aws.iam.RolePolicyAttachment("spotFleetAttachedPolicy",
    role=spot_fleet_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole")

# Spot fleet request configuration. Adjust the settings for your specific use case.
spot_fleet_request_config = aws.ec2.SpotFleetRequest("aiSpotFleet",
    iam_fleet_role=spot_fleet_role.arn,
    target_capacity=10,
    spot_price="0.03",
    allocation_strategy="lowestPrice",  # Choose an allocation strategy suitable for your use case.
    # The Args class is used here because it works across provider versions.
    launch_specifications=[aws.ec2.SpotFleetRequestLaunchSpecificationArgs(
        ami="ami-0a313d6098716f372",  # Replace with an AI/ML AMI of your choice.
        instance_type="m5.large",  # Choose an instance type that fits your AI workload.
        spot_price="0.03",
        key_name="my-keypair",  # Replace with your key pair for SSH access.
        # Additional settings can include VPC, subnet, EBS configurations, tags, etc.
    )],
    # Other optional settings like valid_until, replace_unhealthy_instances,
    # and load_balancers can be defined.
    wait_for_fulfillment=True,  # Waits for the spot fleet request to be fulfilled.
    # Set this to False for asynchronous fulfillment, which is usually recommended.
    # Create the fleet only after the policy is attached to the role.
    opts=pulumi.ResourceOptions(depends_on=[spot_fleet_attached_policy]),
)

# Output the Spot Fleet Request ID.
pulumi.export("spot_fleet_request_id", spot_fleet_request_config.id)
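The introduction notes that a Spot Fleet can draw on several instance types and Availability Zones, while the program above defines a single launch specification. As a minimal sketch, the helper below expands one set of base settings into one launch specification per instance type and subnet. The helper name build_launch_specs, the extra instance types, and the subnet IDs are hypothetical placeholders; the keys mirror the launch specification fields used above, plus subnet_id.

```python
from itertools import product


def build_launch_specs(ami, key_name, instance_types, subnet_ids, spot_price=None):
    """Return one launch-specification dict per (instance type, subnet) pair."""
    specs = []
    for instance_type, subnet_id in product(instance_types, subnet_ids):
        spec = {
            "ami": ami,
            "instance_type": instance_type,
            "key_name": key_name,
            "subnet_id": subnet_id,
        }
        # Only set a per-specification price when one was given explicitly.
        if spot_price is not None:
            spec["spot_price"] = spot_price
        specs.append(spec)
    return specs


specs = build_launch_specs(
    ami="ami-0a313d6098716f372",
    key_name="my-keypair",
    instance_types=["m5.large", "m5a.large", "m4.large"],  # hypothetical choices
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],  # placeholder subnet IDs
)
print(len(specs))
```

Each dictionary could then be unpacked into a launch specification, for example aws.ec2.SpotFleetRequestLaunchSpecificationArgs(**spec), and the resulting list passed as launch_specifications. This assumes the chosen AMI is compatible with every listed instance type.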
This program submits a Spot Fleet request configured for AI experiments. Modify values such as ami, instance_type, spot_price, and key_name to suit your requirements. Make sure the necessary permissions and configuration are in place in your AWS account before running the program. For ami, use an AMI that has AI/ML tools and libraries pre-installed, or one that you have configured with the software your experiments need.
Confirm that your region supports your chosen instance types and has sufficient Spot capacity. You also need a key pair for secure access to the instances. Before running the program, install the Pulumi CLI and configure AWS access locally.
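When choosing spot_price and target_capacity, it can help to work out the spending ceiling they imply. The sketch below is plain arithmetic, not an AWS API call: it assumes every instance counts as one unit of capacity (no instance weighting) and covers instance charges only, excluding EBS volumes and data transfer. The function names are illustrative.

```python
from decimal import Decimal


def max_hourly_cost(target_capacity, max_price_per_unit):
    """Upper bound on hourly instance charges: capacity units times max price."""
    return Decimal(target_capacity) * Decimal(max_price_per_unit)


def max_experiment_cost(target_capacity, max_price_per_unit, hours):
    """Upper bound on instance charges for an experiment lasting `hours` hours."""
    return max_hourly_cost(target_capacity, max_price_per_unit) * Decimal(hours)


# Values taken from the program: target_capacity=10, spot_price="0.03".
hourly = max_hourly_cost(10, "0.03")
total = max_experiment_cost(10, "0.03", 8)  # hypothetical 8-hour experiment
print(hourly, total)
```

With the values from the program, the ceiling is 0.30 USD per hour, or 2.40 USD for an eight-hour run. The actual Spot price is often below the maximum, so real charges may be lower.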
After you deploy this program with Pulumi, AWS begins launching EC2 instances until they match the target capacity you've specified, within the bounds of your maximum price. Remember that AWS can interrupt Spot Instances with a two-minute warning when it needs the capacity back, so they are best suited to fault-tolerant, interruptible workloads.
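To make use of the two-minute warning, a job running on the instance can poll the instance metadata service for the spot/instance-action document, which returns HTTP 404 until an interruption is scheduled. The following is a minimal sketch using only the standard library and an IMDSv2 session token; the function names are illustrative, and the checkpointing logic is left to your experiment code.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body):
    """Parse the JSON body of the spot/instance-action metadata document."""
    doc = json.loads(body)
    return doc["action"], doc["time"]


def fetch_instance_action(timeout=2.0):
    """Return (action, time) if an interruption is scheduled, else None.

    Only works when called from inside an EC2 instance.
    """
    # IMDSv2: obtain a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    action_req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # No interruption scheduled.
            return None
        raise


# Example of the document shape, parsed without any network access.
sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
print(parse_instance_action(sample))
```

A training loop might call fetch_instance_action() every few seconds and write a checkpoint as soon as it returns a value other than None.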