Using AWS S3 with EMR Serverless
AWS S3 (Simple Storage Service) is an object storage service offering scalability, data availability, security, and performance. It can be used to store and retrieve any amount of data at any time. AWS EMR Serverless is a serverless deployment option in Amazon EMR that makes it easy to run applications without managing clusters or servers.
In the context of using S3 with EMR Serverless, S3 can act as a data lake where input data is stored for processing, and results can be saved back into S3 after the computation is complete.
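To make the input/output idea concrete, here is a minimal sketch of one possible prefix layout inside a single bucket. The helper name `dataLakeLayout`, the bucket name `my-data-bucket`, and the `input/` and `output/` prefixes are assumptions made for illustration; only the `logs` prefix appears in the program later in this article.

```typescript
// Hypothetical layout for one bucket shared by inputs, results, and logs.
// Only the "logs" prefix comes from the article; the others are assumed names.
interface DataLakeLayout {
    inputUri: string;  // where jobs read their source data
    outputUri: string; // where jobs write their results
    logUri: string;    // where EMR Serverless writes logs
}

function dataLakeLayout(bucketName: string): DataLakeLayout {
    const base = `s3://${bucketName}`;
    return {
        inputUri: `${base}/input/`,
        outputUri: `${base}/output/`,
        logUri: `${base}/logs`,
    };
}

// Example with a made-up bucket name.
const layout = dataLakeLayout("my-data-bucket");
console.log(layout);
```

Keeping all three locations under one bucket is only one option; separate buckets for raw data and results would work just as well and can simplify access control.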
Here is a Pulumi program in TypeScript that demonstrates how you might set up AWS S3 to work together with AWS EMR Serverless. The program will do the following:
- Create an S3 bucket to store data.
- Set up a serverless application with AWS EMR Serverless that uses that S3 bucket. The program points the application's logs at the bucket; the same bucket can also hold the input and output data for jobs run by the EMR Serverless application.
Before you run this code, make sure your AWS credentials are configured properly and that you have the permissions required to create these resources.
```typescript
import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";

// Create a private S3 bucket for data and logs
const dataBucket = new aws.s3.Bucket("dataBucket", {
    acl: "private",
    // Enable versioning so earlier object versions can be recovered
    versioning: {
        enabled: true,
    },
});

// EMR Serverless application
const emrServerlessApp = new aws.emrserverless.Application("emrServerlessApp", {
    type: "SPARK", // Application type, e.g. SPARK or HIVE
    releaseLabel: "emr-6.6.0", // EMR release label
    // Stop the application after 60 idle minutes to avoid unnecessary charges
    autoStopConfiguration: {
        enabled: true,
        idleTimeoutMinutes: 60,
    },
    // Upper bound on the resources the application may use
    maximumCapacity: {
        cpu: "4 vCPU",
        memory: "16 GB",
    },
    // Pre-initialized workers; for Spark the worker types are "Driver" and "Executor"
    initialCapacities: [{
        initialCapacityType: "Driver",
        initialCapacityConfig: {
            workerCount: 1, // Number of workers
            workerConfiguration: {
                cpu: "2 vCPU",
                memory: "4 GB",
            },
        },
    }],
    // VPC networking for the application
    networkConfiguration: {
        subnetIds: ["subnet-xxxxxxxxxx"], // Replace with your subnet IDs
        securityGroupIds: ["sg-xxxxxxxxx"], // Replace with your security group IDs
    },
    // S3 location for the application logs.
    // Assumes a provider version that exposes monitoringConfiguration on this resource.
    monitoringConfiguration: {
        s3MonitoringConfiguration: {
            logUri: pulumi.interpolate`s3://${dataBucket.bucket}/logs`,
        },
    },
});

// Export the names and ARN of the created resources
export const bucketName = dataBucket.bucket;
export const bucketArn = dataBucket.arn;
export const emrServerlessAppName = emrServerlessApp.name;
```
In this program, the `aws.s3.Bucket` resource creates a new private S3 bucket with versioning enabled, ensuring that you can track and recover previous versions of your data.

The `aws.emrserverless.Application` resource defines an EMR Serverless application, in this case using Spark. The program specifies that the application should automatically stop after 60 minutes of idleness to avoid unnecessary charges, defines the maximum and initial capacity, and sets up the networking configuration.

You need to replace the placeholder values for `subnetIds` and `securityGroupIds` with the actual IDs corresponding to your VPC configuration.

The program uses the `pulumi.interpolate` syntax to build a string from a resource output, in this case setting the log URI to point to the newly created S3 bucket.

To run this program, save it in a TypeScript file (e.g., `index.ts`), install the required dependencies using npm or Yarn, and then use the Pulumi CLI to create the stack and deploy your program with `pulumi up`.

With this setup, you have S3 storage and EMR Serverless compute provisioned together, ready for running big data applications without the need to manage servers or clusters.
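The program above does not create the IAM role that jobs run under. For a job to read from and write to the bucket, EMR Serverless needs a job execution role that it can assume and that grants access to the bucket. The sketch below builds the two policy documents as plain objects. The function names and the example ARN are hypothetical, and the action list is one reasonable starting point rather than a definitive policy.

```typescript
// Minimal IAM policy document types (only the fields used in this sketch).
interface PolicyStatement {
    Effect: "Allow";
    Action: string[];
    Resource?: string[];
    Principal?: { Service: string };
}

interface PolicyDocument {
    Version: "2012-10-17";
    Statement: PolicyStatement[];
}

// Trust policy: lets the EMR Serverless service assume the job execution role.
function emrServerlessTrustPolicy(): PolicyDocument {
    return {
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Principal: { Service: "emr-serverless.amazonaws.com" },
            Action: ["sts:AssumeRole"],
        }],
    };
}

// Access policy: list the bucket and read/write/delete its objects.
// Narrow the actions or add a prefix to the object ARN if jobs need less.
function bucketAccessPolicy(bucketArn: string): PolicyDocument {
    return {
        Version: "2012-10-17",
        Statement: [
            { Effect: "Allow", Action: ["s3:ListBucket"], Resource: [bucketArn] },
            {
                Effect: "Allow",
                Action: ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                Resource: [`${bucketArn}/*`],
            },
        ],
    };
}

// Example with a made-up bucket ARN.
const trustPolicy = emrServerlessTrustPolicy();
const accessPolicy = bucketAccessPolicy("arn:aws:s3:::my-data-bucket");
console.log(JSON.stringify(accessPolicy, null, 2));
```

In a Pulumi program, the bucket ARN is an output, so one way to use these helpers would be `dataBucket.arn.apply(arn => JSON.stringify(bucketAccessPolicy(arn)))`, passing the result to an `aws.iam.RolePolicy` attached to an `aws.iam.Role` whose `assumeRolePolicy` is the serialized trust policy. The role's ARN is then supplied as the execution role when a job run is started.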
Please be aware of AWS pricing and make sure to shut down or delete resources when not in use to avoid incurring costs.