How do I schedule regular data processing jobs with EMR Serverless and EventBridge?

This guide shows how to schedule recurring data processing jobs with Amazon EMR Serverless and Amazon EventBridge. It creates an EMR Serverless application and uses an EventBridge schedule to start a job run on that application once a day.

Key Points

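  • EMR Serverless: A serverless option for running big data frameworks such as Apache Spark without provisioning or managing clusters.
  • EventBridge: A serverless event bus service for event-driven architectures, which can also invoke targets on a recurring schedule.
  • IAM Role: Grants the job permission to reach its data and logs, and grants EventBridge permission to start the job.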
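
To make the IAM point concrete, the sketch below builds the kind of trust policy that lets an AWS service assume a role. The `trustPolicyFor` helper and the `TrustPolicy` type are hypothetical names introduced only for illustration; they are not part of `@pulumi/aws`, and the document that a provider helper generates may differ in detail.

```typescript
// Shape of a minimal IAM trust (assume-role) policy for one service principal.
interface TrustPolicy {
    Version: "2012-10-17";
    Statement: Array<{
        Effect: "Allow";
        Principal: { Service: string };
        Action: "sts:AssumeRole";
    }>;
}

// Hypothetical helper: returns a trust policy allowing `service` to assume the role.
function trustPolicyFor(service: string): TrustPolicy {
    return {
        Version: "2012-10-17",
        Statement: [
            {
                Effect: "Allow",
                Principal: { Service: service },
                Action: "sts:AssumeRole",
            },
        ],
    };
}

// Trust policy for the principal used by EMR Serverless.
const emrTrustPolicy = trustPolicyFor("emr-serverless.amazonaws.com");
console.log(JSON.stringify(emrTrustPolicy, null, 2));
```

A trust policy only controls who may assume the role. What the role can do once assumed is defined separately, by the permission policies attached to it.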

Below is the Pulumi program that accomplishes this:

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

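// Job execution role: EMR Serverless assumes it while the Spark job runs
const emrRole = new aws.iam.Role("emrRole", {
    assumeRolePolicy: aws.iam.assumeRolePolicyForPrincipal({ Service: "emr-serverless.amazonaws.com" }),
});

// Let the job read its script from, and write logs to, the example bucket.
// (AmazonEMRServerlessServiceRolePolicy belongs to the service-linked role and
// cannot be attached to a custom role, so an inline policy is used instead.)
const emrRolePolicy = new aws.iam.RolePolicy("emrRolePolicy", {
    role: emrRole.id,
    policy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Action: ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            Resource: ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
        }],
    }),
});

// Create an EMR Serverless application (not an EMR on EKS virtual cluster).
// EMR Serverless requires release emr-6.6.0 or later.
const emrApp = new aws.emrserverless.Application("emrApp", {
    name: "emr-serverless-app",
    releaseLabel: "emr-6.6.0",
    type: "spark",
    tags: {
        Environment: "production",
    },
});

// Role that EventBridge Scheduler assumes to start the job
const schedulerRole = new aws.iam.Role("schedulerRole", {
    assumeRolePolicy: aws.iam.assumeRolePolicyForPrincipal({ Service: "scheduler.amazonaws.com" }),
});

// Allow the scheduler to start job runs on this application and to pass the
// execution role to EMR Serverless
const eventPolicy = new aws.iam.RolePolicy("eventPolicy", {
    role: schedulerRole.id,
    policy: pulumi.interpolate`{
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "emr-serverless:StartJobRun",
                "Resource": "${emrApp.arn}"
            },
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": "${emrRole.arn}"
            }
        ]
    }`,
});

// EventBridge rules cannot target EMR Serverless directly, so the job is started
// once a day by an EventBridge Scheduler schedule that calls StartJobRun through
// a universal (aws-sdk) target.
const jobSchedule = new aws.scheduler.Schedule("jobSchedule", {
    scheduleExpression: "rate(1 day)",
    flexibleTimeWindow: { mode: "OFF" },
    target: {
        arn: "arn:aws:scheduler:::aws-sdk:emrserverless:startJobRun",
        roleArn: schedulerRole.arn,
        // Resolve the outputs first; JSON.stringify cannot serialize them directly.
        // Keys mirror the StartJobRun request in PascalCase; verify the field names
        // against the EventBridge Scheduler universal target documentation.
        input: pulumi.all([emrApp.id, emrRole.arn]).apply(([applicationId, executionRoleArn]) =>
            JSON.stringify({
                ApplicationId: applicationId,
                ExecutionRoleArn: executionRoleArn,
                Name: "emr-serverless-job",
                // Unique per invocation, so a retried delivery does not start a duplicate run
                ClientToken: "<aws.scheduler.execution-id>",
                JobDriver: {
                    SparkSubmit: {
                        EntryPoint: "s3://my-bucket/my-script.py",
                        EntryPointArguments: ["--arg1", "value1"],
                        SparkSubmitParameters: "--conf spark.executor.memory=2g --conf spark.executor.cores=2",
                    },
                },
                ConfigurationOverrides: {
                    MonitoringConfiguration: {
                        S3MonitoringConfiguration: {
                            LogUri: "s3://my-bucket/logs/",
                        },
                    },
                },
            })),
    },
});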
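
The program above uses the schedule expression `rate(1 day)`. As a rough sketch of how such expressions are formed, the helpers below build `rate(...)` and daily `cron(...)` strings. The function names `rateExpression` and `dailyCronExpression` are hypothetical and exist only for this illustration; the rules they encode (a singular unit when the value is 1, and six cron fields with `?` in the day-of-week position) follow EventBridge's schedule expression syntax.

```typescript
type RateUnit = "minute" | "hour" | "day";

// Builds a rate expression, e.g. "rate(1 day)" or "rate(12 hours)".
// EventBridge expects a singular unit for 1 and a plural unit otherwise.
function rateExpression(value: number, unit: RateUnit): string {
    if (!Number.isInteger(value) || value < 1) {
        throw new RangeError("rate value must be a positive integer");
    }
    const unitText = value === 1 ? unit : `${unit}s`;
    return `rate(${value} ${unitText})`;
}

// Builds a cron expression that fires once a day at the given time (UTC by default).
// Field order: minutes hours day-of-month month day-of-week year.
function dailyCronExpression(hour: number, minute: number = 0): string {
    if (!Number.isInteger(hour) || hour < 0 || hour > 23) {
        throw new RangeError("hour must be an integer from 0 to 23");
    }
    if (!Number.isInteger(minute) || minute < 0 || minute > 59) {
        throw new RangeError("minute must be an integer from 0 to 59");
    }
    return `cron(${minute} ${hour} * * ? *)`;
}

console.log(rateExpression(1, "day"));    // rate(1 day)
console.log(rateExpression(12, "hour"));  // rate(12 hours)
console.log(dailyCronExpression(2, 30));  // cron(30 2 * * ? *)
```

A rate expression is the simpler choice when only the interval matters. A cron expression is the better fit when the job must start at a fixed time of day, for example after an upstream export has finished.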

Summary

In this guide, we created an EMR Serverless application and scheduled a recurring data processing job on it with EventBridge. We set up the IAM permissions that the job and EventBridge need, defined an EventBridge schedule that starts the job once a day, and configured the job's entry point, Spark parameters, and log location.

With this setup, each job run starts automatically on schedule, and there are no clusters or scheduling servers to provision or maintain.
