EMR Managed Clusters for Tensorflow and PySpark AI Workflows

Question

Pulumi · Accepted Answer

Creating an EMR (Elastic MapReduce) cluster on AWS can be an excellent choice for running distributed data processing tasks like those required for TensorFlow and PySpark AI workflows. EMR simplifies the process of provisioning, configuring, and tuning clusters running big data frameworks and allows you to process large amounts of data quickly and cost-effectively.

In the program below, we create an EMR cluster that could be used for TensorFlow and PySpark workflows. The cluster includes:

- An EC2 key pair for SSH access to the cluster nodes.
- Configuration for Hadoop MapReduce and YARN (Yet Another Resource Negotiator) to manage resources for distributed applications.
- Configurations for running TensorFlow and PySpark to execute distributed machine learning and data processing tasks.
- A managed scaling policy to automatically scale the EC2 instances based on workload.

Please note that EMR for Machine Learning is an advanced topic, and this code sets up the cluster infrastructural components. However, setting up TensorFlow and PySpark environments, job submission, and workflow management are separate concerns that would typically be part of the application deployment and management process. Furthermore, you should tailor the instance types, scaling policies, and configurations depending on the specific needs and cost considerations of your project.

Here's the Pulumi program in Python for creating an EMR cluster:

```python
import pulumi
import pulumi_aws as aws

# First, we need an EC2 key pair to access cluster instances securely via SSH.
key_pair = aws.ec2.KeyPair("my-key-pair",
                            public_key="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC...")

# Define the roles and profiles for EMR to access other AWS services.
# This includes creating roles for JobFlow, EC2 instances, and AutoScaling.
job_flow_role = aws.iam.Role("emr-job-flow-role",
                             assume_role_policy=aws.iam.get_policy_document(statements=[{
                                 "actions": ["sts:AssumeRole"],
                                 "principals": [{
                                     "type": "Service",
                                     "identifiers": ["elasticmapreduce.amazonaws.com"],
                                 }],
                             }]).json)

instance_profile = aws.iam.InstanceProfile("emr-instance-profile", role=job_flow_role.name)

auto_scaling_role = aws.iam.Role("emr-auto-scaling-role",
                                 assume_role_policy=aws.iam.get_policy_document(statements=[{
                                     "actions": ["sts:AssumeRole"],
                                     "principals": [{
                                         "type": "Service",
                                         "identifiers": ["application-autoscaling.amazonaws.com"],
                                     }],
                                 }]).json)

# Attach necessary policies to roles.
aws.iam.RolePolicyAttachment("emr-job-flow-role-policy",
                             policy_arn="arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole",
                             role=job_flow_role.name)

aws.iam.RolePolicyAttachment("emr-instance-profile-policy",
                             policy_arn="arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role",
                             role=job_flow_role.name)

aws.iam.RolePolicyAttachment("emr-autoscaling-role-policy",
                             policy_arn="arn:aws:iam::aws:policy/service-role/AmazonEC2AutoScalingRole",
                             role=auto_scaling_role.name)

# Define the cluster and configure applications.
emr_cluster = aws.emr.Cluster("emr-cluster",
                              release_label="emr-6.0.0",  # Use an appropriate EMR release.
                              applications=["Hadoop", "Spark"],  # Include other necessary applications.
                              ec2_attributes=aws.emr.ClusterEc2AttributesArgs(
                                  key_name=key_pair.key_name,
                                  instance_profile=instance_profile.arn,
                              ),
                              master_instance_group=aws.emr.ClusterMasterInstanceGroupArgs(
                                  instance_type="m5.xlarge",
                                  instance_count=1,
                              ),
                              core_instance_group=aws.emr.ClusterCoreInstanceGroupArgs(
                                  instance_type="m5.xlarge",
                                  instance_count=2,  # Start with a small cluster; adjust as needed.
                              ),
                              configurations="["
                                             "{\"Classification\":\"spark\",\"Properties\":{\"maximizeResourceAllocation\":\"true\"}},"
                                             "{\"Classification\":\"spark-defaults\",\"Properties\":{\"spark.executor.memory\":\"4g\",\"spark.driver.memory\":\"4g\"}},"
                                             "{\"Classification\":\"yarn-site\",\"Properties\":{\"yarn.nodemanager.vmem-check-enabled\":\"false\",\"yarn.nodemanager.pmem-check-enabled\":\"false\",\"yarn.nodemanager.vmem-pmem-ratio\":\"4\"}}"
                                             "]",
                              visible_to_all_users=True,
                              job_flow_role=job_flow_role.arn,
                              service_role=auto_scaling_role.arn,
                              tags={"Name": "emr-cluster"})

# Configure managed scaling policy to automatically resize the cluster.
managed_scaling_policy = aws.emr.ManagedScalingPolicy("emr-managed-scaling-policy",
                                                      cluster_id=emr_cluster.id,
                                                      compute_limits=[
                                                          aws.emr.ManagedScalingPolicyComputeLimitArgs(
                                                              unit_type="Instances",
                                                              minimum_capacity_units=1,
                                                              maximum_capacity_units=20,  # Set limits based on your workload.
                                                          ),
                                                      ])

pulumi.export('emr_cluster_id', emr_cluster.id)
```

This code defines and creates an EMR cluster with:

- An EC2 key pair for accessing the cluster instances.
- IAM roles for EMR to function correctly and access other AWS services.
- A master and core instance group specifying the instance types used for the cluster. The instance types and counts can be customized to fit the workload.
- Cluster applications that include Hadoop and Spark, both of which are prerequisites for running PySpark. TensorFlow would need to be set up on top of this configuration.
- A scaling policy that will manage the cluster size based on job queuing and processing load.

Please configure your AWS credentials before running this code, as Pulumi uses them to provision resources on your behalf.

After the cluster is created, you would then configure TensorFlow, PySpark, and any necessary job flows to start processing your AI workflows. Remember that the EMR configurations for TensorFlow and PySpark might require further fine-tuning, and you may need additional IAM permissions or software packages, depending on your specific use case.