Architect fault-tolerant applications with instance fleets on Amazon EMR on EC2

March 17, 2025


Organizations rely on Amazon EMR on EC2 clusters to process large-scale data workloads using frameworks like Apache Spark, Apache Hive, and Trino. Events such as TV ads or unplanned promotions can lead to an increase in demand for compute capacity, making effective capacity planning critical to make sure your workloads don’t hit capacity limits or job failures.

A common scenario is to run daily Spark jobs on Amazon EMR using consistent Amazon Elastic Compute Cloud (Amazon EC2) instance types (for example, a single instance size and family for the cluster). Although this might work well to sustain the baseline, spikes can trigger auto scaling, which narrows the chances of capacity availability when trying to stop and relaunch a larger EMR cluster, because the specific On-Demand Instance pool might lack capacity to meet the demand.

In this post, we show how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. We walk through assessing the historical compute usage of a workload and use a combination of strategies to reduce the likelihood of InsufficientCapacityExceptions (ICE) when Amazon EMR launches specific EC2 instance types. We implement flexible instance fleet strategies to reduce dependency on specific instance types and use Amazon EC2 On-Demand Capacity Reservations (ODCRs) for predictable, steady-state workloads. Following this approach can help prevent job failures due to capacity limits while optimizing your cluster for cost and performance.

Solution overview

Instance fleets in Amazon EMR offer a flexible and robust way to manage EC2 instances within your cluster. This feature lets you specify target capacities for On-Demand and Spot Instances, select up to 5 EC2 instance types per fleet (or 30 when using the AWS Command Line Interface [AWS CLI] or API with an allocation strategy), and use multiple subnets across different Availability Zones. Importantly, instance fleets support the use of ODCRs, enabling you to align your EMR clusters with pre-purchased EC2 capacity. You can configure your instance fleet to prefer or require capacity reservations, making sure that your EMR clusters use your reserved capacity efficiently.

EMR workload patterns generally fall into two categories: stable and variable (spiky). In the following sections, we explore how to optimize for each pattern using the options available with instance fleets, starting with stable workloads and then addressing variable workloads.

Stable workloads are workloads with a predictable pattern of resource usage over time; for example, a pharmaceutical provider needs to process 21 TB of research data, patient records, and other information daily. The workload is consistent and needs to run reliably every day on long-running persistent clusters. For critical business operations requiring high reliability and guaranteed capacity, we recommend reserving the baseline capacity as part of your capacity planning. We demonstrate the following steps:

Use AWS Cost and Usage Reports (AWS CUR) to estimate the baseline of existing workloads.
Reserve the baseline capacity using ODCR.
Configure Amazon EMR to use the targeted ODCR.

Spiky workloads are defined by unpredictable and often significant fluctuations in processing demands. These surges can be triggered by various factors (such as batch processing, real-time data streaming, or seasonal business fluctuations) that cause Amazon EMR to request additional capacity to match the demand. We address the resource allocation by using instance and Availability Zone flexibility, with the following steps:

Introduce EC2 instance flexibility with EMR instance fleets.
Achieve resiliency through intelligent subnet selection with EMR instance fleets.
Use managed scaling to automatically manage scaling out and in.

Stable workloads

In this section, we demonstrate how to define your baseline, configure AWS Identity and Access Management (IAM) permissions, create an ODCR, associate your reservations to a capacity group, and configure Amazon EMR to use targeted ODCRs. You can opt for a mixed ODCR strategy: for example, one ODCR with a short duration that supports the launch of your EMR cluster, and another ODCR with a longer duration that supports your task nodes based on the baseline capacity reservation.

Estimate the baseline

Make sure to activate the AWS generated cost allocation tag aws:elasticmapreduce:job-flow-id. This enables the field resource_tags_aws_elasticmapreduce_job_flow_id in the AWS CUR to be populated with the EMR cluster ID, and the field is used by the SQL queries in the solution. To activate the cost allocation tag from the AWS Billing console, complete the following steps:

On the AWS Billing and Cost Management console, choose Cost allocation tags in the navigation pane.
Under AWS generated cost allocation tags, choose the aws:elasticmapreduce:job-flow-id tag.
Choose Activate.

It can take up to 24 hours for tags to activate. For more information, see here.

After the tags are activated, you can use AWS CUR and run the following query on Amazon Athena to find the compute resources used by the EMR cluster ID vs. the timeline of usage. For more details, see Querying Cost and Usage Reports using Amazon Athena. Update the following query with your CUR table name, EMR cluster ID, desired timestamps, and AWS account ID, and run the query on Athena:

SELECT bill_payer_account_id as Payer,
    product_product_family as PFamily,
    product_product_name as PName,
    resource_tags_aws_elasticmapreduce_job_flow_id,
    line_item_usage_account_id as LinkedAccount,
    line_item_usage_start_date as UsageDate,
    bill_billing_period_start_date as BillingDate,
    SPLIT_PART(line_item_usage_type, ':', 2) AS InstanceType,
    line_item_availability_zone AS AvailabilityZone,
    COUNT(line_item_resource_id) as ResourceIDCount
FROM <YOUR_CUR_TABLE_NAME>
WHERE (
        line_item_usage_start_date BETWEEN TIMESTAMP 'YYYY-MM-DD 00:00:00'
        AND TIMESTAMP 'YYYY-MM-DD 23:59:59' 
    )
    AND line_item_operation LIKE '%%RunInstance%%'
    AND line_item_line_item_type LIKE '%%Usage%%'
    AND product_product_family NOT IN ('Data Transfer')
    AND resource_tags_aws_elasticmapreduce_job_flow_id LIKE '%%<emr-cluster-id>%%'
    AND line_item_usage_account_id IN (
        '<aws_account_id>'
)
GROUP BY 1,2,3,4,5,6,7,8,9

As an example, the preceding query filters instance usage per hour for a given account and EMR cluster over a period of 6 months, to generate the following figure. You can export the results in CSV format and analyze the data. Now that you have a visual representation of your workloads’ baseline and bursts, you can define the strategy and configuration of your EMR cluster.
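The exported CSV can also be reduced to a single baseline number. The following is a minimal Python sketch, not part of the original solution: it assumes the export keeps the UsageDate, InstanceType, and ResourceIDCount aliases from the query (Athena usually lowercases them, so the code normalizes the header), and it treats a low quantile of the hourly counts as the baseline. The sample rows and the 10% quantile are hypothetical choices for illustration.

```python
import csv
import io
from collections import defaultdict


def hourly_counts(csv_text, instance_type):
    """Sum the per-row instance counts for each usage hour of one instance type."""
    totals = defaultdict(int)
    for raw in csv.DictReader(io.StringIO(csv_text)):
        # Athena usually lowercases column aliases, so normalize the header names.
        row = {key.lower(): value for key, value in raw.items()}
        if row["instancetype"] == instance_type:
            totals[row["usagedate"]] += int(row["resourceidcount"])
    return dict(totals)


def baseline(counts, quantile=0.10):
    """Pick a low quantile of the hourly counts as the steady-state floor (assumption)."""
    ordered = sorted(counts.values())
    if not ordered:
        return 0
    return ordered[int(quantile * (len(ordered) - 1))]


# Hypothetical export rows; real data would come from the Athena result file.
SAMPLE = """UsageDate,InstanceType,AvailabilityZone,ResourceIDCount
2025-01-01 00:00:00.000,r8g.2xlarge,us-east-1a,10
2025-01-01 01:00:00.000,r8g.2xlarge,us-east-1a,12
2025-01-01 02:00:00.000,r8g.2xlarge,us-east-1a,10
2025-01-01 03:00:00.000,r8g.2xlarge,us-east-1a,30
2025-01-01 03:00:00.000,r8g.2xlarge,us-east-1b,10
2025-01-01 04:00:00.000,r8g.2xlarge,us-east-1a,11
2025-01-01 04:00:00.000,m8g.12xlarge,us-east-1a,1
"""

counts = hourly_counts(SAMPLE, "r8g.2xlarge")
print("baseline:", baseline(counts), "peak:", max(counts.values()))
```

The baseline value is a candidate for the number of instances to reserve, and the gap between baseline and peak indicates how much capacity the spiky-workload techniques later in this post need to cover.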

Create an ODCR to reserve the baseline capacity

ODCRs can be either open or targeted:

With an open ODCR, new instances and existing instances that have matching attributes (such as operating system or instance type) automatically run in the capacity reservation first.
With a targeted ODCR, instances must match the attributes of the ODCR specification and the ODCR is specifically targeted at launch. This approach is recommended if you have multiple concurrent EMR clusters consuming capacity from the shared On-Demand pool of EC2 instances. EMR clusters larger than the targeted ODCR quantity will fall back to On-Demand Instances that are in the same Availability Zone.

In this example, we use a targeted ODCR with an EMR instance fleet in the us-east-1a Availability Zone. The following diagram illustrates the workflow.

Complete the following steps:

Use the create-capacity-reservation AWS CLI command to create the ODCR and make a note of the CapacityReservationArn value in the output:

aws ec2 create-capacity-reservation \
     --availability-zone <Enter Your Availability Zone> \
     --instance-type r8g.2xlarge \
     --instance-match-criteria targeted \
     --instance-platform Linux/UNIX \
     --instance-count <enter the number of instances from your baseline estimation>

We get the following output:

{
     "CapacityReservation": {
         "CapacityReservationId": "cr-0123456f9907xxxxx",
         "OwnerId": "XXXX",
         "CapacityReservationArn": "arn:aws:ec2:us-east-1:XXXX:capacity-reservation/cr-0123456f9907xxxxx",
         "InstanceType": "r8g.2xlarge",
         "InstancePlatform": "Linux/UNIX",
         "AvailabilityZone": "us-east-1a"

 ....
     }
 }

You can use Amazon CloudWatch to monitor ODCR usage and trigger an alert for unused capacity. For more details, see Monitor Capacity Reservations usage with CloudWatch metrics.
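As an illustration of such an alert, the following Python sketch builds the keyword arguments for a CloudWatch alarm that fires when reserved instances sit idle for an hour. It is a sketch under stated assumptions: the AWS/EC2CapacityReservations namespace, the AvailableInstanceCount metric, and the CapacityReservationId dimension should be confirmed against the documentation referenced above, and the alarm name, period, and threshold are hypothetical. The function only builds a dictionary; passing it to a CloudWatch client (for example, as keyword arguments to a PutMetricAlarm call) is left to you.

```python
def unused_capacity_alarm(reservation_id, threshold=1, period_seconds=300, periods=12):
    """Build alarm settings that flag idle instances in one capacity reservation."""
    return {
        "AlarmName": f"odcr-unused-{reservation_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2CapacityReservations",    # assumed namespace, verify in docs
        "MetricName": "AvailableInstanceCount",        # assumed metric: reserved but unused
        "Dimensions": [{"Name": "CapacityReservationId", "Value": reservation_id}],
        "Statistic": "Minimum",
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        # Alarm when at least `threshold` instances stayed unused in every period.
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


alarm = unused_capacity_alarm("cr-0123456f9907xxxxx")
print(alarm["AlarmName"], alarm["Period"] * alarm["EvaluationPeriods"], "seconds")
```

Using the Minimum statistic means a brief dip to zero unused instances resets the alarm, so it only fires for capacity that stayed idle across the whole evaluation window.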

Create a resource group named EMRSparkSteadyStateGroup and make a note of the GroupArn value in the output:

aws resource-groups create-group --name EMRSparkSteadyStateGroup \
    --configuration '{"Type":"AWS::EC2::CapacityReservationPool"}' '{"Type":"AWS::ResourceGroups::Generic", "Parameters":[{"Name":"allowed-resource-types","Values":["AWS::EC2::CapacityReservation"]}]}'

We get the following output:

"Group": {
         "GroupArn": "arn:aws:resource-groups:us-east-1:XXXX:group/EMRSparkSteadyStateGroup",
         "Name": "EMRSparkSteadyStateGroup"
     }, ...

Use the following code to associate the capacity reservation with the resource group. You can have multiple capacity reservations associated with a resource group.

aws resource-groups group-resources --group EMRSparkSteadyStateGroup \
    --resource-arns arn:aws:ec2:us-east-1:XXXX:capacity-reservation/cr-0123456f9907xxxxx

As a best practice for effective management and cleanup, create a tag Purpose=EMR-Spark-Steady-State for the newly created ODCR and the resource group.

# Tag your Capacity Reservation
aws ec2 create-tags \
    --resources cr-0123456f9907xxxxx \
    --tags Key=Purpose,Value=EMR-Spark-Steady-State
# Tag your Resource Group
aws resource-groups tag \
    --arn "arn:aws:resource-groups:us-east-1:XXXX:group/EMRSparkSteadyStateGroup" --tags Purpose=EMR-Spark-Steady-State

Implement Amazon EMR with ODCR

Complete the following steps to create an EMR cluster tagged with the specific targeted ODCR:

Add the required permissions to the EMR service role before using capacity reservations. With these permissions, you can lock down the resource to the specific Amazon Resource Name (ARN) of the resource group. Use the following policy:

{
     "Version": "2012-10-17",
     "Statement": [
         {
             "Effect": "Allow",
             "Resource": "*",
             "Action": [
                 "ec2:CreateFleet",
                 "ec2:RunInstances",
                 "ec2:CreateLaunchTemplate",
                 "ec2:CreateLaunchTemplateVersion",
                 "ec2:DeleteLaunchTemplateVersions",
                 "ec2:DescribeCapacityReservations",
                 "ec2:DescribeLaunchTemplateVersions",
                 "resource-groups:ListGroupResources"
             ]
         }
     ]
 }

Configure the EMR cluster to use ODCR with instance fleets. We use the CapacityReservationOptions parameter to configure the EMR cluster, as shown in the following example:

  {
 ...
     "LaunchSpecifications": {
       "OnDemandSpecification": {
         "AllocationStrategy": "LOWEST_PRICE",
         "CapacityReservationOptions": {
           "UsageStrategy": "USE_CAPACITY_RESERVATIONS_FIRST",
           "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:us-east-1:xxxxxx:group/EMRSparkSteadyStateGroup"
         }
       }
     }
   }
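To show where this fragment sits in a complete fleet definition, the following Python sketch assembles one On-Demand instance fleet as a dictionary. It is an illustration under assumptions rather than the exact configuration used in this post: the fleet name, fleet type, target capacity, and weighted capacity are hypothetical, and the strategy strings are copied verbatim from the preceding snippet, so confirm the spelling that the EMR API or your tooling accepts before using them.

```python
import json


def on_demand_fleet(name, fleet_type, target_capacity, instance_types, group_arn,
                    allocation_strategy="LOWEST_PRICE",
                    usage_strategy="USE_CAPACITY_RESERVATIONS_FIRST"):
    """Build one instance fleet that consumes targeted ODCRs before other capacity."""
    return {
        "Name": name,
        "InstanceFleetType": fleet_type,
        "TargetOnDemandCapacity": target_capacity,
        "InstanceTypeConfigs": [
            # One capacity unit per instance (hypothetical weighting).
            {"InstanceType": instance_type, "WeightedCapacity": 1}
            for instance_type in instance_types
        ],
        "LaunchSpecifications": {
            "OnDemandSpecification": {
                # Strategy strings as written in the snippet above; verify before use.
                "AllocationStrategy": allocation_strategy,
                "CapacityReservationOptions": {
                    "UsageStrategy": usage_strategy,
                    "CapacityReservationResourceGroupArn": group_arn,
                },
            }
        },
    }


core_fleet = on_demand_fleet(
    name="core-steady-state",  # hypothetical fleet name
    fleet_type="CORE",
    target_capacity=10,        # hypothetical baseline from the CUR analysis
    instance_types=["r8g.2xlarge"],
    group_arn="arn:aws:resource-groups:us-east-1:XXXX:group/EMRSparkSteadyStateGroup",
)
print(json.dumps(core_fleet, indent=2))
```

Keeping the instance types in the fleet aligned with the instance type of the reservation (r8g.2xlarge in this post) is what lets the reserved capacity actually be consumed.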

The following step-by-step breakdown illustrates the Amazon EMR decision-making process when prioritizing targeted capacity reservations, from core node provisioning through task node allocation:

Cluster provisioning initiation:
- The user chooses to override the lowest-price allocation strategy.
- The user specifies targeted capacity reservations in the launch request.
Core node provisioning:
- Amazon EMR evaluates all EC2 instance capacity pools with targeted capacity reservations, and selects the pool with the lowest price that has sufficient capacity for all requested core nodes.
- If no pool with targeted reservations has sufficient capacity, Amazon EMR reevaluates all specified EC2 instance capacity pools and selects the lowest-priced pool with sufficient capacity for core nodes. Available open capacity reservations are applied automatically.
Availability Zone selection:
- After the core capacity is acquired, Amazon EMR locks in the Availability Zone for your cluster.
Primary and task node provisioning:
- Amazon EMR evaluates EC2 instance capacity pools within that Availability Zone for primary and task fleets. First, Amazon EMR evaluates all the pools with targeted ODCRs specified in the request, ordered by lowest price by default.
- From the ordered list, Amazon EMR launches as much capacity as possible from the unused targeted ODCRs of each instance pool until the request is fulfilled.
- If the unused targeted ODCRs don’t fulfill the request yet, Amazon EMR continues to launch the remaining capacity into On-Demand pools, in lowest-price order by default.
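The primary and task node step above can be sketched as a small simulation. This is a simplified model for intuition only, not the Amazon EMR implementation: it assumes a single Availability Zone, one capacity unit per instance, and a known amount of unused reserved and On-Demand capacity per pool. The Pool class, the prices, and the capacities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    instance_type: str
    price: float              # hypothetical hourly price
    reserved_unused: int      # unused targeted ODCR capacity in this pool
    on_demand_available: int  # assumed free On-Demand capacity in this pool


def allocate(pools, requested):
    """Fill a request from targeted ODCRs first, then On-Demand, cheapest pool first."""
    plan = []
    remaining = requested
    by_price = sorted(pools, key=lambda pool: pool.price)
    # Pass 1: drain unused targeted reservations in lowest-price order.
    for pool in by_price:
        take = min(pool.reserved_unused, remaining)
        if take:
            plan.append((pool.instance_type, "targeted-odcr", take))
            remaining -= take
    # Pass 2: launch whatever is left into On-Demand pools, again cheapest first.
    for pool in by_price:
        take = min(pool.on_demand_available, remaining)
        if take:
            plan.append((pool.instance_type, "on-demand", take))
            remaining -= take
    return plan, remaining  # remaining > 0 models an unfulfilled request


pools = [
    Pool("r8g.8xlarge", price=1.9, reserved_unused=4, on_demand_available=10),
    Pool("r7g.8xlarge", price=1.7, reserved_unused=2, on_demand_available=3),
]
plan, unfilled = allocate(pools, requested=10)
for instance_type, source, count in plan:
    print(f"{count:>2} x {instance_type} from {source}")
```

In this model the reservations in both pools are consumed before any On-Demand capacity is launched, even though the cheaper pool still has On-Demand capacity available, which mirrors the ordering described in the list.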

For more details about the allocation strategy, refer to Allocation strategy for instance fleets or Amazon EMR Support for Targeted ODCR.

Spiky workloads

Spiky workloads are defined by unpredictable and often significant fluctuations in processing demands, triggered by factors such as infrequent but resource-intensive periodic batch processing jobs. For example, a geographic information system processes location data from millions of users in real time to provide up-to-date traffic information, calculate routes, and suggest points of interest. User location data is constantly being generated, but the volume can spike dramatically during rush hour or special events, as illustrated in the following figure. This graph shows the number of used resources (Amazon EC2) by hour; it varies from 1 when the cluster scales in, waiting for jobs, to spikes of 1,000 nodes.

If you’re running spiky workloads with limited flexibility in instance type, family, and Availability Zone, you might face ICE errors when the available capacity can’t meet the cluster’s scaling requirements. To address this, we explore a set of best practices for EMR cluster creation to maximize availability and balance price-performance. Although spiky workloads present a unique challenge in resource management, configuring EMR instance fleets offers a powerful solution. By using diverse instance types, prioritized allocation strategies, Availability Zone flexibility, and managed scaling, organizations can create a robust, cost-effective infrastructure capable of handling unpredictable workload patterns. This configuration offers the following benefits:

Improved availability – By diversifying instance types and using multiple Availability Zones, the cluster mitigates insufficient capacity issues
Cost savings – Allocation strategies reduce costs while minimizing interruptions
Resilience for spiky workloads – Prioritizing instance generations provides seamless scaling under varying demands
Optimized performance – Managed scaling dynamically adjusts resources to meet workload demands efficiently

Introduce EC2 instance flexibility and instance fleets with a prioritized allocation strategy

Amazon EMR supports instance flexibility with instance fleet deployment. Instance fleets give you a wider variety of options and intelligence around instance provisioning. You can now provide a list of up to 30 instance types with corresponding weighted capacities and Spot bid prices (including Spot blocks) using the AWS CLI or AWS CloudFormation. Amazon EMR will automatically provision On-Demand and Spot capacity across these instance types when creating your cluster. This can make it more straightforward and cost-effective to quickly obtain and maintain your desired capacity for your clusters.

In August 2024, Amazon EMR introduced the prioritized allocation strategy to enhance instance flexibility with instance fleets. This feature lets you specify priority levels for your instance types, enabling Amazon EMR to allocate capacity to the highest-priority instances first. This strategy helps improve cost savings and reduces the time required to launch clusters, even in scenarios with limited capacity. For more details, see Amazon EMR support prioritized and capacity-optimized-prioritized allocation strategies for EC2 instances.

To maximize cost-efficiency and availability for spiky workloads, combine the price-performance advantages of new-generation instances with the broader availability of previous-generation instances. For workloads with strict latency requirements, fix the instance size to maintain consistent performance. This approach takes advantage of the strengths of both instance generations, providing flexibility and reliability while reducing the risk of capacity constraints.

For On-Demand nodes, choose the prioritized allocation strategy, so the cluster tries to use newer-generation instances first. While configuring the instance fleet, arrange instances in a prioritized order reflecting price-performance and availability trade-offs, for example:

Primary node – m8g.12xlarge > m8g.16xlarge > m7g.12xlarge > m7g.16xlarge
Core node – r8g.8xlarge > r8g.12xlarge > r7g.8xlarge > r6g.16xlarge > r5.16xlarge
Task node – r8g.8xlarge > r8g.12xlarge > r7g.8xlarge > r6g.16xlarge > r5.16xlarge

For Spot Instances, make sure the capacity-optimized-prioritized allocation strategy is selected to reduce interruptions. See the following CloudFormation template snippet for an example:

...
       "Properties": {
         "Instances": {
          "MasterInstanceFleet": {
            "Name": "cfnMaster",
            "InstanceTypeConfigs": [
               {
                 "BidPrice": "10.50",
                 "InstanceType": "m5.xlarge",
                 "Priority": "1",
 ...
             "LaunchSpecifications": {
               "SpotSpecification": {
                 "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                 "TimeoutDurationMinutes": 20,
                 "AllocationStrategy": "CAPACITY_OPTIMIZED_PRIORITIZED"
               },
               "OnDemandSpecification": {
                "AllocationStrategy": "PRIORITIZED"
               }
 ...
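A priority ordering like the core node list earlier in this section can be turned into instance type entries programmatically. The following Python sketch is illustrative only: both helper functions are hypothetical, and it assumes that a lower Priority number is tried first and that numbering may start at 0, which you should confirm in the EMR documentation for your API or template format.

```python
def parse_priority_line(line):
    """Split 'a > b > c' into an ordered list of instance types."""
    return [part.strip() for part in line.split(">") if part.strip()]


def prioritized_configs(instance_types, first_priority=0):
    """Assign ascending Priority values in list order (lower number assumed to win)."""
    return [
        {"InstanceType": instance_type, "Priority": first_priority + offset}
        for offset, instance_type in enumerate(instance_types)
    ]


CORE_ORDER = "r8g.8xlarge > r8g.12xlarge > r7g.8xlarge > r6g.16xlarge > r5.16xlarge"
configs = prioritized_configs(parse_priority_line(CORE_ORDER))
for config in configs:
    print(config["Priority"], config["InstanceType"])
```

Generating the entries from one ordered list keeps the priority numbers consistent when an instance type is added to or removed from the fleet.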

Select subnets with EMR instance fleets

When creating a cluster, specify multiple EC2 subnets within a virtual private cloud (VPC), each corresponding to a different Availability Zone. Amazon EMR provides multiple subnet (Availability Zone) options by employing subnet filtering at cluster launch, and selects one of the subnets that has adequate available IP addresses to successfully launch all instance fleets. If Amazon EMR can’t find a subnet with sufficient IP addresses to launch the whole cluster, it will prioritize the subnet that can at least launch the core and primary instance fleets.
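The selection rule in the preceding paragraph can be modeled in a few lines. This Python sketch is a simplified illustration, not the actual Amazon EMR logic: it assumes one IP address per instance, and the tie-break of preferring the subnet with the most free addresses is an assumption, because the text only says that one adequate subnet is selected. The subnet IDs and fleet sizes are hypothetical.

```python
def choose_subnet(free_ips, fleet_sizes):
    """Pick a subnet for the whole cluster, else one that fits core plus primary."""
    whole_cluster = sum(fleet_sizes.values())
    essential = fleet_sizes["primary"] + fleet_sizes["core"]
    for needed in (whole_cluster, essential):
        candidates = [subnet for subnet, ips in free_ips.items() if ips >= needed]
        if candidates:
            # Assumed tie-break: prefer the subnet with the most free addresses.
            return max(candidates, key=lambda subnet: free_ips[subnet])
    return None  # no subnet can host even the core and primary fleets


fleets = {"primary": 1, "core": 20, "task": 500}
print(choose_subnet({"subnet-a": 50, "subnet-b": 400, "subnet-c": 1200}, fleets))
print(choose_subnet({"subnet-a": 50, "subnet-b": 400}, fleets))
```

In the second call no subnet can hold all 521 instances, so the model falls back to a subnet that can at least hold the 21 core and primary instances.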

Use managed scaling

Managed scaling is another powerful feature of Amazon EMR that automatically adjusts the number of instances in your cluster based on workload demands. This makes sure that your cluster scales up during periods of high demand to meet processing requirements and scales down during idle times to save costs. With managed scaling, you can set minimum and maximum scaling limits, giving you control over costs while benefiting from an optimized and efficient cluster performance.
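As a concrete illustration of those limits, the following Python sketch builds a managed scaling policy in the shape used by the EMR managed scaling API (a ComputeLimits block with minimum and maximum capacity units). Treat it as a sketch under assumptions: the helper function is hypothetical, the 1 to 1,000 range echoes the spiky workload described earlier, and the On-Demand and core caps are example values only.

```python
def managed_scaling_policy(minimum, maximum, max_on_demand=None, max_core=None):
    """Build a managed scaling policy for a cluster that uses instance fleets."""
    if minimum > maximum:
        raise ValueError("minimum capacity cannot exceed maximum capacity")
    limits = {
        "UnitType": "InstanceFleetUnits",  # capacity measured in fleet units
        "MinimumCapacityUnits": minimum,
        "MaximumCapacityUnits": maximum,
    }
    # Optional caps must stay within the overall maximum to be meaningful.
    for key, value in (("MaximumOnDemandCapacityUnits", max_on_demand),
                       ("MaximumCoreCapacityUnits", max_core)):
        if value is not None:
            if value > maximum:
                raise ValueError(f"{key} cannot exceed the maximum capacity")
            limits[key] = value
    return {"ComputeLimits": limits}


policy = managed_scaling_policy(minimum=1, maximum=1000, max_on_demand=200, max_core=20)
print(policy)
```

Capping On-Demand units below the overall maximum lets Spot capacity absorb most of a spike, and capping core units keeps scale-out on task nodes, which do not store HDFS data.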

The following workflow illustrates Amazon EMR configured with instance fleets and managed scaling.

The workflow consists of the following steps:

The user defines the EMR instance configurations and instance types, along with their launch priority.
The user selects subnets for the Amazon EMR configuration to provide Availability Zone flexibility.
Amazon EMR calls the Amazon EC2 Fleet API to provision instances based on the allocation strategy.
The EMR instance fleet is launched.
The cycle is repeated for scaling operations within the launched Availability Zone, providing optimized performance and scalability.

Conclusion

In this post, we demonstrated how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. As you implement any of the preceding strategies, remember to continuously monitor your cluster’s performance and adjust configurations based on your specific workload patterns and business needs. With the right approach, the challenges of spiky workloads can be transformed into opportunities for optimized performance and cost savings.

To effectively manage workloads with both baseline demands and unexpected spikes, consider implementing a hybrid approach in Amazon EMR. Use ODCRs for consistent baseline capacity and configure instance fleets with a strategic mix of ODCR, On-Demand, and Spot Instances prioritizing ODCR usage.

Try these strategies with your own use case, and leave your questions in the comments.

About the Authors

Deepmala Agarwal works as an AWS Data Specialist Solutions Architect. She is passionate about helping customers build out scalable, distributed, and data-driven solutions on AWS. When not at work, Deepmala likes spending time with family, walking, listening to music, watching movies, and cooking!

Suba Palanisamy is a Senior Technical Account Manager, helping customers achieve operational excellence on AWS. Suba is passionate about all things data and analytics. She enjoys traveling with her family and playing board games.

Flavio Torres is a Principal Technical Account Manager at AWS. Flavio helps Enterprise Support customers design, deploy, and scale resilient cloud applications. Outside of work, he enjoys hiking and barbecuing.
