
AWS EMR
Amazon EMR is a web service that uses a hosted Hadoop framework running on the web-scale infrastructure of EC2 and S3.
EMR allows researchers, developers, data analysts, and businesses to quickly and economically process large amounts of data.
EMR uses Apache Hadoop as its distributed data processing engine. Hadoop is an open-source Java software framework that supports distributed, data-intensive applications on large clusters of commodity hardware.
Ideal for large data volumes that require fast and efficient processing
Lets data crunching and analysis take center stage without worrying about the time-consuming setup, management, or tuning of Hadoop clusters or the compute capacity they rely on
Can help with data-intensive tasks such as web indexing, data mining, log file analysis, machine learning, financial analysis, and bioinformatics research
Provides a web service interface to launch clusters and monitor processing-intensive computation on them
EMR is a batch-processing framework whose typical processing times are measured in hours to days. If the use case requires processing in real time or within minutes, Apache Spark or Storm would be a better choice.
EMR seamlessly supports Reserved, Spot, or On-Demand Instances
EMR launches all cluster nodes in the same EC2 Availability Zone. This improves performance and provides a higher data access rate.
EMR supports several EC2 instance types, including Standard, High CPU, High Memory, Cluster Compute, High I/O, and High Storage.
Standard instances have memory-to-CPU ratios that are suitable for most general-purpose applications.
High CPU instances have proportionally more CPU resources than memory (RAM), and are well-suited for compute-intensive applications.
High Memory instances provide large memory sizes for high-throughput applications.
Cluster Compute instances have proportionally high CPU and increased network performance. They are well-suited for High Performance Compute applications and other demanding network-bound applications.
High Storage instances provide 48 TB storage on 24 disks. They are ideal for applications that need sequential access to large data sets, such as data warehouse and log processing.
EMR charges are assessed in hourly increments, i.e. once the cluster is up and running, charges apply for the entire hour
EMR integrates with CloudTrail to record AWS API calls
NOTE: This topic is primarily for the Solutions Architect Professional and Analytics Specialty exams only
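The hourly-increment billing described above can be illustrated with a small calculation. This is a minimal sketch only: the function names `billed_hours` and `cluster_cost`, the node count, and the hourly rate are hypothetical values chosen for illustration and are not actual AWS prices.

```python
import math


def billed_hours(runtime_minutes: float) -> int:
    """Round a cluster's runtime up to whole hours (hourly-increment billing)."""
    if runtime_minutes <= 0:
        return 0
    return math.ceil(runtime_minutes / 60)


def cluster_cost(runtime_minutes: float, node_count: int, hourly_rate: float) -> float:
    """Cost of a cluster in which every node is charged per started hour.

    hourly_rate is a hypothetical per-node price, not a real AWS rate.
    """
    return billed_hours(runtime_minutes) * node_count * hourly_rate


# A 61-minute job on 10 nodes is billed as 2 full hours per node.
print(billed_hours(61))            # 2
print(cluster_cost(61, 10, 0.25))  # 5.0
```

Under this model a job that finishes just after an hour boundary costs the same as one that runs for the whole of the next hour, so it can pay to size the cluster so that jobs finish shortly before the boundary.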
EMR Architecture
Amazon EMR uses Hadoop software, which is industry-proven and fault-tolerant, as its data processing engine
Hadoop is an open-source Java software that supports distributed data-intensive applications on large clusters using commodity hardware.
Hadoop divides the data into multiple subsets and assigns each subset to more than one EC2 instance. If an EC2 instance fails while processing a subset of data, the results from another EC2 instance can be used instead.
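The idea of assigning each subset to more than one instance can be sketched as a toy model. This is not Hadoop's actual scheduler or HDFS block placement logic; the function names, the round-robin placement, and the instance IDs are all assumptions made for illustration.

```python
def assign_subsets(subsets, instances, replicas=2):
    """Assign every subset to `replicas` distinct instances, round-robin."""
    if replicas > len(instances):
        raise ValueError("need at least as many instances as replicas")
    assignment = {}
    for i, subset in enumerate(subsets):
        assignment[subset] = [
            instances[(i + r) % len(instances)] for r in range(replicas)
        ]
    return assignment


def collect_results(assignment, failed):
    """For each subset, use the first assigned instance that has not failed."""
    results = {}
    for subset, owners in assignment.items():
        healthy = [inst for inst in owners if inst not in failed]
        if not healthy:
            raise RuntimeError(f"all copies of {subset} were lost")
        results[subset] = healthy[0]
    return results


# Hypothetical subsets and instance IDs.
assignment = assign_subsets(["s0", "s1", "s2"], ["i-a", "i-b", "i-c"])
# Instance i-a fails; every subset is still served by a surviving instance.
print(collect_results(assignment, failed={"i-a"}))
# {'s0': 'i-b', 's1': 'i-b', 's2': 'i-c'}
```

Because every subset lives on two instances here, losing any single instance still leaves one copy of each subset available.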
EMR is composed of a master node and one or more slave nodes
Master Node
EMR currently doesn’t support automatic failover of the master node or master node state recovery.
If the master node goes down, the EMR cluster is terminated and the job must be re-executed.
Slave nodes: Core nodes and Task nodes
Core nodes
Host persistent data using the Hadoop Distributed File System (HDFS) and run Hadoop tasks
Can be increased in an existing cluster
Task nodes
Only run Hadoop tasks
Can be increased or decreased in an existing cluster
EMR is fault-tolerant for slave failures, and continues job execution if the slave node goes down.
EMR currently does not automatically provision another node to take over from a failed slave.
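The node layout and resize rules above can be sketched as plain Python data. No AWS call is made here. The dictionary keys are intended to mirror the shape of the `InstanceGroups` entries accepted by boto3's `run_job_flow` and should be verified against the current API reference before use; the helper names, the `m5.xlarge` instance type, and the counts are hypothetical.

```python
def instance_group(name, role, count, instance_type="m5.xlarge", market="ON_DEMAND"):
    """Build one instance group description (hypothetical helper)."""
    if role not in {"MASTER", "CORE", "TASK"}:
        raise ValueError(f"unknown role: {role}")
    if market not in {"ON_DEMAND", "SPOT"}:
        raise ValueError(f"unknown market: {market}")
    return {
        "Name": name,
        "InstanceRole": role,
        "Market": market,
        "InstanceType": instance_type,  # assumed type, for illustration only
        "InstanceCount": count,
    }


def resize(group, new_count):
    """Return a resized copy of a group, applying the rules in these notes:
    core groups can only grow, task groups can grow or shrink."""
    role, old_count = group["InstanceRole"], group["InstanceCount"]
    if role not in {"CORE", "TASK"}:
        raise ValueError("only core and task groups are resized in this sketch")
    if role == "CORE" and new_count < old_count:
        raise ValueError("core nodes can only be increased")
    return {**group, "InstanceCount": new_count}


groups = [
    instance_group("master", "MASTER", 1),
    instance_group("core", "CORE", 2),
    instance_group("task", "TASK", 4, market="SPOT"),  # task nodes hold no HDFS data
]

print(resize(groups[1], 4)["InstanceCount"])  # 4
print(resize(groups[2], 1)["InstanceCount"])  # 1
```

Placing only the task group on Spot Instances is one possible design choice: task nodes do not host HDFS data, so losing them costs compute capacity but not stored data.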
EMR supports bootstrap actions, which allow users to run custom setup before the cluster starts processing.
This can be used to install and configure software before the cluster runs
EMR Security
EMR cluster begins with different s