Wednesday, 17 August 2016

Guide to plan your cluster memory in MapReduce2/ Yarn

Planning memory allocation is very important in MR-2. In this post, I will explain how to plan memory resource MR-2.

Scenario: Let us image that, we have a four node Hadoop cluster.  One master and three datanodes. Each datanode has following configuration:

Memory: 48GB.
vCPU: 12.
Disk: 12.

We need to plan the number of map and reduce tasks that will be running in the cluster and in each datanode.

Understanding a Container: Container can be considered as a block or unit, that has resources like CPU,memory etc:



YARN manages all the available resources on each machine in the cluster. Based on the total available resources, YARN will allocate resource requests for applications such as MapReduce running in the cluster. YARN then provides processing capacity to each application ( map or reduce) by allocating



Core and Disk: General thumb rule is to allocate 1-2 Containers per disk and per core gives the best balance for cluster utilization. So with our example cluster each datanode with 12 disks and 12 cores, we will allow for 20 maximum containers to be allocated to each node. We will give rest to OS.

Memory: In our case, we have 3 datanode hadoop cluster each of 48GB. So as a pool if we consider, we have 144GB available in total.

(Memory in each node) x (Number of datanodes)= Total available memory.
48x3=144GB

Now, we cannot give all availabe resource to Yarn. Some memory has to be given to OS of each server. The thumb rule is allocate 8GB to OS. Remaining we have 40GB in each node. So, our calculation will be:

(Memory in each node - 8GB for OS) x (Number of datanodes)= Total memory availabe for Yarn

(48-8)x3=120GB.

The following property sets the maximum memory YARN can utilize on the node:

In yarn-site.xml:

<name>yarn.nodemanager.resource.memory-mb</name>
<value>40960</value>

The above property means that 40GB is available for Yarn on each node.

Next step is to say to Yarn, how to break up the total resources available into Containers. You do this by specifying the minimum unit of RAM to allocate for a Container. According to our calculation we need to run a maximum of 20 Containers in a node, and thus need (40 GB total RAM) / (20 of Containers) = 2 GB minimum per container:

In yarn-site.xml:

 <name>yarn.scheduler.minimum-allocation-mb</name>
 <value>2048</value>

YARN will allocate Containers with RAM amounts greater than the yarn.scheduler.minimum-allocation-mb.

Configure MapReduce 2:

When configuring MapReduce2 resource utilization, there are three aspects to consider:

1.Physical RAM limit for each Map and Reduce task
2.The JVM heap size limit for each task
3.The amount of virtual memory each task will get

We can define how much maximum memory each Map and Reduce task will utilize. Since each Map and each Reduce will run in a separate Container, these maximum memory settings should be at least equal to or more than the YARN minimum Container allocation.

For our example cluster, we have the minimum RAM for a Container (yarn.scheduler.minimum-allocation-mb) = 2 GB. We’ll thus assign 4 GB for Map task Containers, and 8 GB for Reduce tasks Containers.

In mapred-site.xml:

 <name>mapreduce.map.memory.mb</name>
 <value>4096</value>
 <name>mapreduce.reduce.memory.mb</name>
 <value>8192</value>


Each Container will run JVMs for the Map and Reduce tasks. The JVM heap size should be set to lower than the Map and Reduce memory defined above, so that they are within the bounds of the Container memory allocated by YARN.

In mapred-site.xml:

 <name>mapreduce.map.java.opts</name>
 <value>-Xmx3072m</value>
 <name>mapreduce.reduce.java.opts</name>
 <value>-Xmx6144m</value>

The above settings configure the upper limit of the physical RAM that Map and Reduce tasks will use. The virtual memory (physical + paged memory) upper limit for each Map and Reduce task is determined by the virtual memory ratio each YARN Container is allowed. This can  be set by the following configuration, and the default value is 2.1:

In yarn-site.xml:

 <name>yarn.nodemanager.vmem-pmem-ratio</name>
 <value>2.1</value>


Thus, with the above settings on our example cluster, each Map task will get the memory allocations with the following:


Total physical RAM allocated = 4 GB
JVM heap space upper limit within the Map task Container = 3 GB
Virtual memory upper limit = 4*2.1 = 8.2 GB

With YARN and MapReduce 2, there are no longer pre-configured static slots for Map and Reduce tasks. The entire cluster is available for dynamic resource allocation of Maps and Reduces as needed by the job. In our example cluster, with the above configurations, YARN will be able to allocate on each node up to 10 mappers (40/4) or 5 reducers (40/8) or a permutation within that.
 

No comments:

Post a comment

Note: only a member of this blog may post a comment.