
These parameters need tuning based on the instance type:

  • --conf spark.dynamicAllocation.enabled=true|false
  • --conf spark.executor.instances=2 (or --num-executors 2)
  • --conf spark.executor.memory=2g (or --executor-memory 2g)
  • --conf spark.executor.cores=2 (or --executor-cores 2)

spark.yarn.executor.memoryOverhead = max(384 MB, 0.10 * spark.executor.memory)
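As a sanity check, the overhead default can be computed directly. This is a sketch of the formula above (the 384 MB floor is YARN's default), not Spark's actual implementation:

```java
// Sketch of the default: max(384 MB, 10% of spark.executor.memory).
public class MemoryOverhead {
    static long memoryOverheadMb(long executorMemoryMb) {
        // For small executors the 384 MB floor wins.
        return Math.max(384L, (long) (0.10 * executorMemoryMb));
    }
}
```

So a 2g executor still gets the 384 MB floor, while an 8g executor gets roughly 819 MB of overhead.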

Control Parallelism

  • --conf spark.default.parallelism=200
  • --conf spark.sql.shuffle.partitions=200
  • Minimum Partitions
                    
      /**
       * Default min number of partitions for Hadoop RDDs when not given by user
       * Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
       * The reasons for this are discussed in https://github.com/mesos/spark/pull/718
       */
      def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
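A standalone transcription of that clamp (illustrative Java, not Spark's API) makes the behavior concrete: no matter how large defaultParallelism grows, the minimum partition count is capped at 2.

```java
// Illustrative transcription of Spark's defaultMinPartitions clamp:
// the result never exceeds 2, regardless of defaultParallelism.
public class MinPartitions {
    static int defaultMinPartitions(int defaultParallelism) {
        return Math.min(defaultParallelism, 2);
    }
}
```

This is why a small file read with sc.textFile still gets 2 partitions by default, even on a cluster with spark.default.parallelism set to 200.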
                    
                    

How much memory does each task get?

  • --conf spark.memory.fraction=0.6
  • --conf spark.memory.storageFraction=0.5

Task Memory = (spark.executor.memory * spark.memory.fraction * (1 - spark.memory.storageFraction)) / spark.executor.cores

(spark.memory.storageFraction is the share of the unified region reserved for storage; execution memory for tasks comes from the remainder, split across the executor's concurrent task slots.)
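For example, with a 4 GB executor running 2 concurrent tasks under the default fractions above, each task gets about 0.6 GB of execution memory. This is a sketch assuming the unified memory model (Spark 1.6+), where execution memory is what remains of the unified region after the storage share:

```java
// Per-task execution memory sketch under the unified memory model.
// The values below are the defaults quoted above; 4 GB / 2 cores are
// illustrative assumptions.
public class TaskMemory {
    static double taskMemoryGb(double executorMemoryGb, double memoryFraction,
                               double storageFraction, int executorCores) {
        // Unified region minus the storage share, divided across task slots.
        return executorMemoryGb * memoryFraction * (1 - storageFraction)
                / executorCores;
    }
}
```

taskMemoryGb(4.0, 0.6, 0.5, 2) works out to 4.0 * 0.6 * 0.5 / 2 = 0.6 GB per task.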

Tiny executors [One Executor per core]

Cluster: 10 × m4.4xlarge nodes (16 CPU cores, 64 GB RAM each); 1 core and 1 GB per node are reserved for OS/Hadoop/YARN, leaving 15 cores and 63 GB usable.

    --executor-cores             1
    Executors per node           15     (15 usable cores ÷ 1 core each)
    --num-executors              149    (15 × 10 nodes - 1 for the YARN ApplicationMaster)
    Executor memory + overhead   4 GB   (63 GB ÷ 15 executors)
    --executor-memory            3 GB   (~90% of 4 GB; the rest is overhead)

Fat executors [One Executor per node]

Cluster: 10 × m4.4xlarge nodes (16 CPU cores, 64 GB RAM each); 1 core and 1 GB per node are reserved for OS/Hadoop/YARN, leaving 15 cores and 63 GB usable.

    --executor-cores             15
    Executors per node           1      (all 15 usable cores in one executor)
    --num-executors              9      (1 × 10 nodes - 1 for the YARN ApplicationMaster)
    Executor memory + overhead   63 GB  (all usable RAM on the node)
    --executor-memory            56 GB  (~90% of 63 GB; the rest is overhead)

Balanced Fat and Tiny

Cluster: 10 × m4.4xlarge nodes (16 CPU cores, 64 GB RAM each); 1 core and 1 GB per node are reserved for OS/Hadoop/YARN, leaving 15 cores and 63 GB usable.

    --executor-cores             4
    Executors per node           3      (15 usable cores ÷ 4, rounded down)
    --num-executors              29     (3 × 10 nodes - 1 for the YARN ApplicationMaster)
    Executor memory + overhead   21 GB  (63 GB ÷ 3 executors)
    --executor-memory            18 GB  (~90% of 21 GB; the rest is overhead)
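All three layouts follow the same arithmetic. The sketch below reproduces the calculated values, under the assumptions used throughout: 1 core and 1 GB per node reserved for OS/Hadoop/YARN, one executor slot given up to the YARN ApplicationMaster, and ~10% of each executor's memory carved out as overhead:

```java
// Executor-sizing sketch for the tiny/fat/balanced layouts.
// Assumptions (not official formulas): reserve 1 core + 1 GB per node,
// subtract 1 executor for the YARN AM, keep ~10% of memory as overhead.
public class ExecutorSizing {
    // Returns {executorsPerNode, numExecutors, memWithOverheadGb, executorMemoryGb}
    static int[] size(int nodes, int coresPerNode, int ramGbPerNode,
                      int coresPerExecutor) {
        int usableCores = coresPerNode - 1;               // e.g. 16 -> 15
        int usableRamGb = ramGbPerNode - 1;               // e.g. 64 -> 63
        int execsPerNode = usableCores / coresPerExecutor;
        int numExecutors = execsPerNode * nodes - 1;      // -1 for the AM
        int memWithOverhead = usableRamGb / execsPerNode; // GB per executor
        int executorMemory = (int) Math.floor(memWithOverhead * 0.90);
        return new int[]{execsPerNode, numExecutors, memWithOverhead, executorMemory};
    }
}
```

size(10, 16, 64, 1) yields the tiny layout (15 per node, 149 executors, 3 GB each), size(10, 16, 64, 15) the fat one (9 executors, 56 GB each), and size(10, 16, 64, 4) the balanced one (29 executors, 18 GB each).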

Happy Coding!!!

						
      package org.apache.hadoop.mapred;

      /* import statements */

      public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {
          // ...
          protected long computeSplitSize(long goalSize,
                                          long minSize,
                                          long blockSize) {
              return Math.max(minSize, Math.min(goalSize, blockSize));
          }
          // ...
      }
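To see that clamp in isolation, here is a standalone transcription (sizes in bytes; the 128 MB block size below is just an illustrative HDFS default, and the file/split numbers are assumptions):

```java
// Standalone copy of Hadoop's computeSplitSize logic, to show how the
// three inputs interact. All sizes are in bytes.
public class SplitSize {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }
}
```

A 1 GB file asked to produce 2 splits has a goal size of 512 MB per split, but a 128 MB block size caps each split at 128 MB; a 64 MB goal size smaller than the block passes through unchanged.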

How Do Hadoop & Spark Control Partitioning?

To Be Continued...