MapReduce原理浅析
函数 | 输入 | 输出 | 说明 |
---|---|---|---|
Map | <k1, v1> | List(<k2,v2>) | 1. 将小数据集进一步解析成一批 <key,value> 对,输入 Map 函数中进行处理。 2. 每一个输入的 <k1,v1> 会输出一批 <k2,v2>。 |
Reduce | <k2,List(v2)> | <k3,v3> | 输入的中间结果 <k2,List(v2)> 中的 |
MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).
“Map” step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
“Reduce” step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice it is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of ‘reducers’ can perform the reduction phase – provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than “commodity” servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.
Dataflow
The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:
- an input reader
- a Map function
- a partition function
- a compare function
- a Reduce function
- an output writer
[edit] Input reader
The input reader divides the input into appropriate size ‘splits’ (in practice typically 16 MB to 128 MB) and the framework assigns one split to each Map function. The input reader reads data from stable storage (typically a distributed file system) and generates key/value pairs.
A common example will read a directory full of text files and return each line as a record.
[edit] Map function
Each Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.
If the application is doing a word count, the map function would break the line into words and output a key/value pair for each word. Each output pair would contain the word as the key and “1” as the value.
[edit] Partition function
Each Map function output is allocated to a particular reducer by the application’s partition function for sharding purposes. The partition function is given the key and the number of reducers and returns the index of the desired reduce.
A typical default is to hash the key and modulo the number of reducers. It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load balancing purposes, otherwise the MapReduce operation can be held up waiting for slow reducers to finish.
Between the map and reduce stages, the data is shuffled (parallel-sorted / exchanged between nodes) in order to move the data from the map node that produced it to the shard in which it will be reduced. The shuffle can sometimes take longer than the computation time depending on network bandwidth, CPU speeds, data produced and time taken by map and reduce computations.
[edit] Comparison function
The input for each Reduce is pulled from the machine where the Map ran and sorted using the application’s comparison function.
[edit] Reduce function
The framework calls the application’s Reduce function once for each unique key in the sorted order. The Reduce can iterate through the values that are associated with that key and output 0 or more values.
In the word count example, the Reduce function takes the input values, sums them and generates a single output of the word and the final sum.
[edit] Output writer
The Output Writer writes the output of the Reduce to stable storage, usually a distributed file system.
源地址:http://blog.renren.com/GetEntry.do?id=743060929&owner=253392392