MapReduce uses three steps to divide a large dataset and feed the pieces to mappers and reducers.
InputSplit
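The first is InputSplit, which divides the input into chunks and hands them to the mappers.
By default the split follows the blocks of the data file: one block corresponds to one mapper, and the mapper is preferentially started on a machine that holds that block.
To change how splits are produced, override the getSplits method of InputFormat: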
https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/mapred/InputFormat.html
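To make the block-based default concrete, here is a minimal Python sketch of how a file could be cut into one split per block. It is a simplification, not Hadoop's actual getSplits: the function name `compute_splits` and the 300 MB / 128 MB figures are assumptions for illustration, and the real implementation also honours minimum and maximum split-size settings.

```python
def compute_splits(file_length, block_size):
    """Return (offset, length) pairs, one per block-sized split."""
    splits = []
    offset = 0
    while offset < file_length:
        # The last split may be shorter than a full block.
        length = min(block_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


MB = 1024 * 1024

# A hypothetical 300 MB file stored in 128 MB blocks: each split
# would be handed to its own map task.
for offset, length in compute_splits(300 * MB, 128 * MB):
    print(offset // MB, length // MB)
```

Each printed pair is the starting offset and length of one split in megabytes, so the number of lines equals the number of mappers under this simplified model.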
In streaming:
Parameter | Optional/Required | Description |
---|---|---|
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default |
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default |
These two parameters specify the InputFormat and OutputFormat classes to use.
Partition
Partition assigns the map output to the different reducers. A custom partitioner usually extends the class "org.apache.hadoop.mapreduce.Partitioner".
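As a rough illustration of what a partitioner does, the Python sketch below maps each key to a reducer index by hashing. It only imitates the idea behind Hadoop's default HashPartitioner, which computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; here `zlib.crc32` is an assumed stand-in for Java's `hashCode()`, and the function name `partition` and the sample records are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # crc32 is a deterministic stand-in for Java's key.hashCode().
    return zlib.crc32(key.encode("utf-8")) % num_reducers


records = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]

# Simulate the shuffle: collect records per reducer index.
buckets = {}
for key, value in records:
    buckets.setdefault(partition(key, 3), []).append((key, value))

# Every record with the same key ends up in the same bucket.
for reducer_index in sorted(buckets):
    print(reducer_index, buckets[reducer_index])
```

The property that matters is that the function is deterministic in the key, so all records sharing a key reach the same reducer regardless of which mapper emitted them.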
Grouping
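This concept is harder to grasp. Before the data reaches the reducer it is grouped once more: each group is passed to the same reducer in a single reduce call, and the key seen by that call is the key of the first record in the group.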
https://stackoverflow.com/questions/14728480/what-is-the-use-of-grouping-comparator-in-hadoop-map-reduce
In the accepted answer, a-1 and a-2 are merged by the grouping step into a single group, keyed by a-1, which the reducer processes in one call.
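The effect of grouping can be imitated in plain Python. The sketch below is a simulation, not Hadoop code: composite keys are modelled as `(natural_key, sequence)` tuples, the sort uses the full key, and the grouping step compares only the first component. The name `reduce_calls` and the sample data are assumptions for illustration.

```python
from itertools import groupby

# Composite keys (natural_key, sequence) with arbitrary payload values.
map_output = [(("a", 2), "y"), (("b", 1), "z"), (("a", 1), "x")]

# Sort phase: order records by the full composite key.
map_output.sort(key=lambda kv: kv[0])


def reduce_calls(sorted_records):
    """Yield (first_key, values) per group, grouping on the natural key only."""
    for _, group in groupby(sorted_records, key=lambda kv: kv[0][0]):
        group = list(group)
        # The key handed to the reducer is the key of the first record.
        yield group[0][0], [value for _, value in group]


for key, values in reduce_calls(map_output):
    print(key, values)
```

Here the records keyed ("a", 1) and ("a", 2) fall into one group whose key is ("a", 1), mirroring the a-1 / a-2 case, and the values arrive in the order fixed by the sort.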
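So how are Partition and Grouping handled in streaming?
In streaming, the Partitioner class can be given as a command-line parameter:
Parameter | Optional/Required | Description |
---|---|---|
-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to |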
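Partitioning can also be configured with a second set of parameters that use sort-style key options:
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D mapred.text.key.partitioner.options=-k1,2 \
This sets the field separator to ".", treats the first 4 fields of each map output line as the key, and partitions on the first and second fields. Note that mapred.text.key.partitioner.options is the older name of mapreduce.partition.keypartitioner.options. In the official streaming example these options are used together with -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner and -D map.output.key.field.separator=. (the separator the partitioner uses inside the key).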
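The following Python sketch mimics what these three options ask for, as a way to check one's understanding. It is not the KeyFieldBasedPartitioner implementation: `split_key_value` and `partition_on_fields` are hypothetical helpers, `zlib.crc32` stands in for the Java hash, and the sample lines are invented.

```python
import zlib


def split_key_value(line, separator=".", num_key_fields=4):
    """Split a map output line into (key, value) after the Nth separator."""
    parts = line.split(separator)
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


def partition_on_fields(key, num_reducers, separator=".", first=1, last=2):
    """Choose a reducer from key fields first..last (1-based, inclusive)."""
    prefix = separator.join(key.split(separator)[first - 1:last])
    # crc32 is a deterministic stand-in for the Java hash of the prefix.
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


lines = ["11.12.1.2.payload", "11.12.3.4.other", "11.14.2.3.more"]
for line in lines:
    key, value = split_key_value(line)
    print(key, value, partition_on_fields(key, 12))
```

The first two sample keys share the prefix 11.12, so they are assigned the same reducer index even though their full keys differ.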
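The sort order of the keys is set in the same style. The line below sorts by fields 1 to 2 as text, then by field 3 and by field 4 as numbers in descending order. In the official streaming example it is used together with -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator.
-D mapreduce.partition.keycomparator.options='-k1,2 -k3,3nr -k4,4nr'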
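The options follow the Linux sort command:
sort -k1,1 -k2,2n -k3,3nr # sort by column 1 first, then by column 2 as a number, then by column 3 as a number in descending order (a bare -k1 would extend the key from column 1 to the end of the line)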
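For readers more at home in Python, the same three-level ordering can be written as a sort key. This is only an analogy under simple assumptions: the rows are invented, every numeric column parses as an integer, and text comparison follows Python's default string ordering rather than the locale that sort would use.

```python
# Hypothetical rows with three columns: text, number, number.
rows = [
    ("b", "2", "5"),
    ("a", "10", "1"),
    ("a", "2", "3"),
    ("a", "2", "7"),
]

# Column 1 as text, column 2 as a number ascending,
# column 3 as a number descending (negated to reverse the order).
ordered = sorted(rows, key=lambda r: (r[0], int(r[1]), -int(r[2])))

for row in ordered:
    print("\t".join(row))
```

Converting column 2 with `int` is what the `n` flag does; without it "10" would sort before "2" as text.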
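Streaming mode has no counterpart to grouping, but partitioning can stand in for it: partition on the leading key fields and sort on the full key, so that all records sharing those leading fields reach the same reducer as one contiguous, sorted run. The reducer script then has to detect the group boundaries itself.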
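One way a streaming reducer script might recover the groups is sketched below in Python. It assumes, purely for illustration, that keys look like `natural.sequence`, that key and value are separated by a tab, and that the input is already partitioned on the natural part and sorted on the full key; `parse` and `reduce_stream` are hypothetical names.

```python
from itertools import groupby


def parse(line):
    """Split 'natural.sequence<TAB>value' into (natural, sequence, value)."""
    key, _, value = line.rstrip("\n").partition("\t")
    natural, _, sequence = key.partition(".")
    return natural, sequence, value


def reduce_stream(lines):
    """Yield (natural_key, values) once per run of equal natural keys."""
    records = (parse(line) for line in lines)
    for natural, group in groupby(records, key=lambda r: r[0]):
        yield natural, [value for _, _, value in group]


# In a real streaming job the lines would come from sys.stdin;
# a small in-memory sample is used here instead.
sample = ["a.1\tx\n", "a.2\ty\n", "b.1\tz\n"]
for natural, values in reduce_stream(sample):
    print(natural + "\t" + ",".join(values))
```

Because `groupby` only merges adjacent records, this relies on the partition and sort settings having placed each group in one contiguous run.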
Appendix: streaming command-line parameters
Parameter | Optional/Required | Description |
---|---|---|
-input directoryname or filename | Required | Input location for mapper |
-output directoryname | Required | Output location for reducer |
-mapper executable or JavaClassName | Required | Mapper executable |
-reducer executable or JavaClassName | Required | Reducer executable |
-file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes |
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default |
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default |
-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to |
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output |
-cmdenv name=value | Optional | Pass environment variable to streaming commands |
-inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class) |
-verbose | Optional | Verbose output |
-lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write) |
-numReduceTasks | Optional | Specify the number of reducers |
-mapdebug | Optional | Script to call when map task fails |
-reducedebug | Optional | Script to call when reduce task fails |