Reposted from: http://stackoverflow.com/questions/13331722/how-to-sort-numerically-in-hadoops-shuffle-sort-phase
Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.
- Add -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator to the streaming command.
- Specify the type of sorting you need with mapred.text.key.comparator.options. Two useful options are -n (numeric sort) and -r (reverse sort).
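To see why the comparator options matter, here is a small local sketch in plain Python (not Hadoop code). It assumes the keys are text strings of ASCII digits, so that ordinary string sorting approximates the default text ordering, and it only emulates the ordering that -n and -r ask for. The function names text_order and numeric_order are made up for this illustration.

```python
# Local illustration only: emulates the key ordering requested by -n and -r.
# It does not call Hadoop; the sample keys are illustrative.
keys = ["1", "11", "2", "20", "7", "3", "40"]


def text_order(keys):
    """Order keys as plain strings, which is how text keys sort by default."""
    return sorted(keys)


def numeric_order(keys, reverse=False):
    """Order keys by integer value (like -n); reverse=True mimics adding -r."""
    return sorted(keys, key=int, reverse=reverse)


if __name__ == "__main__":
    print(text_order(keys))                   # string comparison: "11" sorts before "2"
    print(numeric_order(keys))                # numeric comparison
    print(numeric_order(keys, reverse=True))  # numeric comparison, descending
```

With string comparison "11" lands before "2", which is the behaviour the numeric option is meant to avoid.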
EXAMPLE :
Create an identity mapper and reducer with the following code
This is the mapper.py & reducer.py
#!/usr/bin/env python
import sys

# Identity step: echo every input line unchanged (minus surrounding whitespace)
for line in sys.stdin:
    print(line.strip())
Note: you can actually use cat for this as well :-)
This is the input.txt
1
11
2
20
7
3
40
This is the Streaming command
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-n \
    -input /user/input.txt \
    -output /user/output.txt \
    -file ~/mapper.py \
    -mapper ~/mapper.py \
    -file ~/reducer.py \
    -reducer ~/reducer.py
And you will get the required output
1
2
3
7
11
20
40
NOTE :
- I have used a simple single-key input. If you have multiple keys and/or partitions, you will have to adjust mapred.text.key.comparator.options accordingly. Since I do not know your use case, my example is limited to this.
- An identity mapper is needed because an MR job needs at least one mapper to run.
- An identity reducer is needed because the shuffle/sort phase does not run in a pure map-only job.
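For the multiple-key case mentioned in the first note, the comparator options use a Unix sort style field syntax, for example -k1,1n -k2,2nr (first field numeric ascending, second field numeric descending). The sketch below is a local Python emulation of that ordering, not Hadoop code. The tab-separated sample records and the helper name sort_key are assumptions made for illustration. In a real streaming job you would probably also have to tell Hadoop how many fields make up the key (for instance with stream.num.map.output.key.fields), which is not shown here.

```python
# Local illustration only: emulates the ordering of "-k1,1n -k2,2nr".
# The records below are hypothetical two-field keys separated by a tab.
records = ["2\t5", "1\t3", "2\t9", "1\t8", "10\t1"]


def sort_key(line, sep="\t"):
    """Build a sort key: field 1 ascending, field 2 descending, both numeric."""
    first, second = line.split(sep)[:2]
    # Negating the second field turns an ascending sort into a descending one.
    return (int(first), -int(second))


ordered = sorted(records, key=sort_key)

if __name__ == "__main__":
    for line in ordered:
        print(line)
```

Records with the same first field stay grouped together, and within each group the larger second field comes first.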