Implementing a first word count with MapReduce
1. Data
[root@master mapreduce_wordcount_python]# ls
1.data map_new.py red_new.py run.sh The_Man_of_Property.txt
The input is a single text file, The_Man_of_Property.txt.
[root@master mapreduce_wordcount_python]# head The_Man_of_Property.txt
Preface
“The Forsyte Saga” was the title originally destined for that part of it which is called “The Man of Property”; and to adopt it for the collected chronicles of the Forsyte family has indulged the Forsytean tenacity that is in all of us. The word Saga might be objected to on the ground that it connotes the heroic and that there is little heroism in these pages. But it is used with a suitable irony; and, after all, this long tale, though it may deal with folk in frock coats, furbelows, and a gilt-edged period, is not devoid of the essential heat of conflict. Discounting for the gigantic stature and blood-thirstiness of old days, as they have come down to us in fairy-tale and legend, the folk of the old Sagas were Forsytes, assuredly, in their possessive instincts, and as little proof against the inroads of beauty and passion as Swithin, Soames, or even Young Jolyon. And if heroic figures, in days that never were, seem to startle out from their surroundings in fashion unbecoming to a Forsyte of the Victorian era, we may be sure that tribal instinct was even then the prime force, and that “family” and the sense of home and property counted as they do to this day, for all the recent efforts to “talk them out.”
......
2. Map code
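[root@master mapreduce_wordcount_python]# vim map_new.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import time  # only used by the commented-out sleep below

# Read the input line by line from standard input.
for line in sys.stdin:
    # time.sleep(1000)
    # strip() removes the trailing newline and any leading or trailing
    # whitespace or hidden characters; it is good practice for input strings.
    # split(' ') then splits the line into a list of words on single spaces.
    ss = line.strip().split(' ')
    # Emit every word with a count of 1, separated by a tab.
    for word in ss:
        print('\t'.join([word.strip(), '1']))
        # print("%s\t%s" % (word, 1))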
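To make the mapper's output concrete, the sketch below wraps the same tokenization in a function and applies it to one sample line. The name `map_line` and the sample sentence are illustrative assumptions, not part of map_new.py, and the snippet assumes Python 3.

```python
def map_line(line):
    """Return the (word, '1') pairs the mapper would emit for one input line.

    map_line is a hypothetical helper that mirrors the loop in map_new.py.
    """
    return [(word.strip(), '1') for word in line.strip().split(' ')]


# Made-up sample line, not taken from the data file.
sample = 'the man of the house\n'
for word, count in map_line(sample):
    print('\t'.join([word, count]))
```

Because split(' ') splits on every single space, consecutive spaces and blank lines produce empty words. Those records do not split into two tab-separated fields after strip(), so the reducer's length check skips them.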
3. Reduce code
[root@master mapreduce_wordcount_python]# vim red_new.py
# -*- coding: utf-8 -*-
import sys

# The word currently being counted.
cur_word = None
# Running count for cur_word.
sum = 0

for line in sys.stdin:
    # strip() removes surrounding whitespace; split('\t') splits the line on
    # tabs, because the mapper's output is tab-separated.
    ss = line.strip().split('\t')
    # A well-formed mapper record looks like ['xxxx', '1'], so skip any line
    # that does not split into exactly two fields.
    if len(ss) != 2:
        continue
    # Unpack the word and its count.
    word, cnt = ss

    # cur_word is None only on the first record; remember that word so later
    # records can be compared against it.
    if cur_word is None:
        cur_word = word

    # A different word means the count for cur_word is complete: output it,
    # then reset the state for the new word.
    if cur_word != word:
        print('\t'.join([cur_word, str(sum)]))
        cur_word = word
        sum = 0

    # Add this record's count to the running total.
    sum += int(cnt)

# Output the last word, which the loop never flushes (skipped on empty input).
if cur_word is not None:
    print('\t'.join([cur_word, str(sum)]))
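How the reducer processes words:
Records are read line by line; on the first record, cur_word = word.
While the same word repeats, its count is added to sum.
When a different word appears (for example, word2 after all of word1 has been read), the total for cur_word is output and the state is reset: cur_word = word, sum = 0.
This only works because the reducer's input is sorted by key, so all records for the same word are adjacent.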
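The same grouping logic can also be written with itertools.groupby, which likewise merges only adjacent equal keys. This is an alternative sketch, not the script above; `reduce_lines` and the sample records are hypothetical, and Python 3 is assumed.

```python
from itertools import groupby


def reduce_lines(lines):
    """Yield (word, total) pairs from sorted, tab-separated word/count lines."""
    records = (line.strip().split('\t') for line in lines)
    # Skip malformed records, as red_new.py does.
    valid = (r for r in records if len(r) == 2)
    # groupby merges only adjacent equal keys, hence the sorted-input requirement.
    for word, group in groupby(valid, key=lambda r: r[0]):
        yield word, sum(int(cnt) for _, cnt in group)


# Made-up, already sorted mapper output.
sample = ['a\t1\n', 'a\t1\n', 'b\t1\n', 'the\t1\n', 'the\t1\n', 'the\t1\n']
for word, total in reduce_lines(sample):
    print('\t'.join([word, str(total)]))
```

If the input were not sorted, a word would be emitted once per run of adjacent records rather than once overall, which is the same failure mode the cur_word comparison has.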
4. Run script
[root@master mapreduce_wordcount_python]# vim run.sh
HADOOP_CMD="/usr/local/hadoop-2.8.4/bin/hadoop"
STREAM_JAR_PATH="/usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar"

INPUT_FILE_PATH_1="/The_Man_of_Property.txt"
#INPUT_FILE_PATH_1="/1.data"
OUTPUT_PATH="/output"

$HADOOP_CMD fs -rm -r -skipTrash $OUTPUT_PATH

# Step 1.
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python map_new.py" \
    -reducer "python red_new.py" \
    -file ./map_new.py \
    -file ./red_new.py
5. Local test
[root@master mapreduce_wordcount_python]# cat The_Man_of_Property.txt | python map_new.py | sort -k1 | python red_new.py > result.local
[root@master mapreduce_wordcount_python]# wc -l result.local
16984 result.local
[root@master mapreduce_wordcount_python]# cat result.local | sort -k2 -rn | head
the 5144
of 3407
to 2782
and 2573
a 2543
he 2139
his 1912
was 1702
in 1694
had 1526
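As an independent check on the shell pipeline, the counts can also be computed in a single process with collections.Counter, using the same split(' ') tokenization. The names `count_words` and `top_n` and the sample lines are hypothetical, and Python 3 is assumed.

```python
from collections import Counter


def count_words(lines):
    """Count words with the mapper's tokenization, in one process."""
    counts = Counter()
    for line in lines:
        for word in line.strip().split(' '):
            word = word.strip()
            # The reducer drops records that do not split into exactly two
            # tab-separated fields, so skip empty words and words with a tab.
            if word and '\t' not in word:
                counts[word] += 1
    return counts


def top_n(counts, n=10):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return counts.most_common(n)


# Made-up sample lines standing in for the real file.
sample = ['the man of property\n', 'the  man\n', '\n']
for word, count in top_n(count_words(sample), n=3):
    print('\t'.join([word, str(count)]))
```

To run it on the real data, pass an open file object for The_Man_of_Property.txt to count_words instead of the sample list, and compare the result with the output of `sort -k2 -rn | head` above.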
6. Cluster test
First upload the data to HDFS.
[root@master mapreduce_wordcount_python]# hadoop fs -put The_Man_of_Property.txt /
[root@master mapreduce_wordcount_python]# bash run.sh
[root@master mapreduce_wordcount_python]# hadoop fs -ls /output
Found 2 items
-rw-r--r-- 3 root supergroup 0 2018-07-04 17:27 /output/_SUCCESS
-rw-r--r-- 3 root supergroup 183609 2018-07-04 17:27 /output/part-00000
[root@master mapreduce_wordcount_python]# hadoop fs -get /output/part-00000 .
[root@master mapreduce_wordcount_python]# cat part-00000 | sort -k2 -rn | head
the 5144
of 3407
to 2782
and 2573
a 2543
he 2139
his 1912
was 1702
in 1694
had 1526
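The top ten words match the local run. To compare every word rather than only the top ten, the two result files can be parsed and diffed; the sketch below does this on in-memory lines. `parse_counts` and `diff_counts` are hypothetical helpers, the sample records are made up, and Python 3 is assumed.

```python
def parse_counts(lines):
    """Parse tab-separated word/count lines into a {word: count} dict."""
    counts = {}
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) == 2:
            counts[fields[0]] = int(fields[1])
    return counts


def diff_counts(local, cluster):
    """Return {word: (local_count, cluster_count)} for words whose counts differ.

    A count of None means the word is missing from that side.
    """
    words = set(local) | set(cluster)
    return {w: (local.get(w), cluster.get(w))
            for w in words if local.get(w) != cluster.get(w)}


# Made-up records standing in for result.local and part-00000.
local = parse_counts(['the\t3\n', 'of\t2\n'])
cluster = parse_counts(['of\t2\n', 'the\t3\n'])
print(diff_counts(local, cluster))  # an empty dict means the results agree
```

With the real files, the lines would come from open file objects for result.local and part-00000.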