In this tutorial, I will show you how to write a simple MapReduce program for Hadoop in the Python programming language.
Although the Hadoop framework is written in Java, Hadoop programs do not have to be written in Java: they can also be implemented in languages such as C++ or Python. However, the example program linked from the official Hadoop website is written in Jython and packaged into a JAR file, which is clearly inconvenient. In fact, it does not have to be done that way; we can program against Hadoop directly in Python. Have a look at the example located at /src/examples/python/WordCount.py and you will see what I mean.
What do we want to do?
We will write a simple MapReduce program in ordinary CPython, rather than a program written in Jython and packaged into a JAR file.
Our example will mimic WordCount and implement it in Python: it reads text files and counts how often each word occurs. The results are also written as text, each line containing a word and the number of times it occurred, separated by a tab character.
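For instance, for the small test input used later in this article ("foo foo quux labs foo bar quux"), the final output would look like this:

bar	1
foo	3
labs	1
quux	2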
Prerequisites
Before writing this program you should have a working Hadoop cluster, so that you will not get stuck later on. If you have not set one up yet, the following two concise tutorials show how to do so on Ubuntu Linux (they apply equally to other Linux and Unix distributions):
How to build a single-node Hadoop cluster with the Hadoop Distributed File System (HDFS) on Ubuntu Linux
How to build a multi-node Hadoop cluster with the Hadoop Distributed File System (HDFS) on Ubuntu Linux
Python MapReduce code
The trick to writing MapReduce code in Python is that we use Hadoop Streaming, which passes data between the Map and Reduce phases through STDIN (standard input) and STDOUT (standard output). We simply read input with Python's sys.stdin and write output with sys.stdout; Hadoop Streaming takes care of everything else. It really is that simple!
Map: mapper.py
Save the following code in /home/hadoop/mapper.py. It reads data from STDIN, splits each line into words, and emits one line per word, mapping the word to its (intermediate) count of occurrences:
Note: make sure this script has execute permission (chmod +x /home/hadoop/mapper.py).
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

Note that this script does not compute the total number of occurrences of each word. It immediately emits "<word> 1", even though a given <word> may occur multiple times in the input; the summing is left to the subsequent Reduce step (i.e. the reducer program). Of course, you are free to change the coding style to match your own habits.
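As an aside, if you want to cut down the amount of intermediate data, a common variation is to pre-aggregate the counts inside the mapper before emitting them (sometimes called an in-mapper combiner). The following is only a sketch of that idea under the same Hadoop Streaming assumptions; the file name mapper_combining.py is hypothetical and this is not part of the original example:

#!/usr/bin/env python
# mapper_combining.py -- hypothetical variant, not part of the original example
import sys

# pre-aggregate counts for this mapper's input to shrink the
# intermediate data shuffled to the reducers
counts = {}
for line in sys.stdin:
    for word in line.strip().split():
        counts[word] = counts.get(word, 0) + 1

# emit one tab-delimited "<word> <count>" line per distinct word
for word, count in counts.items():
    print '%s\t%s' % (word, count)

Because reducer.py below sums whatever integer counts it receives, this variant would work with the reducer unchanged.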
Reduce: reducer.py
Save the following code in /home/hadoop/reducer.py. This script reads the results of mapper.py from STDIN, sums up the occurrences of each word, and writes the results to STDOUT.
Again, mind the script permission: chmod +x /home/hadoop/reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
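Incidentally, Hadoop Streaming sorts the mapper output by key before it reaches the reducer, so the reducer does not strictly need to hold a dictionary of all words in memory. A hypothetical variant (reducer_groupby.py, not part of the original example) that streams over the sorted input with itertools.groupby might look like this:

#!/usr/bin/env python
# reducer_groupby.py -- hypothetical variant relying on Hadoop's sort phase
from itertools import groupby
from operator import itemgetter
import sys

def read_pairs(stdin):
    # yield the tab-delimited fields of each input line
    for line in stdin:
        yield line.strip().split('\t', 1)

# consecutive lines share the same word thanks to the sort phase, so each
# group can be summed as it streams past instead of being buffered
for word, group in groupby(read_pairs(sys.stdin), key=itemgetter(0)):
    try:
        total = sum(int(count) for _, count in group)
        print '%s\t%s' % (word, total)
    except ValueError:
        # a count was not a number (or a line had no tab); skip this group
        pass

Note that this only works when the input really is sorted by word, which Hadoop guarantees but a plain local pipe does not unless you insert sort, as in the tests below.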
Testing your code (cat data | map | sort | reduce)
I recommend testing your mapper.py and reducer.py scripts by hand before trying them in a MapReduce job, so that the job does not run through without producing any results. Here are some suggestions for testing the functionality of your Map and Reduce scripts:
# very basic test
hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort | /home/hadoop/reducer.py
bar 1
foo 3
labs 1
quux 2
# using one of the ebooks as example input
# (see below on where to get the ebooks)
hadoop@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hadoop/mapper.py
The 1
Project 1
Gutenberg 1
EBook 1
of 1
[...]
(you get the idea)
Running the Python scripts on Hadoop
For this example we will need three ebooks:
The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
The Notebooks of Leonardo Da Vinci
Ulysses by James Joyce
Download each of them and store the uncompressed files, in us-ascii encoding, in a temporary directory such as /tmp/gutenberg:
hadoop@ubuntu:~$ ls -l /tmp/gutenberg/
total 3592
-rw-r--r-- 1 hadoop hadoop 674425 2007-01-22 12:56 20417-8.txt
-rw-r--r-- 1 hadoop hadoop 1423808 2006-08-03 16:36 7ldvc10.txt
-rw-r--r-- 1 hadoop hadoop 1561677 2004-11-26 09:48 ulyss12.txt
hadoop@ubuntu:~$
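If a downloaded text is not already in us-ascii, a conversion tool such as iconv can transliterate it. This is only a sketch (the file name ebook.txt is hypothetical, and the //TRANSLIT suffix depends on your platform's iconv):

hadoop@ubuntu:~$ iconv -f utf-8 -t us-ascii//TRANSLIT ebook.txt > /tmp/gutenberg/ebook.txt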
Copying local data to HDFS
Before we run the MapReduce job, we need to copy the local files into HDFS:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
/user/hadoop/gutenberg <dir>
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg
Found 3 items
/user/hadoop/gutenberg/20417-8.txt <r 1> 674425
/user/hadoop/gutenberg/7ldvc10.txt <r 1> 1423808
/user/hadoop/gutenberg/ulyss12.txt <r 1> 1561677
Running the MapReduce job
Now that everything is in place, we can run the Python MapReduce job on the Hadoop cluster. As mentioned above, we use Hadoop Streaming to pass data between the Map and Reduce phases via STDIN and STDOUT:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
-mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py -input gutenberg/*
-output gutenberg-output
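One caveat: the command above assumes that mapper.py and reducer.py exist under /home/hadoop/ on every node that runs a task. If they only exist on the machine submitting the job, Hadoop Streaming's -file option can ship them with the job, along the following lines (same paths assumed):

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
-file /home/hadoop/mapper.py -mapper mapper.py
-file /home/hadoop/reducer.py -reducer reducer.py
-input gutenberg/* -output gutenberg-output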
If you want to change some Hadoop settings when launching the job, such as increasing the number of Reduce tasks, you can use the "-jobconf" option:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
-jobconf mapred.reduce.tasks=16 -mapper ...
One important caveat: Hadoop does not honor mapred.map.tasks; it treats that value only as a hint, whereas a user-specified mapred.reduce.tasks is accepted as given.
The job will read all files in the HDFS directory gutenberg, process them, and store the results in separate result files in the HDFS directory gutenberg-output.
The output of a previous run looked like this:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
-mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py -input gutenberg/*
-output gutenberg-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/usr/local/hadoop-datastore/hadoop-hadoop/hadoop-unjar54543/]
[] /tmp/streamjob54544.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process : 7
[...] INFO streaming.StreamJob: getLocalDirs(): [/usr/local/hadoop-datastore/hadoop-hadoop/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_200803031615_0021
[...]
[...] INFO streaming.StreamJob: map 0% reduce 0%
[...] INFO streaming.StreamJob: map 43% reduce 0%
[...] INFO streaming.StreamJob: map 86% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 33%
[...] INFO streaming.StreamJob: map 100% reduce 70%
[...] INFO streaming.StreamJob: map 100% reduce 77%
[...] INFO streaming.StreamJob: map 100% reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_200803031615_0021
[...] INFO streaming.StreamJob: Output: gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$
As you can see from the output above, Hadoop also provides a basic web interface that displays job statistics and information. While the Hadoop cluster is running, you can point a browser at http://localhost:50030/ (the JobTracker web UI).
Check whether the result was generated and stored in the HDFS directory gutenberg-output:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg-output
Found 1 items
/user/hadoop/gutenberg-output/part-00000 <r 1> 903193 2007-09-21 13:00
hadoop@ubuntu:/usr/local/hadoop$
You can use the dfs -cat command to inspect the contents of the output file:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat gutenberg-output/part-00000
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 2
"A_ 1
"Absoluti 1
[...]
hadoop@ubuntu:/usr/local/hadoop$
When comparing outputs, note that the quote characters (") enclosing some of the words above were not inserted by Hadoop; they are a result of the word count code splitting on whitespace and leaving punctuation attached to words.
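If you would rather keep such punctuation out of the output, one option is to tokenize on letters and digits instead of whitespace in the mapper. This is a sketch of such a variant; the regular expression is an illustrative choice and mapper_tokenized.py is a hypothetical file name, not part of the original example:

#!/usr/bin/env python
# mapper_tokenized.py -- hypothetical variant that drops punctuation
import re
import sys

# match runs of letters and digits, so quotes and punctuation are skipped
WORD_RE = re.compile(r'[A-Za-z0-9]+')

for line in sys.stdin:
    for word in WORD_RE.findall(line):
        print '%s\t%s' % (word, 1)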