python-wc

A simple Python implementation of word count.

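To run it on a cluster, contact your Hadoop cluster administrator for access. Then, on a Hadoop client machine, submit the job with the following command: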

hadoop jar /opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.4.8.jar -input wc/wc.txt -output out-wc1/ -mapper map.py -combiner reduce.py -reducer reduce.py -file map.py -file reduce.py

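wc.txt must be uploaded to HDFS beforehand. The command specifies the input file, the output directory (which must not already exist), and the mapper and reducer implementations. The trailing -file options name the resources to upload; they are copied to every machine that runs a task.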
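The contents of map.py and reduce.py are not reproduced in this README, so the following is only a minimal sketch of what a streaming mapper and reducer for word count might look like. The function names `map_lines` and `reduce_lines` are hypothetical. The one constraint that does follow from the command above is that reduce.py is also passed as -combiner, so it has to accept its own output as input: it must sum the integer counts rather than count records.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer/combiner: sum the counts of consecutive records sharing a key.

    Hadoop Streaming sorts records by key before the reduce phase, so equal
    words arrive next to each other and groupby is sufficient.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run: sorted() stands in for Hadoop's shuffle/sort step.
sample = ["the quick fox", "the lazy dog"]
result = list(reduce_lines(sorted(map_lines(sample))))
```

In real streaming scripts each function would be fed `sys.stdin` and its records printed to standard output, one per line.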

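If you don't know where the streaming jar is (for example, because you didn't install the cluster yourself), you can find it with the following command: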

find / -name '*.jar' 2>/dev/null | grep streaming
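The same search can be sketched in Python, which may be handy when the lookup is part of a submission script. The helper name `find_streaming_jars` is hypothetical; it mimics the find-and-grep pipeline by keeping every .jar file whose full path contains "streaming".

```python
import fnmatch
import os


def find_streaming_jars(root):
    """Walk root and return sorted paths of .jar files with 'streaming' in the path."""
    matches = []
    # os.walk silently skips directories it cannot read.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, "*.jar"):
            path = os.path.join(dirpath, name)
            if "streaming" in path:
                matches.append(path)
    return sorted(matches)
```

Starting from a narrower root, for example `find_streaming_jars("/opt/cloudera/parcels")` on a layout like the one in the command above, is usually much faster than walking the whole filesystem.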

If you have Spark installed, you can use pyspark to achieve the same result:

pyspark
import re
sc.textFile("alan/wc/wc.txt").flatMap(lambda line: re.split(r'\W+', line)).map(lambda w: (w, 1)).reduceByKey(lambda v1, v2: v1 + v2).take(10)
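To see what each stage of that pipeline does without a Spark installation, here is a small pure-Python sketch of the same flatMap, map, and reduceByKey steps. The function `word_count` and its `drop_empty` parameter are hypothetical and not part of this repository. It also exposes one detail of the one-liner: splitting on `\W+` yields an empty string whenever a line starts or ends with punctuation, and that empty string is counted like any other word.

```python
import re
from collections import Counter


def word_count(lines, drop_empty=False):
    """Mirror the flatMap -> map -> reduceByKey pipeline on an in-memory list."""
    # flatMap: split every line on runs of non-word characters.
    tokens = (tok for line in lines for tok in re.split(r"\W+", line))
    if drop_empty:
        # Optional step that the pyspark one-liner does not perform.
        tokens = (tok for tok in tokens if tok)
    # map + reduceByKey: Counter adds 1 for each occurrence of a key.
    return Counter(tokens)


counts = word_count(["hello, world.", "hello again"])
clean = word_count(["hello, world.", "hello again"], drop_empty=True)
```

Here `counts` contains an entry for `""` because "hello, world." ends with a period, while `clean` does not. In the Spark version the same effect could be had by adding `.filter(lambda w: w)` after the `flatMap` call. Note also that `take(10)` returns ten arbitrary pairs, not the ten most frequent words.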
