mind4s/python-wc

python-wc

A simple Python implementation of word count.

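Running in a Hadoop environment

Contact your Hadoop cluster administrator, then submit the job from a Hadoop client machine with the following command: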

hadoop jar /opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.4.8.jar -input wc/wc.txt -output out-wc1/ -mapper map.py -combiner reduce.py -reducer reduce.py -file map.py -file reduce.py

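wc.txt must be uploaded to HDFS beforehand. The command specifies the input file, the output directory (which must not already exist), and the mapper and reducer implementations. The trailing -file options name the resources to upload; they are shipped to every machine that runs a task.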
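The contents of map.py and reduce.py are not shown here, so the following is only a minimal sketch of what a streaming mapper and reducer for word count might look like, assuming Python 3 and tab-separated `word<TAB>count` records. The function names `map_lines` and `reduce_lines` are hypothetical, not taken from the repository.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper sketch: emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer sketch: sum the counts of consecutive records that share a key.

    Hadoop Streaming sorts mapper output by key before the reduce phase,
    so records for the same word arrive next to each other.
    """
    # Skip blank lines, then split each record into (word, count).
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

In a real streaming script each function would be fed `sys.stdin` and its records printed to standard output. Because the reducer in this sketch reads and writes the same record format, the same script can also serve as the combiner, which matches the way the command above passes reduce.py to both -combiner and -reducer.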

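If you do not know where the streaming jar is (for example, because you did not install the cluster yourself), you can find it with the following command: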

find / -name '*.jar' | grep streaming

If you have Spark installed, you can get the same result with pyspark:

pyspark
import re
sc.textFile("alan/wc/wc.txt").flatMap(lambda line: re.split(r'\W+', line)).map(lambda w: (w, 1)).reduceByKey(lambda v1, v2: v1 + v2).take(10)
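To see what the flatMap, map and reduceByKey chain computes without a Spark installation, the same counting can be approximated locally. This is a rough sketch, not part of the repository: `word_counts` and `sample` are hypothetical names. Unlike the pyspark line, it drops the empty strings that `re.split` returns when a line starts or ends with a non-word character.

```python
import re
from collections import Counter


def word_counts(lines):
    """Local stand-in for flatMap -> map -> reduceByKey, using a Counter."""
    counts = Counter()
    for line in lines:
        # Same tokenizer as the pyspark example, minus empty tokens.
        counts.update(w for w in re.split(r"\W+", line) if w)
    return counts


sample = ["hello, world.", "hello again"]
print(word_counts(sample).most_common(2))
```

`most_common(2)` plays roughly the role of `take(10)`, except that it returns the highest counts, whereas `take` returns whichever elements come first.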
