Canopy Clustering using MapReduce in Hadoop


Files Included:

Gen.py

Stage 1: Canopy Center

mapperStg1.py

reducerStg1.py

Stage 2: Canopy Assign

mapperStg2.py

reducerStg2.py

Stage 3: Cluster Center

mapperStg3.py

reducerStg3.py

Stage 4: Cluster Assign

mapperStg4.py

reducerStg4.py

*The function of each file will be documented at a later date.

Description of the files:


Gen.py

-> Generates the Data Set on which we use Canopy-Clustering.

-> Generates a set of k-Centroids.
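A minimal sketch of what a generator like this could look like. The function names, the 2-D uniform data, the value range and the comma-separated line format are all assumptions made for illustration; they are not taken from Gen.py itself.

```python
import random


def generate_points(n, dim=2, low=0.0, high=100.0, seed=None):
    """Return n random points, each a tuple of `dim` floats in [low, high]."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(low, high) for _ in range(dim)) for _ in range(n)]


def pick_centroids(points, k, seed=None):
    """Pick k distinct data points to serve as the initial k centroids."""
    rng = random.Random(seed)
    return rng.sample(points, k)


def to_line(point):
    """Serialise a point as one comma-separated line (assumed format)."""
    return ",".join(f"{x:.4f}" for x in point)


# Small demo: 1000 points and 5 initial centroids, reproducible via the seed.
points = generate_points(1000, seed=42)
centroids = pick_centroids(points, k=5, seed=42)
```

Writing `to_line(p)` for every point to one file and for every centroid to another would give the two inputs the later stages read.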

DataPoint.py

-> DataPoint class.
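As an illustration only, a point class for this kind of pipeline might look like the sketch below. The attribute `coords`, the methods `from_line`, `distance` and `to_line`, and the comma-separated format are hypothetical and may differ from the actual DataPoint.py.

```python
import math


class DataPoint:
    """A point in n-dimensional space (hypothetical layout)."""

    def __init__(self, coords):
        self.coords = tuple(float(x) for x in coords)

    @classmethod
    def from_line(cls, line, sep=","):
        """Build a point from a text line such as '3,4'."""
        return cls(line.strip().split(sep))

    def distance(self, other):
        """Euclidean distance to another DataPoint."""
        return math.sqrt(
            sum((a - b) ** 2 for a, b in zip(self.coords, other.coords))
        )

    def to_line(self, sep=","):
        """Serialise back to a text line for the next streaming stage."""
        return sep.join(str(x) for x in self.coords)
```

Keeping parsing and serialisation on the class means every mapper and reducer can share one definition of the record format.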

Stage 1: Canopy Center

Mapper:

Input: Data points.

Output: List of Canopy Centers.

Function:

Reducer:

Input: Canopy Centers

Output: Canopy Centers

Function:
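Since the function descriptions for this stage are still to be written, here is a hedged sketch of one common way to pick canopy centers in MapReduce. It assumes a single distance threshold, here called `t2`: a point becomes a new center only if it lies farther than `t2` from every center chosen so far. The mapper applies this to its share of the data and the reducer applies it again to the candidates from all mappers. All names, the threshold and the record format are assumptions, not the code in mapperStg1.py or reducerStg1.py.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def parse_point(text):
    """Parse 'x,y,...' into a tuple of floats (assumed record format)."""
    return tuple(float(v) for v in text.strip().split(","))


def format_point(point):
    """Serialise a point back to 'x,y,...'."""
    return ",".join(str(v) for v in point)


def select_canopy_centers(points, t2):
    """Greedy pass: keep a point as a center if no existing center is within t2."""
    centers = []
    for p in points:
        if all(euclidean(p, c) > t2 for c in centers):
            centers.append(p)
    return centers


def stage1_mapper(lines, t2):
    """Emit local canopy centers under one constant key so a single reducer sees them all."""
    points = [parse_point(line) for line in lines if line.strip()]
    for center in select_canopy_centers(points, t2):
        yield "center\t" + format_point(center)


def stage1_reducer(lines, t2):
    """Merge the candidates from all mappers by running the same greedy pass again."""
    candidates = [parse_point(line.split("\t", 1)[1]) for line in lines if line.strip()]
    for center in select_canopy_centers(candidates, t2):
        yield format_point(center)
```

In a streaming job the two functions would be fed `sys.stdin` and their results printed line by line.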

Stage 2: Canopy Assign

Mapper:

Input: Canopy Centers

Output: Canopy Centers and the Data Points that belong to each.

Function:

Reducer:

Input: Canopy Centers, Data Points (stdin)

Output: Identity

Function: Echoes the result from the Mapper.
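The assignment step can be sketched as follows. This assumes a loose threshold, here called `t1`, and that a point joins every canopy whose center lies within `t1` of it, so a point may belong to several canopies. The threshold, the function names and the `center<TAB>point` output format are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def format_point(point):
    """Serialise a point as 'x,y,...'."""
    return ",".join(str(v) for v in point)


def assign_to_canopies(points, centers, t1):
    """Mapper logic: emit 'center<TAB>point' for every canopy the point falls in."""
    for p in points:
        for c in centers:
            if euclidean(p, c) <= t1:
                yield format_point(c) + "\t" + format_point(p)


def identity_reducer(lines):
    """Reducer logic: pass the mapper output through unchanged."""
    for line in lines:
        yield line
```

Because Hadoop sorts by key between the two phases, the identity reducer's output ends up grouped by canopy center.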

Stage 3: Cluster Center

Mapper:

Input:

-> List of 'k' Centroids

-> List of Canopy Centers

-> Canopy Centers, Data Points (stdin)

Output: K Centroids and the Data Points that belong to each.

Function:

Reducer:

Input:

Output:

Function:
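The reducer for this stage is not described yet, so the following is only a guess at the overall shape of a canopy-restricted k-means step. It assumes the mapper compares a point only against centroids that lie within `t1` of the point's canopy center (falling back to all centroids if none do), and that the reducer averages the points assigned to each centroid to produce its new position. Every name and both assumptions are illustrative.

```python
import math
from itertools import groupby


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(point, canopy_center, centroids, t1):
    """Index of the closest centroid among those sharing the point's canopy."""
    candidates = [
        i for i, c in enumerate(centroids) if euclidean(c, canopy_center) <= t1
    ]
    if not candidates:  # no centroid in this canopy: fall back to all of them
        candidates = list(range(len(centroids)))
    return min(candidates, key=lambda i: euclidean(point, centroids[i]))


def stage3_mapper(records, centroids, t1):
    """records: (canopy_center, point) pairs. Emits (centroid_index, point)."""
    for canopy_center, point in records:
        yield nearest_centroid(point, canopy_center, centroids, t1), point


def stage3_reducer(pairs):
    """Average the points of each centroid; sorting stands in for Hadoop's shuffle."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for index, group in groupby(ordered, key=lambda kv: kv[0]):
        pts = [p for _, p in group]
        dim = len(pts[0])
        yield index, tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))
```

Restricting the comparison to centroids in the same canopy is what saves distance computations compared with plain k-means.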

Stage 4: Cluster Assign

Mapper:

Input:

Output:

Function:

Reducer:

Input:

Output:

Function:

To replicate running:

Edit the run.sh shell script to suit your setup, then execute it.

Note:

If you run the stages from the Windows command prompt, you have to provide your own sort step to sort the mapper output before it reaches the reducer. Personally, I'd recommend just using a Linux OS to avoid the problem altogether.
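For anyone who does need such a sort step, a minimal sketch in Python is shown below. It assumes the mapper writes `key<TAB>value` lines and orders them by key only. The script name `sortStep.py` used in the note after the code is hypothetical.

```python
import sys


def sort_mapper_output(lines, sep="\t"):
    """Return non-empty lines sorted by the key before the first separator.

    The sort is stable, so lines with equal keys keep their original order.
    """
    cleaned = (line.rstrip("\n") for line in lines if line.strip())
    return sorted(cleaned, key=lambda line: line.split(sep, 1)[0])


def main():
    """Read mapper output from stdin and write it back sorted by key."""
    for line in sort_mapper_output(sys.stdin):
        sys.stdout.write(line + "\n")
```

Saved as a script with a call to `main()` at the bottom, it can sit between the two phases when testing locally, for example `python mapperStg1.py < data.txt | python sortStep.py | python reducerStg1.py`.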

Project Members (Alphabetically):

Archit Shukla

Raj Kiran

Sheraaz Jason

Website:

Canopy Clustering in Python using Hadoop (Map Reduce)
