Skip to content

don9z/pydoop

Repository files navigation

A small framework for hadoop streaming by Python

runner.py simplifies mapreduce code as well as its unit test written by Python.

Please refer to the wordcount example.

pydoop.py is a wrapper for hadoop cli, which can be used to do file operations and run mapreduce jobs easily, for example:

# list files
./pydoop.py ls .

# copy file from local to hadoop cluster
./pydoop.py put readme.md .

# test wordcount job locally
./pydoop.py start -t readme.md output wordcount_mapper.py wordcount_reducer.py

# run wordcount job in hadoop cluster
./pydoop.py start readme.md output wordcount_mapper.py wordcount_reducer.py runner.py

# remove file from cluster
./pydoop.py rm readme.md

Please refer to this blog post for detail.

About

A python hadoop streaming framework

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages