GitHub - fdouetteau/PyMapReduce: Simple MapReduce implementation in Python, for text file parallel processing

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
pymapreduce		pymapreduce
MANIFEST.in		MANIFEST.in
README.txt		README.txt
setup.py		setup.py

Repository files navigation


Py MapReduce
=================

Py MapReduce is a simple monoserver implementation of MapReduce in python, using the multiprocessing module. 
It can be use for instance for quick parallelization of file processing task, e.g. performing operations on each line of a large file.  
Simple operations (regexp matching etc..) are hard to multithread in python because of the Global Interpreter Lock (http://wiki.python.org/moin/GlobalInterpreterLock). Here multiprocessing can help
 
Sample job (Word Count)
------------

	class WC(Job):
	    "Sample Word count parallel implementation"
	    lc = 0
	    wc = 0 
	    bc = 0
	    def __init__(self, f):
	        self.file = f
    
	    def reduce_start(self):
	        self.lc = 0
	        self.wc = 0
	        self.bc = 0 
        
	    def enumerate(self):
	        return enumerate(open(self.file))
        
	    def map(self, pos, item):
	        return (pos, (1, len(item.split()), len(item)))
        
	    def reduce(self, pos, r):
	        (lc, wc, bc) = r
	        self.lc = self.lc + lc
	        self.wc = self.wc + wc
	        self.bc = self.bc + bc
                
	    def reduce_stop(self):
	        return (self.lc, self.wc, self.bc)