map_reduce.py

Synopsis

A multi-process MapReduce framework for machines with limited main memory. It runs on a single machine and uses the file system to store all intermediate calculations. It is based on Python’s multiprocessing Pool (Python 3.4). The framework provides the following features:

Low main memory consumption
Multi-processed
Multiple file support
Built-in performance configuration
Different file format support (txt, csv, json)
Scalable and easy to use

Users can create MapReduce jobs with this framework by defining only the map and reduce functions; the framework handles all the intermediate steps. The number of running processes is based on the number of input files. If there are fewer than three files, the framework assigns 3 processes to each file by default. The number of partitioned chunk files for each file is equal to the number of files.

Implemented purely in Python.
All the partition methods support stream parsing, which means that the framework doesn’t need to hold the whole TXT, JSON or CSV representation in textual form in memory.
Uses Python’s multiprocessing Pool for concurrency.
All the intermediate results are stored in physical files, which avoids main memory consumption.
Uses Python’s pickle module for serializing and de-serializing Python objects to a byte stream.
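
The streaming and on-disk ideas listed above can be sketched roughly as follows. This is a minimal illustration and not the framework's own code: `partition_text`, `dump_pairs` and `load_pairs` are hypothetical helper names, and dealing lines round-robin into chunk files is an assumed partitioning policy.

```python
import os
import pickle


def partition_text(path, num_chunks, out_dir):
    """Split a text file into num_chunks chunk files, one line at a time.

    Hypothetical helper: lines are dealt round-robin, so only the current
    line is held in memory, never the whole input file.
    """
    chunk_paths = [
        os.path.join(out_dir, "chunk_%d.txt" % i) for i in range(num_chunks)
    ]
    chunks = [open(p, "w") for p in chunk_paths]
    try:
        with open(path) as source:
            for line_no, line in enumerate(source):
                chunks[line_no % num_chunks].write(line)
    finally:
        for chunk in chunks:
            chunk.close()
    return chunk_paths


def dump_pairs(pairs, path):
    """Write (key, value) pairs to disk, one pickle record per pair."""
    with open(path, "wb") as out:
        for pair in pairs:
            pickle.dump(pair, out)


def load_pairs(path):
    """Lazily read back the pairs written by dump_pairs."""
    with open(path, "rb") as source:
        while True:
            try:
                yield pickle.load(source)
            except EOFError:
                # End of file: every record has been read.
                return
```

Writing one pickle record per pair lets a later step read the results back as a stream instead of loading one large list.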

Motivation

There are many MapReduce implementations in different programming languages. One of the most popular is Hadoop, which hides the details of concurrent programming from users and makes it very easy to create and run MapReduce jobs. Even though many MR frameworks are available today, only a few of them focus on running MapReduce jobs in a limited memory space. Most existing MR frameworks are designed to run on large clusters with powerful CPUs, delivering high performance on huge datasets.
The main objective of this project is to propose and implement a MapReduce framework that works on a single machine with limited main memory.

Installation

Step 1:
Get the zip of the project from here.

Step 2:
Once you have the source files you can start creating and running your MapReduce jobs. It's super simple.

Import the framework module.

from map_reduce import *

Create a new class that defines your mapper and reducer functions.

The mapper function must take a file and return a list of (key, value) pairs. Each (key, value) pair must be a tuple.

The reducer takes a list of (key, [values]) pairs. The framework has already grouped all values by key. The reducer returns a list of (key, value) pairs.
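
To make the reducer's input shape concrete, here is a small sketch of grouping mapper output by key. `group_by_key` is a hypothetical name, and this version groups in memory for brevity; the framework does its grouping internally and stores intermediate results on disk, so treat this only as an illustration of the data shapes.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Turn [(key, value), ...] into [(key, [values]), ...], sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


mapper_output = [("b", 1), ("a", 1), ("b", 1)]
reducer_input = group_by_key(mapper_output)
print(reducer_input)  # [('a', [1]), ('b', [1, 1])]
```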

Step 3:
Once you have created your class, make it extend the MapReduceInterface class.

class YourMRClass(MapReduceInterface):
	def __init__(self, files):
		self.files = files
		mapper = self.mapper
		reducer = self.reducer
		MapReduceInterface.__init__(self,  mapper, reducer, files)

Step 4:

Create an object of your class and call its run_program() method.

your_mr_class_obj = YourMRClass(input_files)
your_mr_class_obj.run_program()

Step 5:

Now relax! The framework will take care of the rest, and the output will be written to the map_reduce_output.txt file.

The example below is a full implementation of a class that uses the framework to create a MapReduce job. This class counts the number of occurrences of each word across all files.

# You have to import the framework first.
from map_reduce import *

# Define your class that inherits from MapReduceInterface.
# This means that your class object is a MapReduce job.
class WordCount(MapReduceInterface):

	def __init__(self, files):
		self.files = files
		mapper = self.mapper
		reducer = self.reducer
		# Calling the super class constructor.
		MapReduceInterface.__init__(self,  mapper, reducer, files)

	def mapper(self, file_chunk):
		"""
		Gets a file chunk.
		Returns [(key, value)].
		"""
		results = []
		# Split the chunk on whitespace and emit (word, 1) for every word.
		for w in file_chunk.read().split():
			results.append((w, 1))
		return results

	def reducer(self, key_values_list):
		"""
		Gets [(word1, [counts1]), (word2, [counts2]), ...].
		Returns [(word1, total_freq), (word2, total_freq), ...].
		"""
		result = []
		for key, values in key_values_list:
			# Sum the per-occurrence counts collected for this word.
			result.append((key, sum(values)))
		return result

All the framework execution logs are written to the map_reduce.log file.

API Reference

`MapReduceInterface` class extends MapReduce.

Instance Variables	Methods
mapper	__files_not_empty
reducer	__map_and_reduce_are_fine
files	__is_file_format_supported
num_processes	__the_same_format_files
	__check_file_names
	__set_num_processes
	__cleanup
	__finalize_program
	run_program
	shuffle
	call_map_reduce
	merge_reduce_results

`MapReduce` class

Instance Variables	Methods
mapper	get_filename
reducer	get_file_extension
file	apply_map
num_processes	apply_reduce
file_ext	apply_intermediate
	partition
	__partition_text
	__partition_csv
	__partition_json
	__load
	split_list
	run_master

`NoDaemonProcess` class extends multiprocessing.Process.

Instance Variables	Methods
	_get_daemon
	_set_daemon
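
As a rough illustration of what the two methods above are for, a non-daemonic process class is commonly written like the sketch below. This is an assumption about the general pattern, not a copy of the framework's source: the getter always reports `False` and the setter ignores whatever value the pool tries to assign, so worker processes stay non-daemonic and are allowed to start child processes of their own.

```python
import multiprocessing


class NoDaemonProcess(multiprocessing.Process):
    """A Process whose daemon flag always reads as False."""

    def _get_daemon(self):
        # Always report the process as non-daemonic.
        return False

    def _set_daemon(self, value):
        # Ignore attempts (for example by a Pool) to mark it daemonic.
        pass

    daemon = property(_get_daemon, _set_daemon)
```

This pattern was written with Python 3.4 in mind. Recent Python versions create pool workers through a context object, so assigning `Process = NoDaemonProcess` on a Pool subclass may need adapting there.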

`MyMRPool` class extends multiprocessing.pool.Pool.

import multiprocessing.pool

# NoDaemonProcess is the class described above, defined in map_reduce.py.
class MyMRPool(multiprocessing.pool.Pool):
	"""
	A pool that creates non-daemonic worker processes.

	We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
	because the latter is only a wrapper function, not a proper class.
	"""
	Process = NoDaemonProcess

`MapReduceError` class extends Exception.

Instance Variables	Methods
message	__init__
errors
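
An exception class with these two instance variables could look like the sketch below. The constructor signature and the optional `errors` default are assumptions made for illustration; the framework's own definition may differ.

```python
class MapReduceError(Exception):
    """Raised when a MapReduce job cannot run (illustrative sketch)."""

    def __init__(self, message, errors=None):
        # Pass the message to Exception so str(exc) shows it.
        super().__init__(message)
        self.message = message
        # Optional extra detail, e.g. a list of offending file names.
        self.errors = errors
```

A caller could then wrap `run_program()` in a `try`/`except MapReduceError` block and inspect `message` and `errors` to report what went wrong.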

Tests

TODO: Write unit tests for the framework.
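
As a possible starting point, the mapper and reducer logic from the word count example can be tested without starting any processes. The sketch below copies that logic into two plain functions (`word_count_mapper` and `word_count_reducer` are hypothetical names) and uses `io.StringIO` as a stand-in for a file chunk. It does not exercise the framework itself.

```python
import io
import unittest


def word_count_mapper(file_chunk):
    """Same logic as WordCount.mapper, written as a plain function."""
    return [(word, 1) for word in file_chunk.read().split()]


def word_count_reducer(key_values_list):
    """Same logic as WordCount.reducer, written as a plain function."""
    return [(key, sum(values)) for key, values in key_values_list]


class WordCountLogicTest(unittest.TestCase):

    def test_mapper_emits_one_pair_per_word(self):
        chunk = io.StringIO("to be or not to be")
        self.assertEqual(
            word_count_mapper(chunk),
            [("to", 1), ("be", 1), ("or", 1),
             ("not", 1), ("to", 1), ("be", 1)],
        )

    def test_reducer_sums_grouped_counts(self):
        grouped = [("be", [1, 1]), ("or", [1])]
        self.assertEqual(
            word_count_reducer(grouped), [("be", 2), ("or", 1)]
        )
```

Saved in a file of its own, this can be run with `python -m unittest`.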

Contributors

Tigran Hakobyan https://twitter.com/tiggreen

License

The MIT License

Copyright (c) Tigran Hakobyan. http://tiggreen.me

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
