ArkarNyuntWai/mincemeatpy
Fork of mincemeat.py lightweight Python Map-Reduce framework
mincemeat.py: MapReduce on Python

mincemeat.py is licensed under the MIT license.  See LICENSE for more
details.

REQUIREMENTS

Python 2.5 (2.6+ is best).  2.4 doesn't have a compatible
hmac/hashlib/sha1.

DESCRIPTION

The Map/Reduce Transaction applies 2 functions over a datasource of the
form:

    {name1: corpus1, ..., nameN: corpusN}

roughly corresponding to:

    results = reduce( map( datasource ))

ultimately resulting in results of the form:

    {key1: value, key2: value, ..., keyN: value}

First, the Server distributes each (name: corpus) across all available
Clients to run the Map function.  Then, once all Map results are
collected and merged by key, the Server distributes each
(key, [value, ...]) to all available Clients to run the Reduce
function.  Finally, all the Reduce results are collected and merged by
key, to produce the final results.

The Map Phase (distributed across all clients) takes the data and
produces:

    (name1, corpus1) ==> (key1, [value, ...]), ..., (keyN, [value, ...])
    ...
    (nameN, corpusN) ==> (key1, [value, ...]), ..., (keyN, [value, ...])

(the Server collects all key1 values, ..., keyN values)

The Reduce Phase (distributed across all clients) takes the combined
data from all of the Map phase results, and sends the data for one key
to each Client:

    (key1, [value, ...]) ==> (key1, result)
    ...
    (keyN, [value, ...]) ==> (keyN, result)

The Server collects all keys, and produces the results dictionary:

    {key1: result, ..., keyN: result}

All communication between Server and Clients occurs asynchronously
using the standard Python asyncore/asynchat library, and may be
interspersed with user-defined communication in either direction,
either from within the application's asynchronous communication thread,
or from another thread.  Tools are provided to implement arbitrary
synchronous events such as heartbeats, health checking, and soft or
hard communications shutdown.
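The dataflow described above can be sketched as a standalone, pure-Python
simulation.  This is only an illustration of the Map/merge/Reduce steps;
the helper names (run_transaction, merge) are invented here and are not
part of mincemeat.py's API:

```python
# Standalone simulation of the Map/Reduce dataflow described above;
# the helper names are illustrative, not part of mincemeat.py.

def mapfn(name, corpus):
    # Map: yield (key, value) pairs from one (name, corpus) entry
    for word in corpus.split():
        yield word.lower(), 1

def reducefn(key, values):
    # Reduce: collapse all values collected for one key into a result
    return sum(values)

def run_transaction(datasource):
    # Map phase: apply mapfn to each (name, corpus), merging by key
    merged = {}
    for name in datasource:
        for key, value in mapfn(name, datasource[name]):
            merged.setdefault(key, []).append(value)
    # Reduce phase: apply reducefn to each (key, [value, ...])
    return dict((key, reducefn(key, values))
                for key, values in merged.items())

datasource = {'doc1': "a rose is a rose", 'doc2': "a daisy"}
results = run_transaction(datasource)
# results == {'a': 3, 'rose': 2, 'is': 1, 'daisy': 1}
```

In the real framework, the Map and Reduce loops run on remote Clients
and the merging happens on the Server, but the key-by-key dataflow is
the same.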
API compatibility with Michael Fairley's original mincemeat.py has been
largely maintained: http://remembersaurus.com/mincemeatpy/

EXAMPLES

Run the basic example with:

    python example.py

and a worker with:

    python mincemeat.py -p changeme [server address (default "localhost")]

The example-sf-... examples require the Gutenberg SF CD, or some other
set of big *.txt files, in:

    ../Gutenberg SF CD/Gutenberg SF/*.txt

Download the Gutenberg SF CD using BitTorrent, from:

    http://www.gutenberg.org/cdproject/pgsfcd-032007.zip.torrent

Unzip the ISO, mount it, and copy (or symbolically link, on POSIX
platforms) it into the directory above where you've cloned the
mincemeatpy repository; name it "Gutenberg SF CD".

* Skipping the Reduce Phase

The first Big Corpus example illustrates replacing a trivial Reduce
phase with a "finish" function executed in the server node.  Start a
server (binds to all available interfaces):

    python example-sf-maponly.py

and worker(s) with:

    python mincemeat.py -p changeme [server address (default "localhost")]

* Dynamically Electing a Server

The next examples illustrate various means of dynamically spawning a
server, if necessary.  They all bind to "localhost" by default (change
addr_info in the source to specify another server host interface name).
Each instance elects itself to be the server, if necessary, or becomes
a client.  Start multiple instances on the same host:

    python example-sf-election.py

and worker(s) with:

    python example-sf-election.py

or:

    python mincemeat.py -v -p changeme

* Multithreaded Clients/Servers

The example-sf-masterless.py example runs both a client thread and a
server thread (if necessary) in the same instance, to illustrate how
one might build a dynamic, robust Map-Reduce grid when we don't have
control over the lifespan of the instances (eg. a web server farm,
...).

    python example-sf-masterless.py

This will run a client and a server!  To speed it up, create more
clients by running the same command.
Another implementation with similar functionality, but using a better
threading based daemon implementation for the Client and Server, is:

    python example-sf-daemon.py

* Client-driven REPL for submitting Map/Reduce tasks

A much more complex example allows multiple Clients to be established,
each of which may independently submit Map/Reduce Transactions to the
(automatically created) Server.  A prompt is printed on the console,
allowing a file name pattern (including "glob" wildcards) to be
specified, and the matching files are word-counted.

    python example-sf-repr.py

USAGE

Create an instance of mincemeat.Server in your application, set some
attributes specifying the desired Map and Reduce functions and a
datasource, and call Server.run_server().  Start Client node(s); unless
your application specifies custom Client.mapfn implementations, a
generic client may be started on any node by running:

    python mincemeat.py -p "changeme" [hostname]

where the password "changeme" corresponds to the one specified in your
Server implementation, and hostname is the network interface your
Server binds to (the default is localhost, so by default all Clients
would need to run on the same host).  They will each connect to the
Server, receive their Map/Reduce function configurations, and begin
receiving Map and then Reduce tasks.  Finally, the Server will print
the results.  See example*.py for details.

* datasource

Present your input data in dict-like form, where one or more keys each
index a corpus of data.  You may use a dict, or any other data
structure which implements the iterator protocol (__iter__() and
next()) for returning all keys, and indexing (__getitem__()) to return
the data for each key.  Each key must be unique, as it is used in
mincemeat.py for housekeeping.

* Map Phase: mapfn, collectfn

The Map function must accept 2 parameters, representing a name and the
corpus of data.
It must return an iterator (or a generator) which yields (key, value)
tuples from the data, which will be returned (as a dictionary) as the
results of the Map phase.  The Map function is unique amongst the
functions that the user must provide; it is the only one that must be
implemented as an iterator returning (key, value) pairs.  For example,
here's a mapfn that implements word counting (k is a filename, v is a
corpus of text):

    def get_lower_simple( k, v ):
        for w in v.split():
            yield w.lower(), 1

The Map phase always returns a dictionary where each key indexes a list
[value, value, ...] of all values that were produced with that key.
So, at the end of the Map phase, we'd see:

    {'a': [1, 1, ..., 1], 'and': [1, 1, ..., 1], ...}

returned to the server.  This works, but is sub-optimal for very large
jobs; we might want to post-process the Map data in the client.  This
is where the collectfn is used.

** Specialized Map/Reduce Function Configuration

Normally, you will specify 'mapfn' as a plain function in your Server
code (not a class or instance method), and you will either specify it
as a Server attribute:

    def mapfn(k, v):
        ...
    s = mincemeat.Server()
    s.mapfn = mapfn
    ...
    s.conn(...)

or, you may supply it in the Server.conn(credentials=...) package:

    s = mincemeat.Server()
    credentials = {
        'password': "changeme",
        'mapfn': mapfn,
        ...
    }
    s.conn(**credentials)

To avoid performing the Map or the Reduce phase altogether, you may
specify None for mapfn or reducefn (but NOT both!):

    s = mincemeat.Server()
    s.mapfn = None     # Don't perform Map; skip, and send data to Reduce

or:

    s.reducefn = None  # Don't perform Reduce; send Map output to results

This will result in no mapfn (or reducefn) being transmitted to the
Client, and no Map (or Reduce) tasks being generated; the output of the
datasource will be immediately sent to the (remaining) Map or Reduce
phase.
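The phase-skipping behaviour just described can be simulated in plain
Python.  This sketch is only an illustration of the semantics (the
simulate() helper is invented here, not part of mincemeat.py): with
reducefn left as None, the merged Map output becomes the results; with
mapfn left as None, each datasource item goes straight to Reduce.

```python
# Simulation of the phase-skipping semantics (not mincemeat.py code).

def simulate(datasource, mapfn=None, reducefn=None):
    if mapfn is not None:
        # Map phase: merge the (key, value) pairs by key
        merged = {}
        for name in datasource:
            for key, value in mapfn(name, datasource[name]):
                merged.setdefault(key, []).append(value)
    else:
        # mapfn is None: datasource output goes directly to Reduce
        merged = dict((name, [corpus])
                      for name, corpus in datasource.items())
    if reducefn is None:
        # reducefn is None: the Map output is the result
        return merged
    return dict((key, reducefn(key, values))
                for key, values in merged.items())

def get_lower_simple(k, v):
    for w in v.split():
        yield w.lower(), 1

# Skip Reduce: results are the merged Map output
out = simulate({'f': "b a b"}, mapfn=get_lower_simple)
# out == {'b': [1, 1], 'a': [1]}

# Skip Map: each corpus is reduced directly, as a one-value list
out2 = simulate({'f': "b a b"}, reducefn=lambda k, vs: len(vs))
# out2 == {'f': 1}
```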
Sometimes, however, your Map and/or Reduce functions are simply too
complex to serialize and transmit from the Server to the (generic)
Clients in the standard way.  If so, you may specify a fully custom
mapfn directly as a method of your derived mincemeat.Client class
implementation (of course, you won't be able to run default "python
mincemeat.py ..." clients!)  In your Server code:

    s = mincemeat.Server()
    s.mapfn = False    # Do Map, but don't transmit a 'mapfn' to Client!
    s.conn(...)

and in your custom Client implementation:

    class my_client(mincemeat.Client):
        def mapfn(self, k, v):
            ...

Now, you may access any custom data or methods in your Client mapfn()
required for processing Map tasks.

** collectfn (optional)

If the (optional) collectfn is provided, it is applied (in the client)
after the mapfn is done producing the raw Map (key, value) data.  The
same function signatures are allowed as those used in the Reduce phase;
a simple function taking key, [value, value], and returning a single
value (just like reducefn and finishfn):

    def sum_values( k, vs ):
        return sum( vs )

Or, it may be implemented as an iterator/generator taking
(key, [value, ...]) and yielding (key, value):

    def sum_values_generator( kvi ):
        for k, vs in kvi:
            yield k, sum( vs )

And for consistency, we allow collectfn to also return
(key, [value, ...]) pairs:

    def sum_values_long( kvi ):
        for k, vs in kvi:
            yield k, vs if len(vs) < 10 else [sum(vs)]

The result of the Map phase is always (key, [value, ...]); the
collectfn is allowed to return single values or lists; we will always
wrap any non-list collectfn result value as [value].

* Reduce Phase: reducefn, finishfn

In the Reduce phase, the Server sends each key, [value, ...] to a
Client, which reduces it to a key, value result and sends it back.  If
this is trivial, you may decide to skip the Reduce phase altogether;
just leave reducefn set to None.  We allow any function taking
key, [value, ...]
and returning value:

    function(k, vs) ==> v

or taking an iterator yielding key, [value, ...], and itself returning
an iterable yielding key, value:

    function(iter([(k, vs), ...])) ==> [(k, v), ...]

** finishfn (optional)

After the Reduce phase is done, the Server runs the (optional)
finishfn.  This function is presented all Reduce phase data, for any
finishing required.  In fact, if the Reduce function is trivial, you
may choose to run it entirely on the Server (instead of sending the
Reduce phase data expensively to the Client), by using finishfn to run
the Reduce function, instead of assigning it to reducefn.

* Result Phase: resultfn (optional)

Fired by the Server as soon as results are available.  If not
specified, the Server default resultfn just collects up all results for
later access.  A function taking a transaction ID and results
dictionary is required:

    def resultfn(txn, results):
        print "# Transaction ", txn
        for k, v in results.iteritems():
            print k, ": ", v

TOPOLOGY

In its most simple form, a Map-Reduce cluster has a Server and several
(perhaps transient) Clients.  The Server configures each Client
appropriately by serializing and sending the appropriate Map and Reduce
functions, sends one or more Map and/or Reduce requests to the Client,
and collects the results.  If a Client fails without returning a
result, the Server will (eventually) send the same Map/Reduce requests
to another Client.  When complete, the Server disconnects from all
Clients and exits.  In all cases, the Server is in control of the
lifespan of the Clients, and also schedules all of the Map/Reduce
Tasks, and all constituent Map/Reduce Transactions.

         +-------(C)
        /
       v
     (S)<--------(C)
       ^
        \
         +-----(C)

In other typical uses of Map-Reduce, the Clients might be in control.
For example, you might have a web server farm with many clients created
spontaneously by the web server infrastructure, where each process will
be a Map-Reduce Client, and any Client may need to schedule some
Map/Reduce task to be processed across all other Clients.  In this
case, a Server needs to be elected from among the Clients (if none
exists yet); another Server may need to be re-elected, should one
disappear before (or during) processing of some Client's task.  Since
the Server binds atomically and unilaterally to a port, any Client
could spawn a thread to implement a Server when it fails to connect to
an existing one.  Here is an example of a Client connecting to a Server
hosted as a thread within the same process:

          +-------(C)
         /
        v
    (C->(S))<--------(C)
        ^
         \
          +-----(C)

After that server node fails, the Clients connect to a new one; the
first one to detect the failure attempts to reconnect, fails, and then
spawns a Server thread:

    (C)-+
         \
          v
 (XXXXX)  (C->(S))
          ^
         /
    (C)-+
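The election idea above relies on bind() being atomic: exactly one
process (or thread) can claim the port, so the winner becomes the
Server and everyone else becomes a Client.  Here is a minimal sketch of
that pattern using raw sockets; it is an illustration only, not
mincemeat.py's actual election code, and the elect() helper name is
invented here:

```python
# Sketch of port-based Server election (illustrative; not mincemeat.py
# code).  The OS guarantees at most one listener per (host, port), so
# bind() acts as an atomic election primitive.

import socket

def elect(port, host='localhost'):
    """Try to win the Server role on (host, port); return the role."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(5)
        # We won the bind: spawn a Server thread here, then connect a
        # Client to it, as example-sf-masterless.py does
        return 'server', sock
    except socket.error:
        # Someone else already bound the port: just be a Client
        sock.close()
        return 'client', None

# Pick a free port first, so the demo doesn't collide with real services
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(('localhost', 0))
port = probe.getsockname()[1]
probe.close()

role1, srv = elect(port)   # first caller wins the Server role
role2, _ = elect(port)     # second caller finds the port taken
# role1 == 'server'; role2 == 'client'
if srv is not None:
    srv.close()
```

A real implementation must also handle the race where the old Server
dies between a failed connect and the bind attempt, which is why the
election examples retry the connect/bind cycle.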