def map(name, mapper, chunk_size=None, record_reader=None, combiner=None,
        reducer=None, **kwargs):
    """
    With map, you can process a file stored in cloud.files in parallel. The
    parallelism is achieved by dividing the file specified by *name* into
    chunks of size *chunk_size* (bytes). Each chunk is assigned a sub job,
    which processes just that chunk, allowing the entire file to be processed
    by as many cores in parallel as there are chunks. We call this type of
    sub job a "mapper sub job". If *chunk_size* is None, it is automatically
    set to 1/10th of the size of the file.

    Map returns a single job identifier (jid). The sub jobs that comprise it
    do not have identifiers and, therefore, cannot be accessed directly.
    cloud.info(jid), however, will show you information for relevant sub jobs.

    By default, each chunk is split into records (of 0 or more characters)
    using newlines as delimiters. If *record_reader* is specified as a string,
    each chunk is split into records using that string as the delimiter. If a
    record spans two chunks, it is guaranteed that a mapper will be called
    exactly once on the full record.

    *mapper* is a function that takes a single argument, a record, and should
    return an iterable of values (a generator). In the simplest case, it can
    return a generator that yields only one value. Example::

        def mapper(record):
            yield record

    When no *combiner* or *reducer* is specified, the return value of the
    cloud.files.map job will be roughly equivalent to::

        map(mapper, record_reader(file_contents))

    A *reducer* is a function that takes an iterable of values and returns an
    iterable of values. The iterable parameter iterates through all the
    values returned by all the mapper(record) calls. Specifying *reducer*
    results in the creation of one additional sub job. The reducer sub job
    grabs the results of each mapper sub job (iterators), combines them into
    a single iterator, and passes that iterator into your *reducer* function.
    The return value of the cloud.files.map job will be the iterator returned
    by the *reducer*.

    A *combiner*, like a *reducer*, takes an iterable of values and returns
    an iterable of values. The difference is that the *combiner* is run in
    each mapper sub job, and each one only takes in values that were produced
    from the associated chunk. If a *reducer* is also specified, the reducer
    sub job grabs the results of each *combiner* run in each mapper sub job.

    Example for counting the number of words in a document::

        def wordcount_mapper(record):
            yield len(record.split(' '))

        def wordcount_reducer(wordcounts):
            yield sum(wordcounts)

        jid = cloud.files.map('example_document', wordcount_mapper,
                              reducer=wordcount_reducer)

    Result::

        cloud.result(jid)
        >> [# of words]

    For advanced users, *record_reader* can also be specified as a function
    that takes in a file-like object (has methods read(), tell(), and
    seek()) and the end_byte for the current chunk. The *record_reader*
    should return an iterable of records. See default_record_reader for an
    example.

    Additional information exists on our blog and online documentation.

    Reserved special *kwargs* (see docs for details):

    * _cores:
        Set number of cores your job will utilize. See
        http://blog.picloud.com/2012/08/31/introducing-multicore-support/
        In addition to having access to more cores, you will have
        _cores*RAM[_type] where _type is the _type you select. Possible
        values depend on what _type you choose:
            'c1': 1
            'c2': 1, 2, 4, 8
            'f2': 1, 2, 4, 8, 16
            'm1': 1, 2
            's1': 1
    * _depends_on:
        An iterable of jids that represents all jobs that must complete
        successfully before any jobs created by this map function may be run.
    * _depends_on_errors:
        A string specifying how an error with a jid listed in _depends_on
        should be handled:
            'abort': Set this job to 'stalled' (default)
            'ignore': Treat an error as satisfying the dependency
    * _env:
        A string specifying a custom environment you wish to run your jobs
        within. See environments overview at
        http://blog.picloud.com/2011/09/26/introducing-environments-run-anything-on-picloud/
    * _fast_serialization:
        This keyword can be used to speed up serialization, at the cost of
        some functionality. It affects the serialization of both the map
        arguments and return values. The map function itself will always be
        serialized by the enhanced serializer, with debugging features.
        Possible values are:
            0. default -- use cloud module's enhanced serialization and
               debugging info
            1. no debug -- disable all debugging features for arguments
            2. use cPickle -- use Python's fast serializer, possibly causing
               PicklingErrors
    * _kill_process:
        Terminate the Python interpreter *func* runs in after *func*
        completes, preventing the interpreter from being used by subsequent
        jobs. See Technical Overview for more info.
    * _label:
        A user-defined string label that is attached to the created jobs.
        Labels can be used to filter when viewing jobs interactively (i.e.
        on the PiCloud website).
    * _max_runtime:
        Specify the maximum amount of time (in integer minutes) a job can
        run. If the job runs beyond this time, it will be killed.
    * _priority:
        A positive integer denoting the job's priority. PiCloud tries to run
        jobs with lower priority numbers before jobs with higher priority
        numbers.
    * _profile:
        Set this to True to enable profiling of your code. Profiling
        information is valuable for debugging, but may slow down your job.
    * _restartable:
        In the very rare event of hardware failure, this flag indicates that
        the job can be restarted if the failure happened in the middle of
        the job. By default, this is True. It should be unset if the job has
        external state (e.g. it modifies a database entry).
    * _type:
        Select the type of compute resources to use. PiCloud supports four
        types, specified as strings:
            'c1': 1 compute unit, 300 MB ram, low I/O (default)
            'c2': 2.5 compute units, 800 MB ram, medium I/O
            'm1': 3.25 compute units, 8 GB ram, high I/O
            's1': variable compute units (2 cu max), 300 MB ram, low I/O,
                  1 IP per core
        See http://www.picloud.com/pricing/ for pricing information.
    """
    cloud_obj = _getcloud()
    params = cloud_obj._getJobParameters(mapper, kwargs)  # takes care of kwargs

    file_details = _file_info(name)
    if not file_details['exists']:
        raise ValueError('file does not exist on the cloud, or is not yet ready to be accessed')
    file_size = int(file_details['size'])

    params['file_name'] = name

    # chunk_size (0 or None falls through to the default of 1/10th of the
    # file size; note 0 is falsy, so only a truthy value is validated here)
    if chunk_size:
        if not isinstance(chunk_size, (int, long)):
            raise Exception('chunk_size should be a non-zero integer value')
        params['chunk_size'] = chunk_size

    # mapper
    _validate_arguments(mapper, 'mapper')

    # record_reader
    if not record_reader:
        record_reader = default_record_reader('\n')
    elif isinstance(record_reader, basestring):
        record_reader = default_record_reader(record_reader)
    else:
        _validate_rr_arguments(record_reader, 'record_reader')

    # combiner (default: identity pass-through)
    if not combiner:
        def combiner(it):
            for x in it:
                yield x
    else:
        _validate_arguments(combiner, 'combiner')

    func_to_be_sent = _mapper_combiner_wrapper(mapper, name, file_size,
                                               record_reader, combiner)

    sfunc, sarg, logprefix, logcnt = cloud_obj.adapter.cloud_serialize(
        func_to_be_sent, params['fast_serialization'], [],
        logprefix='mapreduce.')

    data = Packer()
    data.add(sfunc)
    params['data'] = data.finish()

    # validate & serialize reducer
    if reducer:
        _validate_arguments(reducer, 'reducer')
        reducer = _reducer_wrapper(reducer)
        s_reducer, red_sarg, red_logprefix, red_logcnt = cloud_obj.adapter.cloud_serialize(
            reducer, params['fast_serialization'], [],
            logprefix='mapreduce.reducer.')
        data_red = Packer()
        data_red.add(s_reducer)
        params['data_red'] = data_red.finish()

    conn = _getcloudnetconnection()
    conn._update_params(params)
    cloud_obj.adapter.dep_snapshot()

    resp = conn.send_request(_filemap_job_query, params)
    return resp['jids']
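The mapper/combiner/reducer contract described in the docstring above can be checked locally before submitting a job. This is a minimal sketch, not part of the cloud API: `local_map` is a hypothetical helper that treats each record as its own chunk and uses `itertools.chain` to play the role of the reducer sub job merging each mapper's iterator.

```python
import itertools

def local_map(records, mapper, combiner=None, reducer=None):
    # One "mapper sub job" per record (each record treated as its own chunk).
    per_chunk = [mapper(r) for r in records]
    if combiner is not None:
        # A combiner runs inside each mapper sub job, over that chunk only.
        per_chunk = [combiner(it) for it in per_chunk]
    # The reducer sub job merges every mapper's iterator into one iterator.
    merged = itertools.chain(*per_chunk)
    return reducer(merged) if reducer is not None else merged

def wordcount_mapper(record):
    yield len(record.split(' '))

def wordcount_reducer(wordcounts):
    yield sum(wordcounts)

result = list(local_map(['a b c', 'd e', 'f'],
                        wordcount_mapper, reducer=wordcount_reducer))
# result == [6]  (3 + 2 + 1 words)
```

Because the combiner runs per chunk while the reducer sees all values, a job whose combiner already sums its chunk produces the same final total with less data shipped to the reducer sub job.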
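The advanced *record_reader* contract (a function of a file-like object and the chunk's end_byte) can be illustrated with a sketch. This is not PiCloud's actual default_record_reader; it assumes end_byte is the exclusive byte offset where the chunk ends and that a record belongs to the chunk containing its first byte, which is one way to guarantee a record spanning two chunks is yielded by exactly one mapper sub job.

```python
import io

def chunk_record_reader(fobj, end_byte, delimiter='\n'):
    # Hypothetical chunk-aware reader: fobj is positioned at the chunk start.
    start = fobj.tell()
    if start != 0:
        # If the preceding byte is not a delimiter, we are mid-record; the
        # previous chunk's reader owns that record, so skip to its end.
        fobj.seek(start - 1)
        if fobj.read(1) != delimiter:
            while True:
                c = fobj.read(1)
                if not c or c == delimiter:
                    break
    record = ''
    while True:
        if record == '' and fobj.tell() >= end_byte:
            return  # the next record starts in the next chunk
        c = fobj.read(1)
        if not c:  # EOF: flush a trailing record with no final delimiter
            if record:
                yield record
            return
        if c == delimiter:
            yield record
            record = ''
        else:
            record += c

# Demo: the 5-byte chunk boundary falls inside 'defgh'; the first reader
# reads past end_byte to finish that record, the second skips its tail.
text = 'abc\ndefgh\nij'
first = io.StringIO(text)
second = io.StringIO(text)
second.seek(5)
chunk1 = list(chunk_record_reader(first, 5))
chunk2 = list(chunk_record_reader(second, len(text)))
# chunk1 == ['abc', 'defgh'], chunk2 == ['ij']
```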
def setup_machine(email=None, password=None, api_key=None):
    """Prompts the user for login information and then sets up an api key on
    the local machine.

    If *api_key* is a False value, the interpretation is:
        None:  create an api key
        False: prompt for key selection
    """
    # Disable simulator -- we need to initiate net connections
    cloud.config.use_simulator = False
    cloud.config.commit()

    auth_token = None  # authentication key derived from webserver injection

    # connect to picloud
    cloud._getcloud().open()

    interactive_mode = not (email and password)

    if system_browser_may_be_graphical() and interactive_mode:
        print 'To authenticate your computer, a web browser will be launched. Please login if prompted and follow the instructions on screen.\n'
        raw_input('Press ENTER to continue')
        auth_info = web_acquire_token()
        if not auth_info:
            print 'Reverting to email/password authentication.\n'
        else:
            email, password = auth_info
            print '\n'

    if interactive_mode:
        print 'Please enter your PiCloud account login information.\n' + \
            'If you do not have an account, please create one at http://www.picloud.com\n' + \
            'Note that a password is required. If you have not set one, set one at https://www.picloud.com/accounts/settings/\n'

    try:
        if email:
            print 'Setup will proceed using this E-mail: %s' % email
        else:
            email = raw_input('E-mail: ')
        if not password:
            password = getpass.getpass('Password: ')

        # NOTE: the branch conditions and the key-listing call below were
        # scrubbed from the source and are reconstructed from the docstring
        # and the surrounding else-branches; cloud.account.list_keys is an
        # assumption.
        if interactive_mode:
            if api_key is False:
                keys = cloud.account.list_keys(email, password)
                print '\nYour API Key(s)'
                for key in keys:
                    print key
                api_key = raw_input('\nPlease select an API Key or just press enter to create a new one automatically: ')
                if api_key:
                    key = cloud.account.get_key(email, password, api_key)
                else:
                    key = cloud.account.create_key(email, password)
                print 'API Key: %s' % key['api_key']
            else:
                api_key = cloud.config.api_key
                if api_key and api_key != 'None':
                    print 'Using existing API Key: %s' % api_key
                    key = {'api_key': api_key}
                else:
                    key = cloud.account.create_key(email, password)
                    print 'API Key: %s' % key['api_key']
        else:
            key = cloud.account.get_key(email, password, api_key)
            print 'API Key: %s' % key['api_key']

        # save all key credentials
        if 'api_secretkey' in key:
            credentials.save_keydef(key)

        # set config and write it to file
        cloud.config.api_key = key['api_key']
        cloud.config.commit()
        cloud.cloudconfig.flush_config()

        # if the user is running "picloud setup" with sudo, we need to chown
        # the config file so that it's owned by the user and not root
        fix_sudo_path(os.path.join(cloud.cloudconfig.fullconfigpath,
                                   cloud.cloudconfig.configname))
        fix_sudo_path(cloud.cloudconfig.fullconfigpath)

        try:
            import platform
            conn = cloud._getcloudnetconnection()
            conn.send_request('report/install/',
                              {'hostname': platform.node(),
                               'language_version': platform.python_version(),
                               'language_implementation': platform.python_implementation(),
                               'platform': platform.platform(),
                               'architecture': platform.machine(),
                               'processor': platform.processor(),
                               'pyexe_build': platform.architecture()[0]})
        except:
            pass

    except EOFError:
        sys.stderr.write('Got EOF. Please run "picloud setup" to complete installation.\n')
        sys.exit(1)
    except KeyboardInterrupt:
        sys.stderr.write('Got Keyboard Interrupt. Please run "picloud setup" to complete installation.\n')
        sys.exit(1)
    except cloud.CloudException, e:
        sys.stderr.write(str(e) + '\n')
        sys.exit(3)