def map(name, mapper, chunk_size=None, record_reader=None, combiner=None, reducer=None, **kwargs):
    """
    With map, you can process a file stored in cloud.files in parallel. The
    parallelism is achieved by dividing the file specified by *name* into
    chunks of size *chunk_size* (bytes). Each chunk is assigned a sub job. The
    sub job in turn processes just that chunk, allowing for the entire file to
    be processed by as many cores in parallel as there are chunks. We will call
    this type of sub job a "mapper sub job".
    
    If chunk_size is None, it will be automatically set to 1/10th of the size
    of the file.
    
    Map will return a single job identifier (jid). The sub jobs that comprise
    it do not have identifiers of their own and, therefore, cannot be accessed
    directly. cloud.info(jid), however, will show information for the relevant
    sub jobs.
    
    By default, each chunk is split into records (of 0 or more characters) using
    newlines as delimiters. If *record_reader* is specified as a string, each
    chunk is split into records using that as the delimiter.
    
    In the event a record spans two chunks, it is guaranteed that the mapper
    will be called exactly once, on the full record; records are never
    truncated or double-processed at chunk boundaries.
    
    *mapper* is a function that takes a single argument, a record, and should
    return an iterable of values (typically a generator). In the simplest case,
    it can return a generator that yields only one value.
    
    Example::
    
        def mapper(record):
            yield record
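    
    With this mapper, splitting records on commas instead of newlines is just
    a matter of passing the delimiter (a sketch; 'example_document' is a
    hypothetical file already stored in cloud.files)::
    
        jid = cloud.files.map('example_document', mapper, record_reader=',')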
    
    When no *combiner* or *reducer* is specified, the return value of the
    cloud.files.map job will be roughly equivalent to::
    
        map(mapper, record_reader(file_contents))
    
    A *reducer* is a function that takes in an iterable of values and returns an 
    iterable of values.  The iterable parameter iterates through all the values 
    returned by all the mapper(record) calls. When the reducer is specified,
    *reducer* will result in the creation of one additional sub job. The reducer
    sub job grabs the results of each mapper sub job (iterators), combines them
    into a single iterator, and then passes that iterator into your *reducer*
    function. The return value of the cloud.files.map job will be the iterator
    returned by the *reducer*.
    
    A *combiner*, like a *reducer*, takes in an iterable of values and returns an
    iterable of values. The difference is that the *combiner* is run in each
    mapper sub job, and each one only takes in values that were produced from the
    associated chunk. If a *reducer* is also specified, then the reducer sub job
    grabs the results of each *combiner* run in each mapper sub job.
    
    Example for counting the number of words in a document::
    
        def wordcount_mapper(record):
            yield len(record.split(' '))
            
        def wordcount_reducer(wordcounts):
            yield sum(wordcounts)
            
        jid = cloud.files.map('example_document', wordcount_mapper, reducer=wordcount_reducer)
        
    Result::
    
        cloud.result(jid)
        >> [# of words]
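    
    A *combiner* can shrink the data each mapper sub job sends onward. As a
    sketch building on the example above, the word count can sum counts
    within each chunk before the reducer combines the per-chunk totals::
    
        def wordcount_combiner(chunk_counts):
            yield sum(chunk_counts)
        
        jid = cloud.files.map('example_document', wordcount_mapper,
                              combiner=wordcount_combiner,
                              reducer=wordcount_reducer)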
    
    For advanced users, *record_reader* can also be specified as a function
    that takes in a file-like object (one with read(), tell(), and seek()
    methods) and the end_byte of the current chunk. The *record_reader* should
    return an iterable of records. See default_record_reader for an example.
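    
    As a minimal sketch (hypothetical; default_record_reader remains the
    reference implementation), a record_reader yielding fixed-width 80-byte
    records could look like::
    
        def fixed_width_reader(file_obj, end_byte):
            # read up to the chunk boundary; the final read may run past
            # end_byte so a record spanning two chunks is returned whole
            while file_obj.tell() < end_byte:
                record = file_obj.read(80)
                if not record:
                    break
                yield record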
    
    Additional information is available on our blog and in the online
    documentation.
    
    Reserved special *kwargs* (see docs for details; a usage sketch follows
    this list):
        
        * _cores:
            Set number of cores your job will utilize. See http://blog.picloud.com/2012/08/31/introducing-multicore-support/
            In addition to having access to more cores, your job will have
            _cores times the RAM of a single core of the selected _type.
            Possible values for _cores depend on the _type you choose:
            
            'c1': 1
            'c2': 1, 2, 4, 8
            'f2': 1, 2, 4, 8, 16
            'm1': 1, 2
            's1': 1        
        * _depends_on:
            An iterable of jids that represents all jobs that must complete successfully 
            before any jobs created by this map function may be run.
        * _depends_on_errors:
            A string specifying how an error with a jid listed in _depends_on should be handled.
            'abort': Set this job to 'stalled'  (Default)
            'ignore': Treat an error as satisfying the dependency            
        * _env:
            A string specifying a custom environment you wish to run your jobs within.
            See environments overview at 
            http://blog.picloud.com/2011/09/26/introducing-environments-run-anything-on-picloud/
        * _fast_serialization:
            This keyword can be used to speed up serialization, at the cost of some
            functionality. It affects the serialization of both the map arguments and
            return values. The map function itself will always be serialized by the
            enhanced serializer, with debugging features. Possible values are:
            
            0. default -- use the cloud module's enhanced serialization and debugging info
            1. no debug -- disable all debugging features for arguments
            2. use cPickle -- use Python's fast serializer, possibly causing PicklingErrors
        * _kill_process:
            Terminate the Python interpreter *func* runs in after *func* completes,
            preventing the interpreter from being used by subsequent jobs. See
            Technical Overview for more info.
        * _label: 
            A user-defined string label that is attached to the created jobs. 
            Labels can be used to filter when viewing jobs interactively (i.e.
            on the PiCloud website).        
        * _max_runtime:
            Specify the maximum amount of time (in integer minutes) a job can run.
            If a job runs beyond this time, it will be killed.
        * _priority:
            A positive integer denoting the job's priority. PiCloud tries to run jobs
            with lower priority numbers before jobs with higher priority numbers.
        * _profile:
            Set this to True to enable profiling of your code. Profiling information is
            valuable for debugging, but may slow down your job.
        * _restartable:
            In the very rare event of hardware failure, this flag indicates that the job
            can be restarted if the failure happened in the middle of the job.
            By default, this is True. It should be set to False if the job has external
            state (e.g. it modifies a database entry).
        * _type:
            Select the type of compute resources to use. PiCloud supports multiple types,
            specified as strings:
            
            'c1'
                1 compute unit, 300 MB ram, low I/O (default)
            'c2'
                2.5 compute units, 800 MB ram, medium I/O
            'm1'
                3.25 compute units, 8 GB ram, high I/O
            's1'
                variable compute units (2 cu max), 300 MB ram, low I/O, 1 IP per core
            
            See http://www.picloud.com/pricing/ for pricing information.
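    
    A minimal sketch of passing several reserved kwargs together (jid_prev is
    assumed to be the jid of a previously created job)::
    
        jid = cloud.files.map('example_document', wordcount_mapper,
                              reducer=wordcount_reducer,
                              _type='c2', _label='wordcount',
                              _depends_on=[jid_prev])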
    """
    
    cloud_obj = _getcloud()
    params = cloud_obj._getJobParameters(mapper, kwargs)    # takes care of kwargs
    
    file_details = _file_info(name)
    if not file_details['exists']:
        raise ValueError('file does not exist on the cloud, or is not yet ready to be accessed')
    file_size = int(file_details['size'])
    params['file_name'] = name
    
    
    # chunk_size; note that `if chunk_size:` would silently drop an explicit 0,
    # so validate against None instead
    if chunk_size is not None:
        if not isinstance(chunk_size, (int, long)) or chunk_size <= 0:
            raise ValueError('chunk_size must be a positive integer (bytes)')
        params['chunk_size'] = chunk_size
            
    
    # mapper
    _validate_arguments(mapper, 'mapper')
    
    # record_reader
    if not record_reader:
        record_reader = default_record_reader('\n')
    else:
        if isinstance(record_reader, basestring):
            record_reader = default_record_reader(record_reader)
        else:
            _validate_rr_arguments(record_reader, 'record_reader')
    
    # combiner; defaults to an identity pass-through generator
    if not combiner:
        def combiner(it):
            for x in it:
                yield x
    else:
        _validate_arguments(combiner, 'combiner')
    
    func_to_be_sent = _mapper_combiner_wrapper(mapper, name, file_size, record_reader, combiner)
    
    sfunc, sarg, logprefix, logcnt = cloud_obj.adapter.cloud_serialize( func_to_be_sent, 
                                                                    params['fast_serialization'], 
                                                                    [], 
                                                                    logprefix='mapreduce.' )
    
    data = Packer()
    data.add(sfunc)
    params['data'] = data.finish()
    
    # validate reducer & serialize reducer
    if reducer:
        _validate_arguments(reducer, 'reducer')
        reducer = _reducer_wrapper(reducer)
        s_reducer, red_sarg, red_logprefix, red_logcnt = cloud_obj.adapter.cloud_serialize( reducer, params['fast_serialization'], [], logprefix='mapreduce.reducer.' )
        data_red = Packer()
        data_red.add(s_reducer)
        params['data_red'] = data_red.finish()
        
    conn = _getcloudnetconnection()
    conn._update_params(params)
    cloud_obj.adapter.dep_snapshot()
    
    resp = conn.send_request(_filemap_job_query, params)
    
    return resp['jids']


def setup_machine(email=None, password=None, api_key=None):
    """Prompts user for login information and then sets up api key on the
    local machine
    
    If api_key is a false value, the interpretation is:
        None: use the configured API key, or create one if none is set
        False: prompt for key selection
    
    """
    
    # Disable simulator -- we need to initiate net connections
    cloud.config.use_simulator = False
    cloud.config.commit()
    
    auth_token = None # authentication key derived from webserver injection
    
    # connect to picloud
    cloud._getcloud().open()
    
    interactive_mode = not (email and password)
    
    if system_browser_may_be_graphical() and interactive_mode:
        print 'To authenticate your computer, a web browser will be launched.  Please login if prompted and follow instructions on screen.\n'
        raw_input('Press ENTER to continue')
        
        
        auth_info = web_acquire_token()
        if not auth_info:
            print 'Reverting to email/password authentication.\n'
        else:
            email, password = auth_info
            print '\n'
    
    if interactive_mode:
        print 'Please enter your PiCloud account login information.\nIf you do not have an account, please create one at http://www.picloud.com\n' + \
        'Note that a password is required. If you have not set one, set one at https://www.picloud.com/accounts/settings/\n'
    
    try:
        if email:            
            print 'Setup will proceed using this E-mail: %s' % email
        else:
            email = raw_input('E-mail: ')
            
        if not password:
            password = getpass.getpass('Password: ')
        
        if not api_key:
            if api_key is False:
                # prompt the user to select among existing keys
                # (reconstructed flow; list_keys is assumed to mirror the
                # account.get_key/create_key API used below)
                keys = cloud.account.list_keys(email, password)
                
                print '\nYour API Key(s)'
                for key in keys:
                    print key
                
                api_key = raw_input('\nPlease select an API Key or just press enter to create a new one automatically: ')
                if api_key:
                    key = cloud.account.get_key(email, password, api_key)
                else:
                    key = cloud.account.create_key(email, password)
                    print 'API Key: %s' % key['api_key']
            else:
                api_key = cloud.config.api_key
                if api_key and api_key != 'None':
                    print 'Using existing API Key: %s' % api_key
                    key = {'api_key' : api_key}
                else:
                    key = cloud.account.create_key(email, password)
                    print 'API Key: %s' % key['api_key']
            
        else:
            key = cloud.account.get_key(email, password, api_key)
            print 'API Key: %s' % key['api_key']
        
        # save all key credentials
        if 'api_secretkey' in key:
            credentials.save_keydef(key)
        
        # set config and write it to file
        cloud.config.api_key = key['api_key']
        cloud.config.commit()
        cloud.cloudconfig.flush_config()        
                
        
        # if user is running "picloud setup" with sudo, we need to chown
        # the config file so that it's owned by user and not root.
        fix_sudo_path(os.path.join(cloud.cloudconfig.fullconfigpath,cloud.cloudconfig.configname))
        fix_sudo_path(cloud.cloudconfig.fullconfigpath)
        
        try:
            import platform
            conn = cloud._getcloudnetconnection()
            conn.send_request('report/install/', {'hostname': platform.node(),
                                                  'language_version': platform.python_version(),
                                                  'language_implementation': platform.python_implementation(),
                                                  'platform': platform.platform(),
                                                  'architecture': platform.machine(),
                                                  'processor': platform.processor(),
                                                  'pyexe_build' : platform.architecture()[0]
                                                  })
        except Exception:
            # usage reporting is best-effort; never block setup on a failure here
            pass
        
    except EOFError:
        sys.stderr.write('Got EOF. Please run "picloud setup" to complete installation.\n')
        sys.exit(1)
    except KeyboardInterrupt:
        sys.stderr.write('Got Keyboard Interrupt. Please run "picloud setup" to complete installation.\n')
        sys.exit(1)
    except cloud.CloudException, e:
        sys.stderr.write(str(e)+'\n')
        sys.exit(3)