def run(self):
    """Sweep over file sizes and interfaces, timing Dropbox up/downloads."""
    # OAuth access token (placeholder -- do not commit real tokens)
    token = '<DROPBOX_ACCESS_TOKEN>'
    client = dropbox.client.DropboxClient(token)
    parameters = {'size': igeom(128, 2048, 5),
                  'db_if': ['rest', 'sdk']}
    combs = sweep(parameters)
    sweeper = ParamSweeper(self.result_dir + "/sweeps", combs)
    f = open(self.result_dir + '/results.txt', 'w')
    while len(sweeper.get_remaining()) > 0:
        comb = sweeper.get_next()
        logger.info('Treating combination %s', pformat(comb))
        comb_dir = self.result_dir + '/' + slugify(comb)
        try:
            os.mkdir(comb_dir)
        except OSError:
            # the combination directory already exists
            pass
        fname = self.create_file(comb['size'])
        timer = Timer()
        if comb['db_if'] == 'sdk':
            self.upload_file_sdk(client, fname, fname.split('/')[-1])
            up_time = timer.elapsed()
            self.download_file_sdk(client, fname.split('/')[-1],
                                   comb_dir + '/' + fname.split('/')[-1])
            dl_time = timer.elapsed() - up_time
            sweeper.done(comb)
        elif comb['db_if'] == 'rest':
            logger.warning('REST interface not implemented')
            sweeper.skip(comb)
            continue
        os.remove(fname)
        f.write("%f %i %f %f \n" % (timer.start_date(), comb['size'],
                                    up_time, dl_time))
    f.close()
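# The sweep(parameters) call above expands the parameter dict into the
# cartesian product of its value lists. A minimal stand-alone sketch of that
# expansion (using itertools.product; sweep_sketch is an illustrative name,
# not execo's API):

```python
import itertools

def sweep_sketch(parameters):
    """Illustrative stand-in for execo_engine.sweep: expand a dict of
    value lists into the list of all parameter combinations."""
    keys = sorted(parameters)
    return [dict(zip(keys, values))
            for values in itertools.product(*(parameters[k] for k in keys))]

combs = sweep_sketch({'size': [128, 2048], 'db_if': ['rest', 'sdk']})
# 2 sizes x 2 interfaces -> 4 combinations
```

# Each returned dict plays the role of one `comb` handed out by
# sweeper.get_next() in the loops above.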
class oar_replay_workload(Engine):

    def init(self):
        parser = self.options_parser
        parser.add_option('--is_a_test',
                          dest='is_a_test',
                          action='store_true',
                          default=False,
                          help='prefix the result folder with "test", enter '
                               'a debug mode if it fails and remove the job '
                               'afterwards, unless it is a reservation')
        parser.add_option('--already_configured',
                          dest='already_configured',
                          action='store_true',
                          default=False,
                          help='if set, the OAR cluster is not re-configured')
        parser.add_option('--reservation_id',
                          help="Grid'5000 reservation job ID")
        parser.add_argument('experiment_config',
                            'The config JSON experiment description file')

    def setup_result_dir(self):
        run_type = 'test_' if self.options.is_a_test else ''
        self.result_dir = script_path + '/' + run_type + 'results_' + \
            time.strftime("%Y-%m-%d--%H-%M-%S")
        logger.info('result directory: {}'.format(self.result_dir))

    def run(self):
        """Run the experiment"""
        already_configured = self.options.already_configured
        reservation_job_id = int(self.options.reservation_id) \
            if self.options.reservation_id is not None else None
        is_a_test = self.options.is_a_test
        if is_a_test:
            logger.warn('THIS IS A TEST! This run will use only a few '
                        'resources')

        # make the result folder writable for all
        os.chmod(self.result_dir, 0o777)
        # import the configuration
        with open(self.args[0]) as config_file:
            config = json.load(config_file)
        # back up the configuration
        copy(self.args[0], self.result_dir)

        site = config["grid5000_site"]
        resources = config["resources"]
        nb_experiment_nodes = config["nb_experiment_nodes"]
        walltime = str(config["walltime"])
        env_name = config["kadeploy_env_name"]
        workloads = config["workloads"]

        # check that the workload files exist (assumes the same NFS mount
        # point is present on the remote and the local environment)
        for workload_file in workloads:
            with open(workload_file):
                pass
            # copy the workload files to the result dir
            copy(workload_file, self.result_dir)

        # define the workload parameters
        self.parameters = {
            'workload_filename': workloads
        }
        logger.info('Workloads: {}'.format(workloads))

        # define the iterator over the parameter combinations
        self.sweeper = ParamSweeper(os.path.join(self.result_dir, "sweeps"),
                                    sweep(self.parameters))

        # a previous run (resumed with -c result_dir) may have skipped
        # some combinations
        logger.info('Skipped parameters: {}'.format(
            self.sweeper.get_skipped()))
        logger.info('Number of parameter combinations: {}'.format(
            len(self.sweeper.get_remaining())))
        logger.info('Combinations: {}'.format(self.sweeper.get_remaining()))

        if reservation_job_id is not None:
            jobs = [(reservation_job_id, site)]
        else:
            jobs = oarsub([(OarSubmission(resources=resources,
                                          job_type='deploy',
                                          walltime=walltime), site)])
        job_id, site = jobs[0]
        if job_id:
            try:
                logger.info("waiting for job %s to start on %s"
                            % (job_id, site))
                wait_oar_job_start(
                    job_id, site, prediction_callback=prediction_callback)
                logger.info("getting nodes of %s on %s" % (job_id, site))
                nodes = get_oar_job_nodes(job_id, site)
                # sort the nodes
                nodes = sorted(nodes, key=lambda node: node.address)
                # keep only the necessary nodes under the switch
                if nb_experiment_nodes > len(nodes):
                    raise RuntimeError('The number of nodes in the '
                                       'reservation ({}) does not match the '
                                       'requested resources '
                                       '({})'.format(len(nodes),
                                                     nb_experiment_nodes))
                nodes = nodes[:nb_experiment_nodes]
                logger.info("deploying nodes: {}".format(nodes))
                deployed, undeployed = deploy(
                    Deployment(nodes, env_name=env_name),
                    check_deployed_command=already_configured)
                if undeployed:
                    logger.warn("NOT deployed nodes: {}".format(undeployed))
                    raise RuntimeError('Deployment failed')

                if not already_configured:
                    # install OAR
                    install_cmd = "apt-get update; apt-get install -y "
                    node_packages = "oar-node"
                    logger.info(
                        "installing OAR on nodes: {}".format(nodes[1:]))
                    install_oar_nodes = Remote(
                        install_cmd + node_packages,
                        nodes[1:],
                        connection_params={'user': '******'})
                    install_oar_nodes.start()

                    server_packages = (
                        "oar-server oar-server-pgsql oar-user "
                        "oar-user-pgsql postgresql python3-pip "
                        "libjson-perl postgresql-server-dev-all")
                    install_oar_sched_cmd = """
                    mkdir -p /opt/oar_sched; \
                    cd /opt/oar_sched; \
                    git clone https://github.com/oar-team/oar3.git; \
                    cd oar3; \
                    git checkout dce942bebc2; \
                    pip3 install -e .; \
                    cd /usr/lib/oar/schedulers; \
                    ln -s /usr/local/bin/kamelot; \
                    pip3 install psycopg2
                    """
                    logger.info(
                        "installing OAR server node: {}".format(nodes[0]))
                    install_master = SshProcess(
                        install_cmd + server_packages + ";" +
                        install_oar_sched_cmd,
                        nodes[0],
                        connection_params={'user': '******'})
                    install_master.run()
                    install_oar_nodes.wait()
                    if not install_master.ok:
                        Report(install_master)

                    configure_oar_cmd = """
                    sed -i \
                        -e 's/^\(DB_TYPE\)=.*/\\1="Pg"/' \
                        -e 's/^\(DB_HOSTNAME\)=.*/\\1="localhost"/' \
                        -e 's/^\(DB_PORT\)=.*/\\1="5432"/' \
                        -e 's/^\(DB_BASE_PASSWD\)=.*/\\1="oar"/' \
                        -e 's/^\(DB_BASE_LOGIN\)=.*/\\1="oar"/' \
                        -e 's/^\(DB_BASE_PASSWD_RO\)=.*/\\1="oar_ro"/' \
                        -e 's/^\(DB_BASE_LOGIN_RO\)=.*/\\1="oar_ro"/' \
                        -e 's/^\(SERVER_HOSTNAME\)=.*/\\1="localhost"/' \
                        -e 's/^\(SERVER_PORT\)=.*/\\1="16666"/' \
                        -e 's/^\(LOG_LEVEL\)\=\"2\"/\\1\=\"3\"/' \
                        -e 's#^\(LOG_FILE\)\=.*#\\1="{result_dir}/oar.log"#' \
                        -e 's/^\(JOB_RESOURCE_MANAGER_PROPERTY_DB_FIELD\=\"cpuset\".*\)/#\\1/' \
                        -e 's/^#\(CPUSET_PATH\=\"\/oar\".*\)/\\1/' \
                        -e 's/^\(FINAUD_FREQUENCY\)\=.*/\\1="0"/' \
                        /etc/oar/oar.conf
                    """.format(result_dir=self.result_dir)
                    configure_oar = Remote(
                        configure_oar_cmd, nodes,
                        connection_params={'user': '******'})
                    configure_oar.run()
                    logger.info("OAR is configured on all nodes")

                    # configure the server
                    create_db = "oar-database --create --db-is-local"
                    config_oar_sched = (
                        "oarnotify --remove-queue default;"
                        "oarnotify --add-queue default,1,kamelot")
                    start_oar = "systemctl start oar-server.service"
                    logger.info(
                        "configuring the OAR database: {}".format(nodes[0]))
                    config_master = SshProcess(
                        create_db + ";" + config_oar_sched + ";" + start_oar,
                        nodes[0],
                        connection_params={'user': '******'})
                    config_master.run()

                    # propagate the SSH keys
                    logger.info("configuring OAR SSH")
                    oar_key = "/tmp/.ssh"
                    Process('rm -rf ' + oar_key).run()
                    Process('scp -o BatchMode=yes -o PasswordAuthentication=no '
                            '-o StrictHostKeyChecking=no '
                            '-o UserKnownHostsFile=/dev/null '
                            '-o ConnectTimeout=20 -rp -o User=root ' +
                            nodes[0].address + ":/var/lib/oar/.ssh " +
                            oar_key).run()
                    # Get(nodes[0], "/var/lib/oar/.ssh", [oar_key],
                    #     connection_params={'user': '******'}).run()
                    Put(nodes[1:], [oar_key], "/var/lib/oar/",
                        connection_params={'user': '******'}).run()

                    add_resources_cmd = """
                    oarproperty -a cpu || true; \
                    oarproperty -a core || true; \
                    oarproperty -c -a host || true; \
                    oarproperty -a mem || true; \
                    """
                    for node in nodes[1:]:
                        add_resources_cmd += (
                            "oarnodesetting -a -h {node} -p host={node} "
                            "-p cpu=1 -p core=4 -p cpuset=0 -p mem=16; \\\n"
                            .format(node=node.address))

                    add_resources = SshProcess(
                        add_resources_cmd, nodes[0],
                        connection_params={'user': '******'})
                    add_resources.run()

                    if add_resources.ok:
                        logger.info("OAR is now configured!")
                    else:
                        raise RuntimeError(
                            "error in the OAR configuration: abort!")

                # TODO: back up the OAR configuration

                # do the replay
                logger.info('beginning the replay')
                while len(self.sweeper.get_remaining()) > 0:
                    combi = self.sweeper.get_next()
                    workload_file = os.path.basename(
                        combi['workload_filename'])
                    oar_replay = SshProcess(
                        script_path + "/oar_replay.py " +
                        combi['workload_filename'] + " " + self.result_dir +
                        " oar_gant_" + workload_file,
                        nodes[0])
                    oar_replay.stdout_handlers.append(
                        self.result_dir + '/' + workload_file + '.out')
                    logger.info("replaying workload: {}".format(combi))
                    oar_replay.run()
                    if oar_replay.ok:
                        logger.info("Replay workload OK: {}".format(combi))
                        self.sweeper.done(combi)
                    else:
                        logger.info("Replay workload NOT OK: {}".format(combi))
                        self.sweeper.cancel(combi)
                        raise RuntimeError("error in the OAR replay: abort!")
            except Exception:
                traceback.print_exc()
                ipdb.set_trace()
            finally:
                if is_a_test:
                    ipdb.set_trace()
                if reservation_job_id is None:
                    logger.info("deleting job: {}".format(jobs))
                    oardel(jobs)
if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    sweeps = sweep(PARAMETERS)
    sweeper = ParamSweeper(
        # Maybe put the sweeper under the experimentation directory,
        # i.e. current/sweeps
        persistence_dir=os.path.join("%s/sweeps" % TEST_DIR),
        sweeps=sweeps,
        save_sweeps=True,
        name="test_case_1")

    # Get the next parameters in the set of all remaining combinations.
    # This set is temporarily viewed as a sorted list through this filter
    # function.
    params = sweeper.get_next(sort_params_by_nbr_clients)
    while params:
        if not accept(params):
            # Reject this element for good. Note that the semantics of
            # sweeper.skip is different (a skipped combination comes back).
            sweeper.done(params)
            params = sweeper.get_next(sort_params_by_nbr_clients)
            continue
        # clean any old backup_dir
        params.pop("backup_dir", None)
        params.update({"backup_dir": generate_id(params)})
        t.g5k(broker=BROKER, env=TEST_DIR)
        t.inventory()
        t.prepare(broker=BROKER)
        print(params)
        t.test_case_1(**params)
        # mark the combination as treated and move on to the next one
        # (without this, the loop would spin forever on the same params)
        sweeper.done(params)
        params = sweeper.get_next(sort_params_by_nbr_clients)
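# The comment above notes that sweeper.skip has different semantics from
# marking a combination done. A minimal sketch of one such bookkeeping
# scheme, where done combinations are gone for good but skipped ones can be
# re-queued later. Plain lists and hypothetical names -- this is NOT
# execo's ParamSweeper implementation, just an illustration of the idea:

```python
class SweeperSketch:
    """Illustrative sketch of sweeper-like bookkeeping: 'done' retires a
    combination permanently, 'skip' sets it aside for a later pass."""

    def __init__(self, combs):
        self.remaining = list(combs)
        self.done_set = []
        self.skipped = []

    def get_next(self):
        return self.remaining.pop(0) if self.remaining else None

    def done(self, comb):
        self.done_set.append(comb)

    def skip(self, comb):
        self.skipped.append(comb)

    def requeue_skipped(self):
        # skipped combinations become pending again
        self.remaining.extend(self.skipped)
        self.skipped = []
```

# This is why the driver above calls sweeper.done() on rejected
# combinations: it wants them retired, not revisited.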
class overturn(Engine):

    def create_sweeper(self):
        """Define the parameter space and return a sweeper."""
        parameters = {
            'RA': ['1.e5', '1.e6', '1.e7'],
            'RCMB': [2.],
            'KFe': [0.85, 0.9, 0.95, 0.99]
        }
        sweeps = sweep(parameters)
        self.sweeper = ParamSweeper(os.path.join(self.result_dir, "sweeps"),
                                    sweeps)

    def create_par_file(self, comb):
        """Create the run directory on the remote server and upload the par file."""
        logger.info('Creating and uploading par file')
        comb_dir = parent_dir + slugify(comb) + '/'
        logger.info('comb_dir = ' + comb_dir)
        # create the remote directories
        SshProcess('mkdir -p ' + comb_dir + 'Img ; mkdir -p ' +
                   comb_dir + 'Op ; ', jobserver).run()
        # generate the par file
        par_file = 'par_' + slugify(comb)
        logger.info('par_file = %s', style.emph(par_file))
        nml = f90nml.read('template.nml')
        nml['refstate']['ra0'] = float(comb['RA'])
        nml['tracersin']['K_Fe'] = comb['KFe']
        nml['geometry']['r_cmb'] = comb['RCMB']
        nztot = min(int(2**(math.log10(float(comb['RA'])) + 1)), 128)
        nml['geometry']['nztot'] = nztot
        nml['geometry']['nytot'] = int(math.pi * (comb['RCMB'] + 0.5) * nztot)
        nml.write(par_file, force=True)
        logger.info('Created par file ' + par_file)
        # upload the par file to the remote directory
        Put([jobserver], [par_file], remote_location=comb_dir).run()
        SshProcess('cd ' + comb_dir + ' ; mv ' + par_file + ' par',
                   jobserver).run()
        logger.info('Done')

    def submit_job(self, comb):
        """Use the batch script on psmn and return the job id."""
        logger.info('Submitting job on ' + jobserver)
        comb_dir = parent_dir + slugify(comb) + '/'
        job_sub = SshProcess('cd ' + comb_dir +
                             ' ; /usr/local/bin/qsub '
                             '/home/stephane/ExamplePBS/batch_single',
                             jobserver).run()
        return job_sub.stdout.splitlines()[-1].split('.')[0]

    def is_job_running(self, job_id=None):
        """Return True while the job is still known to qstat."""
        get_state = SshProcess('qstat -f ' + str(job_id), jobserver)
        get_state.ignore_exit_code = True
        get_state.run()
        return get_state.ok

    def retrieve(self):
        """Retrieve the results (not implemented yet)."""
        SshProcess('')

    def workflow(self, comb):
        self.create_par_file(comb)
        job_id = self.submit_job(comb)
        logger.info('Combination %s will be treated by job %s',
                    slugify(comb), str(job_id))
        while self.is_job_running(job_id):
            sleep(10)
        self.sweeper.done(comb)

    def run(self):
        self.create_sweeper()
        logger.info('%s parameter combinations to be treated',
                    len(self.sweeper.get_sweeps()))
        threads = []
        while len(self.sweeper.get_remaining()) > 0:
            comb = self.sweeper.get_next()
            logger.info('comb = %s', comb)
            t = Thread(target=self.workflow, args=(comb,))
            t.daemon = True
            threads.append(t)
            t.start()
        for t in threads:
            t.join()
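# run() above starts one daemon thread per combination and joins them all,
# so every submitted job is polled concurrently. The same pattern, reduced
# to a self-contained sketch with a stub workflow (the names here are
# illustrative, not the engine's real methods):

```python
import threading

results = []
lock = threading.Lock()

def workflow(comb):
    # stand-in for create_par_file + submit_job + polling the queue
    with lock:
        results.append(comb['RA'])

combs = [{'RA': ra} for ra in ('1.e5', '1.e6', '1.e7')]
threads = []
for comb in combs:
    t = threading.Thread(target=workflow, args=(comb,))
    t.daemon = True  # do not block interpreter exit on a stuck poll
    threads.append(t)
    t.start()
for t in threads:
    t.join()
# all three combinations are treated, in some order
```

# The lock is needed here because the stub mutates shared state; in the
# engine above, ParamSweeper's own persistence plays that role.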
def run(self):
    """Run method from Engine, implementing our workflow."""
    mongo = ClientMongo()
    if not self.options.file:
        if not self.options.only:
            size = {1,
                    long(self.options.size * 0.25),
                    long(self.options.size * 0.5),
                    long(self.options.size * 0.75),
                    long(self.options.size)}
        else:
            size = {long(self.options.size)}
    else:
        if self.OnlyDownload:
            size = getFilSize(self.options.file)
        else:
            size = {0}
    if self.options.drive:
        drive = self.options.drive
    else:
        drive = self.drive
    interface = ['rest', 'sdk']
    parameters = {
        'size': size,
        'if': interface,
        'drive': drive,
        'transfert': self.transfert
    }
    for n in range(int(self.options.ntest)):
        logger.info('---------------------')
        logger.info('Round %i', n + 1)
        combs = sweep(parameters)
        date = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        pathResults = os.getcwd() + '/Results/Bench' + date
        sweeper = ParamSweeper(pathResults + "/sweeps", combs)
        f = open(pathResults + '/results.txt', 'w')
        while len(sweeper.get_remaining()) > 0:
            # sort the parameters
            for i in interface:
                for dr in drive:
                    for s in size:
                        comb = sweeper.get_next(filtr=lambda r: filter(
                            lambda x: x['drive'] == dr and
                            x['size'] == s and
                            x['if'] == i, r))
                        if not comb:
                            continue
                        # start of the workflow
                        if comb['drive'] == 'amazon':
                            p = providerS3.ProviderS3()
                        elif comb['drive'] == 'dropbox':
                            p = providerDB.ProviderDB()
                        else:
                            p = providerGD.ProviderGD()
                        logger.info('Treating combination %s', pformat(comb))
                        comb_dir = pathResults + '/' + slugify(comb)
                        if not os.path.isdir(comb_dir):
                            os.mkdir(comb_dir)
                        if not self.options.file:
                            fname = self.create_file(comb['size'])
                        else:
                            fname = self.options.file
                        timer = Timer()
                        up_time = 0
                        dl_time = 0
                        start_date = datetime.datetime.now()
                        if comb['if'] == 'sdk':
                            if p.provider_name == "amazon":
                                # AMAZON
                                clientAmz = p.getConnexion()
                                if self.OnlyDownload:
                                    p.bucketKey += fname
                                else:
                                    p.bucketKey += '/' + fname
                                if comb['transfert'] in ("upload", "upDown"):
                                    p.upload_file_sdk(
                                        clientAmz.get_bucket(p.bucketName),
                                        p.bucketKey, fname)
                                    up_time = timer.elapsed()
                                if comb['transfert'] in ("download", "upDown"):
                                    p.download_file_sdk(
                                        clientAmz.get_bucket(p.bucketName),
                                        p.bucketKey,
                                        comb_dir + '/' + fname.split('/')[-1])
                                    dl_time = timer.elapsed() - up_time
                                if not self.OnlyDownload:
                                    p.delete_file_sdk(
                                        clientAmz.get_bucket(p.bucketName),
                                        p.bucketKey)
                            elif p.provider_name == "dropbox":
                                # DROPBOX
                                client = p.getToken()
                                if comb['transfert'] in ("upload", "upDown"):
                                    p.upload_file_sdk(
                                        client, fname, fname.split('/')[-1])
                                    up_time = timer.elapsed()
                                if comb['transfert'] in ("download", "upDown"):
                                    p.download_file_sdk(
                                        client, fname.split('/')[-1],
                                        comb_dir + '/' + fname.split('/')[-1])
                                    dl_time = timer.elapsed() - up_time
                                if not self.OnlyDownload:
                                    p.delete_file(client,
                                                  fname.split('/')[-1])
                            elif p.provider_name == "googledrive":
                                # GOOGLEDRIVE
                                drive_service = p.getConnexion()
                                new_file = None
                                if comb['transfert'] in ("upload", "upDown"):
                                    new_file = p.upload_file_sdk(
                                        drive_service, fname,
                                        fname.split('/')[-1], 'text/plain')
                                    up_time = timer.elapsed()
                                if comb['transfert'] in ("download", "upDown"):
                                    p.download_file_sdk(
                                        drive_service, new_file,
                                        comb_dir + '/' + fname.split('/')[-1])
                                    dl_time = timer.elapsed() - up_time
                                if not self.OnlyDownload:
                                    p.delete_file_sdk(drive_service,
                                                      new_file['id'])
                            sweeper.done(comb)
                        elif comb['if'] == 'rest':
                            logger.warning('REST interface not implemented')
                            sweeper.skip(comb)
                            if not self.OnlyDownload:
                                if os.path.isfile(fname):
                                    os.remove(fname)
                                # delete only once REST is implemented
                                # os.remove(comb_dir + '/' + fname.split('/')[-1])
                            continue
                        if comb['transfert'] in ("upload", "upDown"):
                            f.write("%s %s %s %s %s %s %s %f %i %s %f\n" %
                                    (self.localisation['ip'],
                                     self.localisation['lat'],
                                     self.localisation['lon'],
                                     self.localisation['city'],
                                     self.localisation['country'],
                                     comb['drive'], comb['if'],
                                     timer.start_date(), comb['size'],
                                     "upload", up_time))
                            mongo.collection.insert({
                                'ip': self.localisation['ip'],
                                'latitude': self.localisation['lat'],
                                'longitude': self.localisation['lon'],
                                'city': self.localisation['city'],
                                'country': self.localisation['country'],
                                'drive': comb['drive'],
                                'interface': comb['if'],
                                'start_date': start_date,
                                'size': comb['size'],
                                'transfert': 'upload',
                                'time': up_time
                            })
                        if comb['transfert'] in ("download", "upDown"):
                            f.write("%s %s %s %s %s %s %s %f %i %s %f\n" %
                                    (self.localisation['ip'],
                                     self.localisation['lat'],
                                     self.localisation['lon'],
                                     self.localisation['city'],
                                     self.localisation['country'],
                                     comb['drive'], comb['if'],
                                     timer.start_date(), comb['size'],
                                     "download", dl_time))
                            mongo.collection.insert({
                                'ip': self.localisation['ip'],
                                'latitude': self.localisation['lat'],
                                'longitude': self.localisation['lon'],
                                'city': self.localisation['city'],
                                'country': self.localisation['country'],
                                'drive': comb['drive'],
                                'interface': comb['if'],
                                'start_date': start_date,
                                'size': comb['size'],
                                'transfert': 'download',
                                'time': dl_time
                            })
                        if not self.OnlyDownload:
                            if os.path.isfile(fname):
                                os.remove(fname)
                            if os.path.isfile(comb_dir + '/' + fname):
                                os.remove(comb_dir + '/' +
                                          fname.split('/')[-1])
        f.close()
        # delete the bench folder
        os.rmdir(self.result_dir)
    logger.info("---------------------------------------")
    for t in check_Exp_database(self.options, self.localisation)['result']:
        logger.info(t)
def run(self):
    # Defining experiment parameters
    self.parameters = {
        'n_clients': [400, 450, 500, 550, 600],
        'n_transitions': [10000]
    }
    cluster = 'griffon'
    sweeps = sweep(self.parameters)
    sweeper = ParamSweeper(os.path.join(self.result_dir, "sweeps"), sweeps)
    server_out_path = os.path.join(self.result_dir, "server.out")

    self._updateStat(sweeper.stats())

    # Loop on the number of nodes
    while True:
        # Taking the next parameter combination
        comb = sweeper.get_next()
        if not comb:
            break

        # Performing the submission on G5K
        site = get_cluster_site(cluster)
        self._log("Output will go to " + self.result_dir)

        n_nodes = int(math.ceil(
            float(comb['n_clients']) /
            EX5.get_host_attributes(
                cluster + '-1')['architecture']['smt_size'])) + 1
        self._log("Reserving {0} nodes on {1}".format(n_nodes, site))

        resources = "{cluster=\\'" + cluster + "\\'}/nodes=" + str(n_nodes)
        submission = EX5.OarSubmission(resources=resources,
                                       job_type='allow_classic_ssh',
                                       walltime='00:10:00')
        job = EX5.oarsub([(submission, site)])
        self.__class__._job = job

        # Sometimes oarsub fails silently
        if job[0][0] is None:
            print("\nError: no job was created")
            sys.exit(1)

        # Wait for the job to start
        self._log("Waiting for job {0} to start...\n".format(
            BOLD_MAGENTA + str(job[0][0]) + NORMAL))
        EX5.wait_oar_job_start(job[0][0], job[0][1],
                               prediction_callback=prediction)
        nodes = EX5.get_oar_job_nodes(job[0][0], job[0][1])

        # Deploying nodes
        # deployment = EX5.Deployment(hosts=nodes,
        #                             env_file='path_to_env_file')
        # run_deploy = EX5.deploy(deployment)
        # nodes_deployed = run_deploy.hosts[0]

        # Copying the active_data program to all deployed hosts
        EX.Put([nodes[0]], '../dist/active-data-lib-0.1.2.jar',
               connexion_params={'user': '******'}).run()
        EX.Put([nodes[0]], '../server.policy',
               connexion_params={'user': '******'}).run()

        # Loop on the number of requests per client process
        while True:
            # Split the nodes
            clients = nodes[1:]
            server = nodes[0]
            self._log("Running experiment with {0} nodes and {1} "
                      "transitions per client".format(
                          len(clients), comb['n_transitions']))

            # Launching the server on one node
            out_handler = FileOutputHandler(server_out_path)
            launch_server = EX.Remote('java -jar active-data-lib-0.1.2.jar',
                                      [server],
                                      stdout_handler=out_handler,
                                      stderr_handler=out_handler).start()
            self._log("Server started on " + server.address)
            time.sleep(2)

            # Launching the clients
            rank = 0
            n_cores = EX5.get_host_attributes(
                clients[0])['architecture']['smt_size']
            cores = nodes * n_cores
            cores = cores[0:comb['n_clients']]  # Cut out the extra cores

            client_connection_params = {
                'taktuk_gateway': 'lyon.grid5000.fr',
                'host_rewrite_func': None
            }

            self._log("Launching {0} clients...".format(len(cores)))

            client_cmd = "/usr/bin/env java -cp " \
                "/home/ansimonet/active-data-lib-0.1.2.jar " \
                "org.inria.activedata.examples.perf.TransitionsPerSecond " + \
                "{0} {1} {2} {3} {4}".format(server.address, 1200,
                                             "{{range(len(cores))}}",
                                             len(cores),
                                             comb['n_transitions'])
            client_out_handler = FileOutputHandler(
                os.path.join(self.result_dir, "clients.out"))
            client_request = EX.TaktukRemote(
                client_cmd, cores,
                connexion_params=client_connection_params,
                stdout_handler=client_out_handler,
                stderr_handler=client_out_handler)
            client_request.run()

            if not client_request.ok():
                # Some client failed, please panic
                self._log("One or more client processes failed. "
                          "Enjoy reading their outputs.")
                self._log("OUTPUT STARTS "
                          "-------------------------------------------------\n")
                for process in client_request.processes():
                    print("----- {0} returned {1}".format(
                        process.host().address, process.exit_code()))
                    if not process.stdout() == "":
                        print(GREEN + process.stdout() + NORMAL)
                    if not process.stderr() == "":
                        print(RED + process.stderr() + NORMAL)
                    print("")
                self._log("OUTPUT ENDS "
                          "---------------------------------------------------\n")
                sweeper.skip(comb)
                launch_server.kill()
                launch_server.wait()
            else:
                # Waiting for the server to end
                launch_server.wait()

                # Getting the log files
                distant_path = OUT_FILE_FORMAT.format(
                    len(cores), comb['n_transitions'])
                local_path = distant_path
                EX.Get([server], distant_path).run()
                EX.Local('mv ' + distant_path + ' ' +
                         os.path.join(self.result_dir, local_path)).run()
                EX.Get([server], 'client_*.out',
                       local_location=self.result_dir)
                EX.Remote('rm -f client_*.out', [server])
                self._log("Finishing experiment with {0} clients and {1} "
                          "transitions per client".format(
                              comb['n_clients'], comb['n_transitions']))
                sweeper.done(comb)

            sub_comb = sweeper.get_next(filtr=lambda r: filter(
                lambda s: s["n_clients"] == comb['n_clients'], r))
            self._updateStat(sweeper.stats())

            if not sub_comb:
                # Killing the job
                EX5.oar.oardel(job)
                self.__class__._job = None
                break
            else:
                comb = sub_comb

    print("")
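# The reservation size above is computed as: enough nodes to host one client
# process per hardware thread, plus one extra node for the server. A worked
# example, assuming 8 threads per node (an illustrative value -- the real
# smt_size comes from EX5.get_host_attributes at run time):

```python
import math

def nodes_needed(n_clients, smt_size):
    """One node hosts smt_size client processes; +1 node for the server."""
    return int(math.ceil(float(n_clients) / smt_size)) + 1

print(nodes_needed(400, 8))  # -> 51 with the assumed 8 threads/node
```

# So the 400-client combination would reserve 51 nodes: 50 for clients and
# one dedicated to the server.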
class mpi_bench(Engine):

    def run(self):
        """Inherited method; the code for running the engine goes here."""
        self.define_parameters()
        if self.prepare_bench():
            logger.info('Bench prepared on all frontends')
            self.run_xp()

    def define_parameters(self):
        """Create the iterator over the parameter combinations to be explored."""
        # fixed number of nodes
        self.n_nodes = 4
        # choose a list of clusters
        clusters = ['graphene', 'petitprince', 'edel', 'paradent', 'stremi']
        # clusters = ['petitprince', 'paradent']
        # compute the maximum number of cores among all clusters
        max_core = self.n_nodes * max(
            get_host_attributes(cluster + '-1')['architecture']['smt_size']
            for cluster in clusters)
        # define the parameters
        self.parameters = {
            'cluster': clusters,
            'n_core': filter(lambda i: i >= self.n_nodes,
                             list(takewhile(lambda i: i < max_core,
                                            (2**i for i in count(0, 1))))),
            'size': ['A', 'B', 'C']
        }
        logger.info(self.parameters)
        # define the iterator over the parameter combinations
        self.sweeper = ParamSweeper(os.path.join(self.result_dir, "sweeps"),
                                    sweep(self.parameters))
        logger.info('Number of parameter combinations %s',
                    len(self.sweeper.get_remaining()))

    def prepare_bench(self):
        """Configure and compile the bench, copy the binaries to the frontends.

        Return True if the preparation is OK.
        """
        logger.info("preparation: configure and compile the benchmark")
        # the involved sites; we do the compilation on the first of these
        sites = list(set(map(get_cluster_site, self.parameters['cluster'])))
        # generate the bench compilation configuration
        bench_list = '\n'.join(['lu\t%s\t%s' % (size, n_core)
                                for n_core in self.parameters['n_core']
                                for size in self.parameters['size']])
        # Reserve a node, because compiling on the frontend is forbidden
        # and because we need mpif77
        jobs = oarsub([(OarSubmission(resources="nodes=1",
                                      job_type='allow_classic_ssh',
                                      walltime='0:10:00'), sites[0])])
        if jobs[0][0]:
            try:
                logger.info("copying bench archive to %s", sites[0])
                copy_bench = Put([sites[0]], ['NPB3.3-MPI.tar.bz2']).run()
                logger.info("extracting bench archive on %s", sites[0])
                extract_bench = Remote('tar -xjf NPB3.3-MPI.tar.bz2',
                                       [sites[0]]).run()
                logger.info("waiting for job %s to start", jobs[0])
                wait_oar_job_start(*jobs[0], prediction_callback=pred_cb)
                logger.info("getting nodes of %s", jobs[0])
                nodes = get_oar_job_nodes(*jobs[0])
                logger.info("configuring bench compilation")
                conf_bench = Remote(
                    'echo "%s" > ~/NPB3.3-MPI/config/suite.def' % bench_list,
                    nodes).run()
                logger.info("compiling bench")
                compilation = Remote(
                    'cd NPB3.3-MPI && make clean && make suite', nodes).run()
                logger.info("compilation finished")
            except Exception:
                logger.error("unable to compile the bench")
                return False
            finally:
                oardel(jobs)
        # copy the binaries to all the other frontends
        frontends = sites[1:]
        rsync = Remote('rsync -avuP ~/NPB3.3-MPI/ {{frontends}}:NPB3.3-MPI',
                       [get_host_site(nodes[0])] * len(frontends))
        rsync.run()
        return compilation.ok and rsync.ok

    def run_xp(self):
        """Iterate over the parameters and execute the bench."""
        while len(self.sweeper.get_remaining()) > 0:
            comb = self.sweeper.get_next()
            if comb['n_core'] > get_host_attributes(
                    comb['cluster'] + '-1')['architecture']['smt_size'] \
                    * self.n_nodes:
                self.sweeper.skip(comb)
                continue
            logger.info('Processing new combination %s', comb)
            site = get_cluster_site(comb['cluster'])
            jobs = oarsub([(OarSubmission(
                resources="{cluster='" + comb['cluster'] + "'}/nodes=" +
                          str(self.n_nodes),
                job_type='allow_classic_ssh',
                walltime='0:10:00'), site)])
            if jobs[0][0]:
                try:
                    wait_oar_job_start(*jobs[0])
                    nodes = get_oar_job_nodes(*jobs[0])
                    bench_cmd = ('mpirun -H %s -n %i %s '
                                 '~/NPB3.3-MPI/bin/lu.%s.%i' % (
                                     ",".join(node.address
                                              for node in nodes),
                                     comb['n_core'],
                                     get_mpi_opts(comb['cluster']),
                                     comb['size'],
                                     comb['n_core']))
                    lu_bench = SshProcess(bench_cmd, nodes[0])
                    lu_bench.stdout_handlers.append(
                        self.result_dir + '/' + slugify(comb) + '.out')
                    lu_bench.run()
                    if lu_bench.ok:
                        logger.info("comb ok: %s", comb)
                        self.sweeper.done(comb)
                        continue
                finally:
                    oardel(jobs)
            logger.info("comb NOT ok: %s", comb)
            self.sweeper.cancel(comb)
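# define_parameters above builds the n_core list as the powers of two below
# max_core, keeping only values that fit at least one core per node. The
# same itertools-based logic as a stand-alone function (core_counts is an
# illustrative name):

```python
from itertools import count, takewhile

def core_counts(n_nodes, max_core):
    """Powers of two strictly below max_core, keeping only values >= n_nodes
    so each node gets at least one core."""
    powers = list(takewhile(lambda i: i < max_core,
                            (2**i for i in count(0))))
    return [i for i in powers if i >= n_nodes]

print(core_counts(4, 64))  # -> [4, 8, 16, 32]
```

# With n_nodes = 4 and a hypothetical max_core of 64, the sweep would try
# 4, 8, 16 and 32 MPI processes.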
def campaign(broker, provider, conf, test, env):

    def generate_id(params):
        def clean(s):
            return str(s).replace("/", "_sl_") \
                         .replace(":", "_sc_")

        return "-".join([
            "%s__%s" % (clean(k), clean(v))
            for k, v in sorted(params.items())
        ])

    def accept(params):
        call_ratio_max = 3
        cast_ratio_max = 3
        call_type = params["call_type"]
        if params["nbr_servers"] > params["nbr_clients"]:
            return False
        if call_type == "rpc-call":
            if not params["pause"]:
                # maximum rate
                return call_ratio_max * params["nbr_servers"] >= \
                    params["nbr_clients"]
            else:
                # we can afford more clients:
                # based on our estimation a client sends 200 msgs at full rate
                return call_ratio_max * params["nbr_servers"] >= \
                    params["nbr_clients"] * 200 * params["pause"]
        else:
            if not params["pause"]:
                # maximum rate
                return cast_ratio_max * params["nbr_servers"] >= \
                    params["nbr_clients"]
            else:
                # we can afford more clients:
                # based on our estimation a client sends 200 msgs at full rate
                return cast_ratio_max * params["nbr_servers"] >= \
                    params["nbr_clients"] * 1000 * params["pause"]

    # Function to pass as a parameter to ParamSweeper.get_next();
    # gives the illusion that the set of params is sorted by nbr_clients
    def sort_params_by_nbr_clients(param_set):
        return sorted(list(param_set), key=lambda k: k['nbr_clients'])

    # Dump each params dict in the backup dir
    def dump_param(params):
        if not os.path.exists("%s/params.json" % test):
            with open("%s/params.json" % test, 'w') as outfile:
                json.dump([], outfile)
        # Add the current params to the json
        with open("%s/params.json" % test, 'r') as outfile:
            all_params = json.load(outfile)
        all_params.append(params)
        with open("%s/params.json" % test, 'w') as outfile:
            json.dump(all_params, outfile)

    # Loading the conf
    config = {}
    with open(conf) as f:
        config = yaml.safe_load(f)
    parameters = config["campaign"][test]

    sweeps = sweep(parameters)
    filtered_sweeps = [param for param in sweeps if accept(param)]
    sweeper = ParamSweeper(
        # Maybe put the sweeper under the experimentation directory;
        # this should be current/sweeps
        persistence_dir=os.path.join("%s/sweeps" % test),
        sweeps=filtered_sweeps,
        save_sweeps=True,
        name=test)
    params = sweeper.get_next(sort_params_by_nbr_clients)
    PROVIDERS[provider](broker=broker, config=config, env=test)
    t.inventory()
    while params:
        params.pop("backup_dir", None)
        params.update({"backup_dir": generate_id(params)})
        t.prepare(broker=broker)
        t.test_case_1(**params)
        sweeper.done(params)
        dump_param(params)
        params = sweeper.get_next(sort_params_by_nbr_clients)
    t.destroy()
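The `accept` predicate above prunes the cartesian product before it ever reaches the sweeper. A minimal sketch of that sweep-then-filter pattern, using only the standard library (`make_combs` mimics what execo_engine's `sweep()` produces, and the toy predicate is illustrative, not the campaign's real constraint):

```python
from itertools import product

def make_combs(parameters):
    """Expand a dict of value lists into the cartesian product of dicts,
    mimicking execo_engine.sweep()."""
    keys = sorted(parameters)
    return [dict(zip(keys, values))
            for values in product(*(parameters[k] for k in keys))]

def accept(params):
    # toy predicate: never run more servers than clients
    return params["nbr_servers"] <= params["nbr_clients"]

parameters = {"nbr_clients": [1, 2, 4], "nbr_servers": [1, 2]}
combs = make_combs(parameters)                    # 6 combinations
filtered = [c for c in combs if accept(c)]        # drops (clients=1, servers=2)
```

Filtering before constructing the `ParamSweeper` keeps rejected combinations out of the persisted state entirely, so a resumed run never re-examines them.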
class overturn(Engine):

    def create_sweeper(self):
        """Define the parameter space and return a sweeper."""
        parameters = {
            'RA': ['1.e5', '1.e6'],
            'RCMB': [1.19, 3.29],
            'KFe': [0.85, 0.9]
        }
        sweeps = sweep(parameters)
        self.sweeper = ParamSweeper(os.path.join(self.result_dir, "sweeps"),
                                    sweeps)

    def create_par_file(self, comb):
        """Create the run directory on the remote server and upload the par file."""
        logger.info('Creating par file')
        comb_dir = parent_dir + slugify(comb) + '/'
        logger.info('comb_dir = ' + comb_dir)
        # Create remote directories
        sp.call('mkdir -p ' + comb_dir + 'Img ; mkdir -p ' + comb_dir + 'Op ; ',
                shell=True)
        # Generate par file
        par_file = 'par_' + slugify(comb)
        nml = f90nml.read('template.nml')
        nml['refstate']['ra0'] = float(comb['RA'])
        nml['tracersin']['K_Fe'] = comb['KFe']
        nml['geometry']['r_cmb'] = comb['RCMB']
        nztot = min(int(2**(math.log10(float(comb['RA'])) + 1)), 128)
        nml['geometry']['nztot'] = nztot
        nml['geometry']['nytot'] = int(math.pi * (comb['RCMB'] + 0.5) * nztot)
        nml.write(par_file, force=True)
        logger.info('Created par file ' + par_file)
        # Upload par file to the remote directory
        sp.call('cp ' + par_file + ' ' + comb_dir, shell=True)
        sp.call('cd ' + comb_dir + ' ; mv ' + par_file + ' par', shell=True)
        logger.info('Done')

    def submit_job(self, comb):
        """Use the batch script."""
        logger.info('Submitting job on ' + jobserver)
        comb_dir = parent_dir + slugify(comb) + '/'
        job_sub = sp.Popen('cd ' + comb_dir +
                           ' ; /usr/local/bin/qsub '
                           '/home/stephane/ExamplePBS/batch_single',
                           shell=True, stdout=sp.PIPE, stderr=sp.STDOUT)
        return job_sub.stdout.readlines()[-1].split('.')[0]

    def workflow(self, comb):
        self.create_par_file(comb)
        job_id = self.submit_job(comb)
        logger.info('Combination %s will be treated by job %s',
                    slugify(comb), str(job_id))
        self.sweeper.done(comb)

    def run(self):
        self.create_sweeper()
        logger.info('%s parameter combinations to be treated',
                    len(self.sweeper.get_sweeps()))
        threads = []
        while len(self.sweeper.get_remaining()) > 0:
            comb = self.sweeper.get_next()
            t = Thread(target=self.workflow, args=(comb,))
            t.daemon = True
            threads.append(t)
            t.start()
        for t in threads:
            t.join()
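The `run` loop above spawns one daemon thread per combination with no upper bound, which can flood the batch server. A bounded alternative can be sketched with only the standard library; the `workflow` stub below merely stands in for the engine method, which would prepare and submit each combination:

```python
from concurrent.futures import ThreadPoolExecutor

def workflow(comb):
    # stand-in for overturn.workflow(): create par file, submit, mark done
    return sum(comb.values())

combs = [{"RA": i, "KFe": j} for i in range(3) for j in range(2)]

# at most 2 combinations are processed concurrently;
# pool.map preserves the input order of the results
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(workflow, combs))
```

The `with` block also replaces the manual `t.join()` loop: the executor waits for all submitted work before exiting.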
class oar_replay_workload(Engine):

    def init(self):
        parser = self.options_parser
        parser.add_option('--is_a_test',
                          dest='is_a_test',
                          action='store_true',
                          default=False,
                          help='prefix the result folder with "test", enter '
                               'a debug mode if it fails and remove the job '
                               'afterward, unless it is a reservation')
        parser.add_option('--already_configured',
                          dest='already_configured',
                          action='store_true',
                          default=False,
                          help='if set, the OAR cluster is not re-configured')
        parser.add_option('--reservation_id',
                          help="Grid'5000 reservation job ID")
        parser.add_argument('experiment_config',
                            'The config JSON experiment description file')

    def setup_result_dir(self):
        is_a_test = self.options.is_a_test
        run_type = ""
        if is_a_test:
            run_type = "test_"
        self.result_dir = script_path + '/' + run_type + 'results_' + \
            time.strftime("%Y-%m-%d--%H-%M-%S")
        logger.info('result directory: {}'.format(self.result_dir))

    def run(self):
        """Run the experiment"""
        already_configured = self.options.already_configured
        reservation_job_id = int(self.options.reservation_id) \
            if self.options.reservation_id is not None else None
        is_a_test = self.options.is_a_test

        if is_a_test:
            logger.warn('THIS IS A TEST! This run will use only a few '
                        'resources')

        # make the result folder writable for all
        os.chmod(self.result_dir, 0o777)
        # Import configuration
        with open(self.args[0]) as config_file:
            config = json.load(config_file)
        # backup configuration
        copy(self.args[0], self.result_dir)

        site = config["grid5000_site"]
        resources = config["resources"]
        nb_experiment_nodes = config["nb_experiment_nodes"]
        walltime = str(config["walltime"])
        env_name = config["kadeploy_env_name"]
        workloads = config["workloads"]
        # check that the workloads exist (suppose that the same NFS mount
        # point is present on the remote and the local environment)
        for workload_file in workloads:
            with open(workload_file):
                pass
            # copy the workload files to the result dir
            copy(workload_file, self.result_dir)

        # define the workload parameters
        self.parameters = {'workload_filename': workloads}
        logger.info('Workloads: {}'.format(workloads))

        # define the iterator over the parameter combinations
        self.sweeper = ParamSweeper(os.path.join(self.result_dir, "sweeps"),
                                    sweep(self.parameters))

        # combinations may have been skipped by a previous run
        # (when resuming with -c result_dir)
        logger.info('Skipped parameters: ' +
                    '{}'.format(str(self.sweeper.get_skipped())))
        logger.info('Number of parameter combinations {}'.format(
            str(len(self.sweeper.get_remaining()))))
        logger.info('combinations {}'.format(
            str(self.sweeper.get_remaining())))

        if reservation_job_id is not None:
            jobs = [(reservation_job_id, site)]
        else:
            jobs = oarsub([(OarSubmission(resources=resources,
                                          job_type='deploy',
                                          walltime=walltime), site)])
        job_id, site = jobs[0]
        if job_id:
            try:
                logger.info("waiting job start %s on %s" % (job_id, site))
                wait_oar_job_start(job_id, site,
                                   prediction_callback=prediction_callback)
                logger.info("getting nodes of %s on %s" % (job_id, site))
                nodes = get_oar_job_nodes(job_id, site)
                # sort the nodes
                nodes = sorted(nodes, key=lambda node: node.address)
                # keep only the necessary nodes under the switch
                if nb_experiment_nodes > len(nodes):
                    raise RuntimeError('The number of nodes in the '
                                       'reservation ({}) does not match the '
                                       'requested resources '
                                       '({})'.format(len(nodes),
                                                     nb_experiment_nodes))
                nodes = nodes[:nb_experiment_nodes]
                logger.info("deploying nodes: {}".format(str(nodes)))
                deployed, undeployed = deploy(
                    Deployment(nodes, env_name=env_name),
                    check_deployed_command=already_configured)
                if undeployed:
                    logger.warn("NOT deployed nodes: {}".format(
                        str(undeployed)))
                    raise RuntimeError('Deployment failed')

                if not already_configured:
                    # install OAR
                    install_cmd = "apt-get update; apt-get install -y "
                    node_packages = "oar-node"
                    logger.info("installing OAR nodes: {}".format(
                        str(nodes[1:])))
                    install_oar_nodes = Remote(
                        install_cmd + node_packages,
                        nodes[1:],
                        connection_params={'user': '******'})
                    install_oar_nodes.start()

                    server_packages = (
                        "oar-server oar-server-pgsql oar-user "
                        "oar-user-pgsql postgresql python3-pip "
                        "libjson-perl postgresql-server-dev-all")
                    install_oar_sched_cmd = """
                    mkdir -p /opt/oar_sched; \
                    cd /opt/oar_sched; \
                    git clone https://github.com/oar-team/oar3.git; \
                    cd oar3; \
                    git checkout dce942bebc2; \
                    pip3 install -e .; \
                    cd /usr/lib/oar/schedulers; \
                    ln -s /usr/local/bin/kamelot; \
                    pip3 install psycopg2
                    """
                    logger.info("installing OAR server node: {}".format(
                        str(nodes[0])))
                    install_master = SshProcess(
                        install_cmd + server_packages + ";" +
                        install_oar_sched_cmd,
                        nodes[0],
                        connection_params={'user': '******'})
                    install_master.run()
                    install_oar_nodes.wait()

                    if not install_master.ok:
                        Report(install_master)

                    configure_oar_cmd = """
                    sed -i \
                        -e 's/^\(DB_TYPE\)=.*/\\1="Pg"/' \
                        -e 's/^\(DB_HOSTNAME\)=.*/\\1="localhost"/' \
                        -e 's/^\(DB_PORT\)=.*/\\1="5432"/' \
                        -e 's/^\(DB_BASE_PASSWD\)=.*/\\1="oar"/' \
                        -e 's/^\(DB_BASE_LOGIN\)=.*/\\1="oar"/' \
                        -e 's/^\(DB_BASE_PASSWD_RO\)=.*/\\1="oar_ro"/' \
                        -e 's/^\(DB_BASE_LOGIN_RO\)=.*/\\1="oar_ro"/' \
                        -e 's/^\(SERVER_HOSTNAME\)=.*/\\1="localhost"/' \
                        -e 's/^\(SERVER_PORT\)=.*/\\1="16666"/' \
                        -e 's/^\(LOG_LEVEL\)\=\"2\"/\\1\=\"3\"/' \
                        -e 's#^\(LOG_FILE\)\=.*#\\1="{result_dir}/oar.log"#' \
                        -e 's/^\(JOB_RESOURCE_MANAGER_PROPERTY_DB_FIELD\=\"cpuset\".*\)/#\\1/' \
                        -e 's/^#\(CPUSET_PATH\=\"\/oar\".*\)/\\1/' \
                        -e 's/^\(FINAUD_FREQUENCY\)\=.*/\\1="0"/' \
                        /etc/oar/oar.conf
                    """.format(result_dir=self.result_dir)
                    configure_oar = Remote(configure_oar_cmd, nodes,
                                           connection_params={'user': '******'})
                    configure_oar.run()
                    logger.info("OAR is configured on all nodes")

                    # Configure server
                    create_db = "oar-database --create --db-is-local"
                    config_oar_sched = (
                        "oarnotify --remove-queue default;"
                        "oarnotify --add-queue default,1,kamelot")
                    start_oar = "systemctl start oar-server.service"
                    logger.info("configuring OAR database: {}".format(
                        str(nodes[0])))
                    config_master = SshProcess(
                        create_db + ";" + config_oar_sched + ";" + start_oar,
                        nodes[0],
                        connection_params={'user': '******'})
                    config_master.run()

                    # propagate SSH keys
                    logger.info("configuring OAR SSH")
                    oar_key = "/tmp/.ssh"
                    Process('rm -rf ' + oar_key).run()
                    Process(
                        'scp -o BatchMode=yes -o PasswordAuthentication=no '
                        '-o StrictHostKeyChecking=no '
                        '-o UserKnownHostsFile=/dev/null '
                        '-o ConnectTimeout=20 -rp -o User=root ' +
                        nodes[0].address + ":/var/lib/oar/.ssh"
                        ' ' + oar_key).run()
                    # Get(nodes[0], "/var/lib/oar/.ssh", [oar_key],
                    #     connection_params={'user': '******'}).run()
                    Put(nodes[1:], [oar_key], "/var/lib/oar/",
                        connection_params={'user': '******'}).run()

                    add_resources_cmd = """
                    oarproperty -a cpu || true; \
                    oarproperty -a core || true; \
                    oarproperty -c -a host || true; \
                    oarproperty -a mem || true; \
                    """
                    for node in nodes[1:]:
                        add_resources_cmd = add_resources_cmd + \
                            "oarnodesetting -a -h {node} -p host={node} -p cpu=1 -p core=4 -p cpuset=0 -p mem=16; \\\n".format(
                                node=node.address)

                    add_resources = SshProcess(
                        add_resources_cmd,
                        nodes[0],
                        connection_params={'user': '******'})
                    add_resources.run()

                    if add_resources.ok:
                        logger.info("oar is now configured!")
                    else:
                        raise RuntimeError(
                            "error in the OAR configuration: Abort!")

                # TODO: back up the OAR configuration

                # Do the replay
                logger.info('beginning the replay')
                while len(self.sweeper.get_remaining()) > 0:
                    combi = self.sweeper.get_next()
                    workload_file = os.path.basename(
                        combi['workload_filename'])
                    oar_replay = SshProcess(
                        script_path + "/oar_replay.py " +
                        combi['workload_filename'] + " " + self.result_dir +
                        " oar_gant_" + workload_file,
                        nodes[0])
                    oar_replay.stdout_handlers.append(self.result_dir + '/' +
                                                      workload_file + '.out')
                    logger.info("replaying workload: {}".format(combi))
                    oar_replay.run()
                    if oar_replay.ok:
                        logger.info("Replay workload OK: {}".format(combi))
                        self.sweeper.done(combi)
                    else:
                        logger.info("Replay workload NOT OK: {}".format(combi))
                        self.sweeper.cancel(combi)
                        raise RuntimeError("error in the OAR replay: Abort!")
            except:
                traceback.print_exc()
                ipdb.set_trace()
            finally:
                if is_a_test:
                    ipdb.set_trace()
                if reservation_job_id is None:
                    logger.info("delete job: {}".format(jobs))
                    oardel(jobs)
class DVFS(Engine):

    def __init__(self, result_dir, cluster, site):
        Engine.__init__(self)
        self.result_dir = result_dir
        self.cluster = cluster
        self.site = site

    def run(self):
        """Inherited method, put here the code for running the engine"""
        self.define_parameters()
        self.run_xp()

    def define_parameters(self):
        nbNodes = len(self.cluster)
        # build parameters and make the n_core list per benchmark
        freqList = [2534000, 2000000, 1200000]
        n_nodes = float(len(self.cluster))
        max_core = SshProcess('cat /proc/cpuinfo | grep -i processor | wc -l',
                              self.cluster[0],
                              connection_params={'user': '******'}
                              ).run().stdout
        max_core = n_nodes * float(max_core)
        # list comprehensions rather than filter() so the results are
        # plain lists on Python 3
        even = [i for i in takewhile(lambda i: i < max_core,
                                     (2**i for i in count(0, 1)))
                if i > n_nodes]
        powerTwo = [i for i in takewhile(lambda i: i < max_core,
                                         (i**2 for i in count(0, 1)))
                    if i > n_nodes]

        # Define parameters
        self.parameters = {
            'Repeat': [1],
            "Freq": [2534000],
            "NPBclass": ['C'],
            "Benchmark": {
                # 'ft': {'n_core': even},
                # 'ep': {'n_core': even},
                # 'lu': {'n_core': even},
                # 'is': {'n_core': even},
                # 'sg': {'n_core': even},
                # 'bt': {'n_core': powerTwo},
                'sp': {
                    'n_core': powerTwo
                }
            }
        }
        logger.info(self.parameters)

        # make all possible parameter combinations
        self.sweeper = ParamSweeper(os.path.join(self.result_dir, "sweeps"),
                                    sweep(self.parameters))
        logger.info('Number of parameter combinations %s',
                    len(self.sweeper.get_remaining()))

    def run_xp(self):
        """Iterate over the parameters and execute the bench"""
        master = self.cluster[0]
        opt = ''
        while len(self.sweeper.get_remaining()) > 0:
            # Take a combination from the sweeper
            comb = self.sweeper.get_next()
            logger.info('Processing new combination %s' % (comb,))
            try:
                # metrics from the Linux sar tool, works with the clock
                def takeMetric(path, startTime, endTime,
                               metric=['cpu', 'mem', 'disk', 'swap',
                                       'network']):
                    opt = ''
                    cmd_template_sar = ("sar -f /var/log/sysstat/sa* "
                                        "-{opt} -s {startTime} -e {endTime}")
                    for met in metric:
                        if met == 'cpu':
                            opt = 'u'
                        elif met == 'mem':
                            opt = 'r'
                        elif met == 'disk':
                            opt = 'dp'
                        elif met == 'swap':
                            opt = 'S'
                        elif met == 'network':
                            opt = 'n DEV'
                        cmd = cmd_template_sar.format(opt=opt,
                                                      startTime=startTime,
                                                      endTime=endTime)
                        for host in self.cluster:
                            hE = SshProcess(cmd, host,
                                            connection_params={'user': '******'})
                            hE.run()
                            stdMetric = host + '-' + met + '.txt'
                            with open(os.path.join(path, stdMetric),
                                      "w") as sout:
                                sout.write(hE.stdout)

                # Set CPU freq and policy according to the current combination
                cmd_template_Freq_Policy = "cpufreq-set -r -g {policy}"
                cmd_template_Freq = "cpufreq-set -r -f {freq}"
                if comb['Freq'] == 'OnDemand':
                    cmd_freq_policy = cmd_template_Freq_Policy.format(
                        policy='ondemand')
                    Remote(cmd_freq_policy, master,
                           connection_params={'user': '******'}).run()
                elif comb['Freq'] == 'conservative':
                    cmd_freq_policy = cmd_template_Freq_Policy.format(
                        policy='conservative')
                    Remote(cmd_freq_policy, master,
                           connection_params={'user': '******'}).run()
                else:
                    # a numeric frequency requires the userspace governor
                    cmd_freq_policy = cmd_template_Freq_Policy.format(
                        policy='userspace')
                    Remote(cmd_freq_policy, master,
                           connection_params={'user': '******'}).run()
                    cmd_freq = cmd_template_Freq.format(freq=comb['Freq'])
                    Remote(cmd_freq, master,
                           connection_params={'user': '******'}).run()

                # build command
                src = 'source /opt/intel-performance-snapshoot/apsvars.sh'
                cmd_mpirun_template = (
                    "mpirun {opt} -f /root/cluster.txt -np {pr1} "
                    "aps -r '/tmp/log/' "
                    "/tmp/NPB/npb-mpi/bin/{typeNPB}.{NPBclass}.{pr2}")
                cmd_mpirun = cmd_mpirun_template.format(
                    opt='',
                    pr1=comb['n_core'],
                    typeNPB=comb['Benchmark'],
                    NPBclass=comb['NPBclass'],
                    pr2=comb['n_core'])
                cmd = "{}; /tmp/NPB/bin/runMPI.sh '{}' '{}'".format(
                    src, cmd_mpirun, slugify(comb))
                curPath = self.result_dir + slugify(comb)

                # run MPI through an execo remote SshProcess
                def runMpi(cmd):
                    act = SshProcess(cmd, master,
                                     connection_params={'user': '******'},
                                     shell=True)
                    act.run()
                    if not os.path.exists(curPath):
                        os.makedirs(curPath)
                    with open(os.path.join(curPath, "stdout.txt"),
                              "a+") as sout, \
                         open(os.path.join(curPath, "stderr.txt"),
                              "w") as serr:
                        sout.write(act.stdout)
                        serr.write(act.stderr)
                    return act.ok

                # start the clock and exec the command on the master node
                time.sleep(5)
                startUnix = int(time.time())
                start24Hour = datetime.datetime.fromtimestamp(
                    startUnix).strftime('%H:%M:%S')
                task1 = runMpi(cmd)
                endUnix = int(time.time())
                end24Hour = datetime.datetime.fromtimestamp(
                    endUnix).strftime('%H:%M:%S')
                time.sleep(5)

                with open(os.path.join(curPath, "executionTime.txt"),
                          "w") as sout:
                    sout.write('ExecTime:{}\nStartDate:{}\nEndDate:{}\n'.format(
                        str(endUnix - startUnix), start24Hour, end24Hour))
                takeMetric(curPath, start24Hour, end24Hour,
                           ['cpu', 'mem', 'disk', 'swap', 'network'])

                # collect power from kwapi, the Grid'5000 infrastructure tool
                for hostname in self.cluster:
                    powerOut = '{}_power'.format(hostname)
                    collect_metric(startUnix, endUnix, 'power', curPath,
                                   self.site, powerOut, hostname)

                st = '/tmp/out/' + slugify(comb)
                intelAppPerf = str(st + '.html')
                # get the data from ['Application Performance Snapshot',
                # 'Storage Performance Snapshot']
                # https://software.intel.com/en-us/performance-snapshot
                Get(master, [intelAppPerf], curPath,
                    connection_params={'user': '******'}).run()

                if task1:
                    logger.info("comb ok: %s" % (comb,))
                    self.sweeper.done(comb)
                    continue
            except OSError as err:
                print("OS error: {0}".format(err))
            except ValueError:
                print("Could not convert data to an integer.")
            except:
                print("Unexpected error:", sys.exc_info()[0])
                raise

            logger.info("comb NOT ok: %s" % (comb,))
            self.sweeper.cancel(comb)
@enostask()
def backup(env=None):
    LOG.info(f"Running backup on {env['roles']}")


@enostask()
def destroy(env=None):
    LOG.info(f"Running destroy on {env['roles']}")


# Iterate over a set of parameters
parameters = {"param1": [1, 4], "param2": ["a", "b"]}
sweeps = sweep(parameters)
sweeper = ParamSweeper(
    persistence_dir=str(Path("sweeps")), sweeps=sweeps, save_sweeps=True
)
parameter = sweeper.get_next()
while parameter:
    try:
        deploy()
        bench(parameter)
        backup()
        sweeper.done(parameter)
    except Exception as e:
        traceback.print_exc()
        sweeper.skip(parameter)
    finally:
        destroy()
        parameter = sweeper.get_next()
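The loop above relies on three sweeper states: a combination is in flight after `get_next()`, retired for good by `done()`, and set aside without retry by `skip()`. A minimal in-memory stand-in (none of ParamSweeper's disk persistence, names chosen for illustration) behaves like this sketch:

```python
class ToySweeper:
    """In-memory stand-in for the ParamSweeper state machine."""

    def __init__(self, sweeps):
        self.remaining = list(sweeps)
        self.done_list = []     # successfully treated combinations
        self.skipped = []       # set aside, not retried this run

    def get_next(self):
        # hand out the next untreated combination, or None when exhausted
        return self.remaining.pop(0) if self.remaining else None

    def done(self, comb):
        self.done_list.append(comb)

    def skip(self, comb):
        self.skipped.append(comb)


sweeper = ToySweeper([{"p": 1}, {"p": 2}, {"p": 3}])
comb = sweeper.get_next()
while comb:
    if comb["p"] == 2:
        sweeper.skip(comb)      # e.g. the bench raised an exception
    else:
        sweeper.done(comb)
    comb = sweeper.get_next()
```

The real ParamSweeper writes these three sets to `persistence_dir` after every transition, which is what lets an interrupted campaign resume without redoing finished combinations.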