Job.DELEGATEES['my_shell'] = Shell()

wrapper = JobBlock(
    'entry job',
    '''
    this job demonstrates how to use delegatees, such as DFS or Pig
    ''')
wrapper.add_plan(Job.INIT_JOB, Job.START, 'hadoop delegatee')
wrapper.add_plan('hadoop delegatee', Job.DONE, 'wrong command')
wrapper.add_plan('wrong command', Job.DONE, Job.LAST_JOB)

'''
prepare the jobs
'''
j = JobNode(id='hadoop delegatee', desc='''
    cat some file on the dfs
    (to run this tutorial, you have to prepare your own data on the dfs)
    ''')
j.set_callback(delegated_job)
wrapper.add_sub_job(j)
# ==
j = JobNode(id='wrong command', desc='''
    execute some erroneous command
    ''')
j.set_callback(failed_delegated_job)
wrapper.add_sub_job(j)

'''
run this tutorial on the Hadoop system
'''
# ==
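The `Shell()` delegatee registered above hands commands off to the local shell. As a rough, self-contained sketch of what such a delegatee might look like (the `MiniShell` class and its `run` method are illustrative assumptions, not the library's actual API):

```python
import subprocess

class MiniShell:
    """A minimal stand-in for a shell delegatee: run a command,
    capture its output, and report success or failure."""

    def run(self, cmd):
        # shell=True lets the command be passed as a single string,
        # the way a command like 'hadoop fs -cat <path>' is written
        proc = subprocess.run(cmd, shell=True,
                              capture_output=True, text=True)
        return proc.returncode, proc.stdout

rc, out = MiniShell().run('echo hello')
# rc is 0 on success; a wrong command yields a non-zero return code,
# which is how the 'wrong command' job above would detect failure
```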
'''
wrapper = JobBlock(
    'entry job',
    '''
    this job demonstrates how to use the configuration mechanism
    for input data
    ''')
wrapper.add_plan(Job.INIT_JOB, Job.START, 'hello Serious')
wrapper.add_plan('hello Serious', Job.DONE, 'hello Kidding')
wrapper.add_plan('hello Kidding', Job.DONE, Job.LAST_JOB)

'''
first, we build a template/prototype job for the hello jobs and assign a
key-value pair as input. the input can be accessed in the callback via
self.get_input(<key_of_the_input>). note that we bracket the name in the
config value. it's a variablized config; we will explain it later.
'''
# ==
j_temp = JobNode(id='template', desc='say hello to someone')
j_temp.need_input('msg', 'hello! Mr.[name]')
j_temp.set_callback(hello_job)

'''
instead of directly adding the template job into the wrapper
'''
# wrapper.add_sub_job(j_temp)
'''
we make two copies from the template and give each the correct id and
description. then, we assign the name to each job. you may guess the
result - the input of the template job, "msg", will be "completed" by
replacing the "[name]" with the actual value we assign to each job.
'''
# ==
j = deepcopy(j_temp)
j.id = 'hello Serious'
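The bracket substitution described above can be pictured as a plain placeholder replacement over the config string. This helper only illustrates the idea, not the library's implementation:

```python
import re

def resolve(template, variables):
    """Replace each [key] placeholder in template with its value;
    leave unknown placeholders untouched."""
    return re.sub(r'\[([^\]]+)\]',
                  lambda m: str(variables.get(m.group(1), m.group(0))),
                  template)

# the 'msg' input of the 'hello Serious' copy, after assigning name='Serious'
print(resolve('hello! Mr.[name]', {'name': 'Serious'}))
# hello! Mr.Serious
```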
wrapper = JobBlock(
    'entry job',
    '''
    this job demonstrates how to utilize variablized configuration
    ''')
wrapper.add_plan(Job.INIT_JOB, Job.START, 'hello Serious')
wrapper.add_plan('hello Serious', Job.DONE, 'hello Kidding')
wrapper.add_plan('hello Kidding', Job.DONE, 'Serious')
wrapper.add_plan('Serious', Job.DONE, 'Kidding')
wrapper.add_plan('Kidding', Job.DONE, Job.LAST_JOB)

'''
same as the previous tutorial, but we declare the output, 'msg_to_[name]',
which represents the message to be kept. the callbacks are also modified.
'''
# ==
j_temp = JobNode(id='hello template', desc='say hello to someone')
j_temp.need_input('msg', 'hello! Mr.[name]')
j_temp.need_output('msg_to_[name]')
j_temp.set_callback(hello_job)

'''
remember we mentioned in tutorial_04 that all the inputs should be
explicitly declared. the same goes for outputs. actually, it's fine if you
don't declare the outputs; the process will still be executed correctly.
however, this practice improves the readability of the code. a person who
just takes over your code may not be familiar with the flow. the declared
outputs will be listed in the generated document and help them catch the
key concepts of the job.
'''

'''
same as the previous tutorial
'''
wrapper.add_plan(Job.INIT_JOB, Job.START, "job0")
wrapper.add_plan("job0", Job.DONE, "job1")
wrapper.add_plan("job1", Job.DONE, Job.LAST_JOB)

"""
now we start to plan the details of each job. each job should have an id
and a paragraph of desc(ription), which will be generated into the
document, so you won't be bothered preparing any other documents. this
mechanism helps keep the code alive.

the jobs we need here are very simple ones. let's say we want to print
something in each job, so we don't need to prepare any input. (we leave
this to other tutorial codes.) so we assign a "callback" method,
normal_job, to each job. now you can check the callbacks at the beginning
of this code.
"""
# ==
j = JobNode(id="job0", desc="desc0")
j.set_callback(normal_job)
wrapper.add_sub_job(j)
# ==
j = JobNode(id="job1", desc="desc1")
j.set_callback(normal_job)
wrapper.add_sub_job(j)
# ==
"""
things are almost done. all we need to do is to trigger the execution!
check the result to re-examine the flow of the process
"""
# ==
job_id, state = wrapper.execute()
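Under the assumption that each `add_plan(job, state, next_job)` call maps a (job, state) pair to the next job id, the engine's walk over the plan can be sketched as below. The dict-based plan and the `INIT`/`LAST` sentinels are simplifications for illustration, not the real internals:

```python
INIT, START, DONE, LAST = 'INIT', 'START', 'DONE', 'LAST'

def execute(plan, callbacks):
    """Walk the plan from INIT to LAST, firing each job's callback."""
    job, state = INIT, START
    while (job, state) in plan:
        job = plan[(job, state)]
        if job == LAST:
            break
        state = callbacks[job]()   # each callback returns the next state
    return job, state

# the same plan as the wrapper above: INIT -> job0 -> job1 -> LAST
plan = {(INIT, START): 'job0', ('job0', DONE): 'job1', ('job1', DONE): LAST}
calls = []
cbs = {'job0': lambda: (calls.append('job0'), DONE)[1],
       'job1': lambda: (calls.append('job1'), DONE)[1]}
job_id, state = execute(plan, cbs)
# the callbacks fire in planned order: job0, then job1
```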
'''
# ==
j = ParallelJobBlock(id='para block1', desc='para block1')
j.add_papallel_plan('job0', 'job1')
wrapper.add_sub_job(j)

'''
then, we define the inner JobNodes. this time, we want to let each job
print something and sleep for a while a few times. because both parallel
jobs will print messages, using one shared buffer would result in a mess.
therefore, the flow engine prepares one buffer for each parallel job. at
the end of the job, the parent job dumps the children's buffers
sequentially. (first done, first dumped)
'''
# ==
j_sub = JobNode(id='job0', desc='desc0')
j_sub.set_callback(lazy_job)
j.add_sub_job(j_sub)
# ==
j_sub = JobNode(id='job1', desc='desc1')
j_sub.set_callback(lazy_job)
j.add_sub_job(j_sub)
# ==
'''
check the result to re-examine the flow of the process
'''
# ==
job_id, state = wrapper.execute()
#raw_input()
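The per-job buffering described above can be mimicked with plain threads: each worker writes into its own buffer, and the parent dumps the buffers only after the workers finish, so the two output streams never interleave. This is an illustration of the idea, not the engine's actual code:

```python
import io
import threading

def lazy_worker(name, buf):
    # each parallel job writes to its own buffer instead of stdout,
    # so concurrent output cannot interleave
    for i in range(3):
        buf.write('%s says %d\n' % (name, i))

buffers = [io.StringIO(), io.StringIO()]
threads = [threading.Thread(target=lazy_worker, args=('job%d' % i, buf))
           for i, buf in enumerate(buffers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# the parent dumps each child's buffer as one intact chunk
for buf in buffers:
    print(buf.getvalue(), end='')
```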
wrapper = JobBlock(
    'entry job',
    '''
    this job demonstrates how to use the dry run mechanism
    ''')
wrapper.add_plan(Job.INIT_JOB, Job.START, 'foo')
wrapper.add_plan('foo', Job.DONE, 'block')
wrapper.add_plan('block', Job.DONE, 'fob')
wrapper.add_plan('fob', Job.DONE, Job.LAST_JOB)

'''
now, we enable a secret switch to put the whole process in dry run mode
'''
wrapper.set_dry_run(True)

'''
prepare the sea of jobs
'''
j = JobNode(id='foo', desc='''
    foo
    ''')
j.set_callback(foo_job)
wrapper.add_sub_job(j)
# ==
j = JobBlock(id='block', desc='''
    block
    ''')
j.add_plan(Job.INIT_JOB, Job.START, 'bar')
j.add_plan('bar', Job.DONE, 'foobar')
j.add_plan('foobar', Job.DONE, Job.LAST_JOB)
# --
j_sub = JobNode(id='bar', desc='''
    bar
    ''')
j_sub.set_callback(foo_job)
j.add_sub_job(j_sub)
# --
j_sub = JobNode(id='foobar', desc='''
    foobar
    ''')
j_sub.set_callback(foobar_job)
j.add_sub_job(j_sub)
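A dry run switch like `set_dry_run(True)` above typically makes every job report what it would do instead of actually doing it. A minimal sketch of the pattern (the function and its return strings are illustrative, not the library's API):

```python
def run_job(job_id, action, dry_run=False):
    """Execute action, or only announce it when dry_run is set."""
    if dry_run:
        # in dry run mode, the callback is never invoked
        return '[dry run] would execute %s' % job_id
    return action()

log = run_job('foo', lambda: 'executed foo', dry_run=True)
# '[dry run] would execute foo'
```

This lets you verify the planned flow (INIT -> foo -> block -> fob -> LAST above) before any callback touches real data.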
for key in configs_for_jobs.keys():
    Job.set_global(key, CFG[key])

wrapper = JobBlock(
    'entry job',
    '''
    this job demonstrates how to use the config management module
    ''')
wrapper.add_plan(Job.INIT_JOB, Job.START, 'foo')
wrapper.add_plan('foo', Job.DONE, Job.LAST_JOB)

'''
we could get the configs we just set as globals by giving the key without
a value, or we could put them into some other input.

here we also introduce another usage of output: in tutorial_04, we set the
key of the output without a value; that's a kind of declaration to exclaim
'we will put some value with that key as the output' (and later jobs can
access it as input). this time, we do give a value to the output key
because we want the job to output something to the path we expect.
'''
j = JobNode(id='foo', desc='''
    foo
    ''')
j.need_input('a very long path blah blah')
j.need_input(
    'composite path',
    '[another very long path blah blah]/append_with_a_sub_directory')
j.need_output('output_path', '[yet another very long path blah blah]')
j.set_callback(foo_job)
wrapper.add_sub_job(j)

job_id, state = wrapper.execute()
#raw_input()
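The global-config lookup described above (give a key without a value, and the job resolves it from the globals set earlier) can be sketched like this. `GLOBALS`, `set_global`, and the tiny `FooJob` class are simplified stand-ins for the library's behavior:

```python
GLOBALS = {}

def set_global(key, value):
    GLOBALS[key] = value

class FooJob:
    def __init__(self):
        self.inputs = {}

    def need_input(self, key, value=None):
        # no value means: resolve it from the globals at access time
        self.inputs[key] = value

    def get_input(self, key):
        value = self.inputs.get(key)
        return value if value is not None else GLOBALS.get(key)

set_global('a very long path blah blah', '/data/some/long/path')
j = FooJob()
j.need_input('a very long path blah blah')   # key only, no value
print(j.get_input('a very long path blah blah'))
# /data/some/long/path
```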
'''
first, with the top-down design strategy, we define a JobBlock, which is
like the wrapper, with its own plan.
'''
# ==
j = JobBlock(id='block1', desc='block1')
j.add_plan(Job.INIT_JOB, Job.START, 'job0')
j.add_plan('job0', Job.DONE, 'job1')
j.add_plan('job1', Job.DONE, Job.LAST_JOB)
wrapper.add_sub_job(j)

'''
then, we define the inner JobNodes (same as the previous tutorial)
'''
# ==
j_sub = JobNode(id='job0', desc='desc0')
j_sub.set_callback(normal_job)
j.add_sub_job(j_sub)
# ==
j_sub = JobNode(id='job1', desc='desc1')
j_sub.set_callback(normal_job)
j.add_sub_job(j_sub)
# ==
'''
BTW, here's a small tip. while designing a large flow, you may want to
organize your code well by putting related things together. but sometimes,
you can't assign the value you want right after the job is initiated,
because that value will be calculated/generated later. we provide the
flexibility to delay the manipulation.
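The delayed manipulation mentioned above can be pictured as storing a callable instead of a concrete value and resolving it only at execution time. This sketch is an assumption about the pattern, not the library's API:

```python
class DeferredInput:
    """Hold a thunk; resolve it only when the value is finally needed."""

    def __init__(self, thunk):
        self.thunk = thunk

    def resolve(self):
        return self.thunk()

computed = {}
# declare the input now, next to the related job definitions...
inp = DeferredInput(lambda: computed['path'])
# ...even though the value is only produced later in the flow
computed['path'] = '/generated/later'
print(inp.resolve())
# /generated/later
```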