
Task Recipes

In Task Recipes, we build a dynamic data pipeline API by ordering tasks in sequence. A task is a unit of operation within the pipeline. Every task must extend 'pipeline.task.Task' and override its execute method.
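The exact signature of execute is not shown in this README. The minimal sketch below assumes it receives a SparkSession plus the DataFrame produced by the previous task and returns a DataFrame for the next one; the UppercaseNamesTask name and the 'name' column are purely illustrative.

```python
# Sketch of a custom task, under the assumptions stated above.
# 'UppercaseNamesTask' and the 'name' column are hypothetical examples.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

from pipeline.task import Task


class UppercaseNamesTask(Task):
    def execute(self, spark: SparkSession, df: DataFrame) -> DataFrame:
        # Apply a simple transformation and hand the result to the next task.
        return df.withColumn("name", F.upper(F.col("name")))
```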

The tasks are passed as ordered arguments to the 'pipeline.executor.py' script.

An example argument list:

<APP_NAME> HTTP_READ <HTTP/HTTPS path> RECIPES_TRANSFORM WRITE json file:///Users/hasif/Learning/recipe/ BEEF_RECIPE_TRANSFORM PARTITION_WRITE hdfs://

  • <APP_NAME> - Application name.
  • HTTP_READ <HTTP/HTTPS path> - Reads data from the HTTP/HTTPS URL; the data type (e.g. json) is also passed as an argument.
  • RECIPES_TRANSFORM - Applies the transformation given in this exercise to the entire dataset.
  • WRITE file://<LOCAL_PATH> - Writes the result of the previous transformation to the given local path in the specified format.
  • BEEF_RECIPE_TRANSFORM - Applies the transformation given in this exercise to the dataset after filtering the 'ingredients' column for the keyword 'beef' (see the sketch after this list).
  • PARTITION_WRITE hdfs:// - Writes to HDFS in the specified format and location.
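The implementation of BEEF_RECIPE_TRANSFORM is not shown in this README. A minimal sketch of the filtering step it describes, assuming a DataFrame with a string column named 'ingredients' and a case-insensitive substring match (both assumptions, not taken from the repository):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("beef_filter_sketch").getOrCreate()

# Toy data standing in for the recipes dataset; only the 'ingredients'
# column matters for this sketch.
recipes_df = spark.createDataFrame(
    [("Chili", "beef, beans, tomato"), ("Salad", "lettuce, tomato")],
    ["name", "ingredients"],
)

# Case-insensitive substring match on 'ingredients' -- the matching
# strategy is an assumption, not code from the repository.
beef_df = recipes_df.filter(F.lower(F.col("ingredients")).contains("beef"))
beef_df.show()
```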

For example:

python executor.py ingest_recipe HTTP_READ json <INPUT_PATH> WRITE json file://<LOCAL_PATH>

python executor.py ingest_recipe HTTP_READ <INPUT_PATH> BEEF_RECIPE_TRANSFORM PARTITION_WRITE json hdfs://<HDFS_PATH> difficulty

python executor.py ingest_recipe HTTP_READ json <INPUT_PATH> RECIPES_TRANSFORM WRITE json file://<LOCAL_PATH> BEEF_RECIPE_TRANSFORM PARTITION_WRITE json hdfs://<HDFS_PATH> difficulty

The keywords associated with each task are listed below; a sketch of how a keyword might be resolved to its task class follows the list.

  • "HDFS_READ": "pipeline.task.Reader",
  • "HTTP_READ": "pipeline.task.HttpReader",
  • "WRITE": "pipeline.task.Writer",
  • "PARTITION_WRITE": "pipeline.task.PartitionWriter",
  • "FILE_WRITE": "pipeline.task.Writer",
  • "RECIPES_TRANSFORM": "pipeline.task.TransformRecipes",
  • "BEEF_RECIPE_TRANSFORM": "pipeline.task.BeefRecipes",
  • "EMAIL": "pipeline.task.EmailTask",
  • "SLACK": "pipeline.task.SlackTask",
  • "KAFKA": "pipeline.task.KafkaTask"
