Working with Luigi Workflows

This project shows how Luigi workflows can be used as data science pipelines. A well-formed dataset is used so that the data science itself doesn't get in the way of demonstrating how Luigi works.

Requirements

The project was created using PyCharm, but any IDE should work. The code imports the following packages:

  • luigi
  • numpy
  • pandas
  • fpdf
  • re
  • gc
  • pickle
  • sklearn
  • nltk

You can use pip install (or your preferred package manager) to install the third-party packages. Note that re, gc, and pickle are part of the Python standard library and don't need to be installed, and sklearn is distributed on PyPI as scikit-learn.

How it Works

All the code lives in two Python files within the workflows folder. The first workflow (workflow_one) shows how Luigi works with a very simple example; the real workflow, and the genesis of this project, is workflow_two.
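
To give a flavour of what a simple Luigi workflow looks like, here is a minimal sketch of a two-task pipeline. This is not the actual contents of workflow_one; the task names and file paths are made up for illustration.

```python
import luigi


class MakeGreeting(luigi.Task):
    """Write a greeting to a local file (hypothetical example task)."""

    def output(self):
        return luigi.LocalTarget('data/greeting.txt')

    def run(self):
        with self.output().open('w') as out_file:
            out_file.write('Hello, Luigi!')


class ShoutGreeting(luigi.Task):
    """Depend on MakeGreeting and upper-case its output."""

    def requires(self):
        return MakeGreeting()

    def output(self):
        return luigi.LocalTarget('data/greeting_upper.txt')

    def run(self):
        with self.input().open() as in_file, self.output().open('w') as out_file:
            out_file.write(in_file.read().upper())


if __name__ == '__main__':
    # Run with: python this_file.py ShoutGreeting --local-scheduler
    luigi.run()
```

Luigi figures out the execution order from requires(), and skips any task whose output() target already exists, which is what makes the workflow resumable.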

The basic notion, from a data science perspective, is to take a corpus, vectorize it, split it into test and train sets, pickle it (for later use), then use logistic regression to build a predictive model.

The dataset used is from Reddit.
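
As a rough sketch of those data science steps (not the actual workflow_two code), the core of the pipeline looks something like the following. The column names 'txt' and 'con', and the assumption that the JSON file is newline-delimited, are illustrative guesses; adjust them to the real dataset schema.

```python
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed schema: a text column ('txt') and a binary "controversial" label ('con').
df = pd.read_json('data/source/controversial-comments.json', lines=True)
texts, labels = df['txt'], df['con']

# Vectorize the corpus.
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(texts)

# Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

# Pickle the split data for later use by downstream tasks.
with open('data/train_test_split.pickle', 'wb') as f:
    pickle.dump((X_train, X_test, y_train, y_test), f)

# Build a predictive model with logistic regression.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print('Test accuracy:', model.score(X_test, y_test))
```

In the Luigi version, each of these steps becomes its own task with its own output target, so a failed run can pick up where it left off.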

Running Luigi

You need to pass arguments to Luigi for it to run. You can supply them on the command line when running the Python file, or add them to the .py file's run configuration in your IDE.

To see the entire pipeline run with PDF outputs from Luigi, run workflow_two.py.
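
As an alternative to command-line arguments, the pipeline can also be kicked off programmatically with luigi.build(). FinalTask below is a hypothetical name; substitute the last task actually defined in workflow_two.py.

```python
import luigi

# FinalTask is a hypothetical name -- substitute the last task defined
# in workflow_two.py. local_scheduler=True avoids needing the luigid daemon.
from workflow_two import FinalTask

if __name__ == '__main__':
    luigi.build([FinalTask()], local_scheduler=True)
```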

NOTE

You need the controversial-comments.json dataset from Reddit, placed in the /data/source directory.
