TheReadingMachine - A Mean, Lean, Reading Machine

This repository contains the complete implementation of TheReadingMachine, a program to scrape, process, score, and model the sentiment of news articles in order to predict future trends in commodity prices.

Setup

First of all, make sure the database the_reading_machine.db is present in the data directory.

Then run the setup script:

source setup.sh

This will set up a virtualenv, install the thereadingmachine package, and install any other dependencies listed in requirements.txt.

All required nltk datasets will also be downloaded into the data directory.

Next, it will configure airflow and set up the airflow database (airflow.db), which stores all information about pipeline scheduling.
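For reference, nltk data can be fetched into a local directory from Python along the following lines (a minimal sketch; the exact corpora and download calls used by setup.sh are assumptions):

import nltk

# Download corpora into the local data directory rather than the
# default ~/nltk_data location. The corpus names are illustrative;
# setup.sh may fetch a different set.
nltk.download('punkt', download_dir='data')
nltk.download('stopwords', download_dir='data')

# Make sure nltk can find the locally stored data at run time.
nltk.data.path.append('data')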

Structure

The repository is structured as follows:

root/
  ├── airflow/
  ├── data/
  ├── pipeline/
  ├── thereadingmachine/
  ├── sandbox/  
  ├── ...
  └── requirements.txt

airflow

This folder contains the configuration, logs, and the database for running airflow.

When a new procedure is added to the pipeline, it must also be registered in the DAG file.
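Registering a step looks roughly like the following (a sketch only; the dag_id, schedule, task name, and script path are illustrative assumptions, not the values used in this repository):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical DAG definition; the real DAG file sets its own
# dag_id, start date, and schedule.
dag = DAG(dag_id='the_reading_machine',
          start_date=datetime(2017, 1, 1),
          schedule_interval='@daily')

# A new procedure becomes a task that runs its processor script.
score_sentiment = BashOperator(
    task_id='score_sentiment',
    bash_command='python pipeline/sentiment_scoring/processor.py',
    dag=dag)

# Wire the task into the existing flow, e.g.
# scrape_articles >> score_sentiment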

data

This folder contains all the required data. This includes the database (the_reading_machine.db) and all supplementary data such as the nltk corpora.

pipeline

All processes that will eventually be scheduled in the pipeline will be implemented under this folder.

The standard structure is a sub-folder containing two files: controller.* and processor.*. The controller.* file contains all the class and function definitions, while the processor.* file loads those definitions and performs the actual processing.

This design allows maximum flexibility during development; the controller.* classes and functions will eventually be refactored into thereadingmachine.
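A minimal sketch of the convention (the step name, function, and sample data below are illustrative, not actual pipeline code):

# pipeline/example_step/controller.py -- definitions only.
def harmonise_articles(raw_articles):
    '''Illustrative definition; a real controller holds all classes and functions for the step.'''
    return [article.strip().lower() for article in raw_articles]

# pipeline/example_step/processor.py -- loads the definitions and runs them.
import controller

if __name__ == '__main__':
    articles = ['  Wheat Prices Surge  ', '  Maize Exports Fall  ']
    print(controller.harmonise_articles(articles))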

thereadingmachine

This will eventually become a Python package once the code in the pipeline is refactored at the end of the development phase.

sandbox

Any old, obsolete, or unused code is moved here for future reference.

Starting and killing the pipeline

There are two scripts provided to start and kill the pipeline.

To start the pipeline, simply execute:

./start_pipeline.sh

You can then navigate to http://localhost:8080 to see the web interface of the pipeline.

To kill the pipeline, simply enter the following in the command line:

./kill_pipeline.sh
