Habakkuk

Habakkuk is an application that filters tweets containing Christian Bible references. The goal is to capture the book name, chapter number, verse number, and tweet text for further analysis.

Django

This project uses Django for project organization. Run the following to set up the virtual environment and install the dependencies:

$ virtualenv .
$ . ./bin/activate
$ pip install -r requirements.txt

Storm

This project uses a Storm topology to analyze tweets from the Twitter sample stream. The entry point is a Storm spout that uses twitter4j to access the stream with a username and password. Tweets are then passed to a Storm shell bolt, implemented in Python, that applies a regular expression to detect Christian Bible references. Finally, a bolt receives the tuple tagged with a Bible reference and stores it in Elasticsearch.

For more information, refer to the Storm concepts wiki. I also have a Habakkuk starter page that provides some background.
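
As a rough illustration of the matching step performed by the Python shell bolt, here is a minimal sketch. It is not the regular expression used in bible_verse_matching: the book list, the pattern, and the names BOOKS, REFERENCE_RE, and find_references are all assumptions made for this example. It only handles the simple "Book chapter:verse" form and omits the Storm wiring.

```python
import re

# Hypothetical, abbreviated book list for illustration only; a real matcher
# would cover every book and its common abbreviations.
BOOKS = ["Genesis", "Psalms", "Psalm", "Habakkuk", "Matthew", "John", "Romans"]

# Try longer names first so "Psalms" is preferred over "Psalm".
_BOOK_ALTERNATION = "|".join(
    sorted((re.escape(b) for b in BOOKS), key=len, reverse=True)
)

# Matches the simple form "<book> <chapter>:<verse>", case-insensitively.
REFERENCE_RE = re.compile(
    r"\b(?P<book>" + _BOOK_ALTERNATION + r")"
    r"\s+(?P<chapter>\d{1,3})\s*:\s*(?P<verse>\d{1,3})\b",
    re.IGNORECASE,
)


def find_references(text):
    """Return one dict per Bible reference found in the tweet text."""
    return [
        {
            "book": m.group("book").title(),
            "chapter": int(m.group("chapter")),
            "verse": int(m.group("verse")),
            "text": text,
        }
        for m in REFERENCE_RE.finditer(text)
    ]
```

Calling find_references("Reading habakkuk 2:14 and John 3:16 today") yields two records, one for Habakkuk 2:14 and one for John 3:16, each carrying the original tweet text. A tweet with no match yields an empty list, which a bolt could use to drop the tuple. Note that this sketch does not distinguish numbered books, so "1 John 3:16" would be reported as John.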

Elasticsearch

This project uses Elasticsearch for backend storage. See the Elasticsearch site for details.

Accumulo

I experimented with Apache Accumulo. The code has been disabled, but the bolt is still there if anyone wants to try it. It works fine, but I found that Elasticsearch worked better for this project.

Hadoop

Scripts in analysis/ depend on Cloudera Hadoop CDH3.

Sub-Directories

  • java - Storm Application
  • bible_verse_matching - Tools to build and test the Bible reference regular expressions, plus dictionary files for Pig and Mahout
  • elasticsearch - Index templates and tools to query Elasticsearch
  • accumulo - Table initialization scripts
  • config - Configuration files for setting up Storm with supervisord
  • analysis - Pig scripts for data analysis