This was a Saturday side project. The goal was to index every article in the current issues of Hacker Monthly and make them searchable. I wanted this because, more than once at work, I remembered some cool utility or hack I had read about but couldn't find which issue of Hacker Monthly contained the wisdom I needed.
To use this project you will need:
- Python 2.7+ (virtual environments optional)
- Elasticsearch 1.1.1 - https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.1.tar.gz
Follow the steps below to get your own index running:
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.1.tar.gz
$ tar zxvf elasticsearch-1.1.1.tar.gz
$ cd elasticsearch-1.1.1/bin
$ ./elasticsearch
Whoa. Super easy. Elasticsearch is awesome.
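If you want to confirm the node is answering before moving on, a small check like the one below may help. It is only a sketch: it assumes Elasticsearch is listening on its default HTTP port, 9200, on the local machine, and the function name `elasticsearch_is_up` is made up for illustration (it is not part of this project).

```python
import json

try:  # Python 3
    from urllib.request import urlopen
except ImportError:  # Python 2.7
    from urllib2 import urlopen


def elasticsearch_is_up(url="http://127.0.0.1:9200", timeout=2.0):
    """Return True if an HTTP GET on `url` answers with a JSON object."""
    try:
        response = urlopen(url, timeout=timeout)
        try:
            body = response.read().decode("utf-8")
        finally:
            response.close()
        # The Elasticsearch root endpoint replies with a JSON document
        return isinstance(json.loads(body), dict)
    except (IOError, ValueError):
        # IOError: refused connection, timeout, or HTTP error status
        # ValueError: the reply was not valid JSON
        return False


if __name__ == "__main__":
    print("Elasticsearch reachable: %s" % elasticsearch_is_up())
```

Run it from a second terminal while `./elasticsearch` is still running in the first.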
$ git clone git@github.com:glenbot/hacker_monthly.git
You need to copy all of your Hacker Monthly epub files into the data directory of this project. You can get the epub files from the Hacker Monthly subscriber area: https://subscriber.hackermonthly.com/
$ cd hacker_monthly
$ cp <path_to_your_epubs>/*.epub data
The exact command may vary. The end result is that your epub files need to be in the data directory.
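Before indexing, it can be worth checking that the copied files are intact. The sketch below relies on the fact that an EPUB is a ZIP container; it only tests whether each file looks like a ZIP archive, not whether its contents are valid. The helper name `find_epubs` is hypothetical and not part of this project.

```python
import os
import zipfile


def find_epubs(data_dir="data"):
    """Split the .epub files in data_dir into (readable, unreadable) name lists."""
    readable, unreadable = [], []
    for name in sorted(os.listdir(data_dir)):
        if not name.lower().endswith(".epub"):
            continue  # ignore anything that is not an epub
        path = os.path.join(data_dir, name)
        # A file that is not a ZIP archive (for example a cut-off
        # download) will typically fail this check
        if zipfile.is_zipfile(path):
            readable.append(name)
        else:
            unreadable.append(name)
    return readable, unreadable


if __name__ == "__main__" and os.path.isdir("data"):
    ok, bad = find_epubs("data")
    print("%d readable, %d unreadable" % (len(ok), len(bad)))
    for name in bad:
        print("  check this file: %s" % name)
```

Run it from the hacker_monthly directory so that the relative path `data` resolves.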
Make sure you have Elasticsearch running in a terminal somewhere.
With a virtual environment (recommended). This assumes you have virtualenv and virtualenvwrapper installed:
$ git clone git@github.com:glenbot/hacker_monthly.git
$ mkvirtualenv hm
$ pip install -r hacker_monthly/requirements.txt
$ cd hacker_monthly
$ python manage.py create_index
$ python manage.py index_data
$ python manage.py runserver
Without a virtual environment (you may need sudo):
$ git clone git@github.com:glenbot/hacker_monthly.git
$ pip install -r hacker_monthly/requirements.txt
$ cd hacker_monthly
$ python manage.py create_index
$ python manage.py index_data
$ python manage.py runserver
Visit http://127.0.0.1:5000 in your browser and search away.
If you add or remove Hacker Monthly files from the data directory, you can always run:
$ python manage.py reindex_data
To search the index from the command line:
$ python manage.py query_index "Some search string"
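You can also talk to Elasticsearch directly over its REST API. The following is a rough sketch using the URI search endpoint (`/<index>/_search?q=...`). The index name `hacker_monthly` is an assumption for illustration; use whatever name `create_index` actually creates. The helpers `build_search_url` and `search` are hypothetical and do not come from this project.

```python
import json

try:  # Python 3
    from urllib.parse import quote, urlencode
    from urllib.request import urlopen
except ImportError:  # Python 2.7
    from urllib import quote, urlencode
    from urllib2 import urlopen


def build_search_url(query, index="hacker_monthly",
                     host="http://127.0.0.1:9200", size=10):
    """Build a URI-search URL. `index` is an assumed name, not the project's."""
    params = urlencode([("q", query), ("size", size)])
    return "%s/%s/_search?%s" % (host.rstrip("/"), quote(index, safe=""), params)


def search(query, **kwargs):
    """Run the query and return the list of hit dicts from the response."""
    response = urlopen(build_search_url(query, **kwargs), timeout=5)
    try:
        payload = json.loads(response.read().decode("utf-8"))
    finally:
        response.close()
    # Each hit carries the stored document under "_source"
    return payload["hits"]["hits"]
```

With Elasticsearch running and the data indexed, `search("Some search string")` should return the raw hits, whose field names depend on how the project maps its documents.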
Some of the older Hacker Monthly issues from 2010 do not parse correctly.
Also, the TOC in the epub files doesn't list the articles in the epub, which makes them harder to parse. The parser does a decent job of it though.
Please use the GitHub issue tracker.
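If you want to investigate an issue that parses badly, one way to see what is actually inside an epub without relying on its TOC is to list the HTML-like files in the archive. This is only an exploratory sketch and not how this project's parser works; `list_content_documents` is a hypothetical helper, and the extension list is an assumption about how the issues are packaged.

```python
import zipfile

# Assumed extensions for article content inside the archive
CONTENT_EXTENSIONS = (".xhtml", ".html", ".htm")


def list_content_documents(epub_path):
    """Return the archive paths of HTML-like files in an EPUB, in archive order."""
    with zipfile.ZipFile(epub_path) as archive:
        return [
            name for name in archive.namelist()
            if name.lower().endswith(CONTENT_EXTENSIONS)
        ]
```

Comparing this list against what ends up in the index may help narrow down which files the parser skips.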
Pull requests always welcome :)