mtaziz/jobboardscraper

Job Board Scraper

Job Board Scraper collects, cleans, organizes, and indexes English teaching positions from an existing online job board once a day.

The code scrapes the job board with Scrapy and loads the results into a Django website backed by a PostgreSQL database and an Elasticsearch search index. The website is hosted on Heroku.

Install

Prerequisites: Python 3, SQLite, Redis, pip, virtualenv, virtualenvwrapper, Git.

$ mkvirtualenv jobboardscraper -p python3
$ git clone git@github.com:richardcornish/jobboardscraper.git
$ cd jobboardscraper/
$ pip install -r requirements.txt
$ cd jobboardscraper/
$ python manage.py migrate
$ python manage.py loaddata jobboardscraper/fixtures/*
$ python manage.py createsuperuser
$ python manage.py runserver

Open http://127.0.0.1:8000 in a browser. Stop the server with Ctrl+C.

Setting a virtualenv default directory is usually a good idea:

$ setvirtualenvproject $WORKON_HOME/jobboardscraper/ ~/Sites/jobboardscraper/jobboardscraper/
$ cdproject

Scrape

From the Django project directory, run the spider to scrape the job board:

$ cd scraper/
$ scrapy crawl eslcafe

Search

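Elasticsearch is required to build and update the search index. Assuming Homebrew is installed, install and start Elasticsearch, then build the initial index (elasticsearch runs in the foreground, so run rebuild_index from a second terminal):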

$ brew tap caskroom/cask
$ brew cask install java
$ brew install elasticsearch
$ elasticsearch
$ python manage.py rebuild_index

Future indexing:

$ python manage.py update_index
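
The scrape and index commands above can be chained in one script. The following is a minimal sketch and not part of the repository: the names `build_refresh_commands` and `refresh` are hypothetical, and it assumes `project_dir` is the Django project directory (the one containing manage.py and scraper/) and that Elasticsearch is already running.

```python
import subprocess
import sys
from pathlib import Path


def build_refresh_commands(project_dir):
    """Return (working_directory, argv) pairs for one scrape-and-index run."""
    project_dir = Path(project_dir)
    return [
        # Equivalent to: cd scraper/ && scrapy crawl eslcafe
        (project_dir / "scraper", ["scrapy", "crawl", "eslcafe"]),
        # Equivalent to: python manage.py update_index
        (project_dir, [sys.executable, "manage.py", "update_index"]),
    ]


def refresh(project_dir):
    """Run each step in order and stop at the first failing command."""
    for cwd, argv in build_refresh_commands(project_dir):
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `refresh(".")` from the Django project directory would run both steps in sequence. Using `check=True` means the index is not updated if the crawl fails.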

Deploy

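Deploying to Heroku requires the Heroku Toolbelt (since renamed the Heroku CLI).

Heroku add-ons I installed: Heroku Postgres, Heroku Redis, and SearchBox Elasticsearch (created by the addons:create commands below).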

Initial deploy:

$ heroku login
$ heroku create
$ heroku config:set SECRET_KEY='...' # replace with your own
$ heroku config:set DEBUG=''
$ heroku addons:create heroku-postgresql:hobby-dev
$ heroku addons:create heroku-redis:hobby-dev
$ heroku addons:create searchbox:starter
$ git push heroku master
$ heroku run python jobboardscraper/manage.py migrate
$ heroku run python jobboardscraper/manage.py loaddata jobboardscraper/jobboardscraper/fixtures/*
$ heroku run python jobboardscraper/manage.py createsuperuser
$ heroku open
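
The `DEBUG=''` config var above presumably relies on an empty string being falsy in Python. The sketch below shows one way a settings module could read such variables; the helper `env_flag` is hypothetical and the project's actual settings code may differ.

```python
import os


def env_flag(name, environ=None):
    """Treat an unset or empty variable as False and anything else as True."""
    environ = os.environ if environ is None else environ
    return bool(environ.get(name, ""))


# In a Django settings module this might be used as (assumption):
# DEBUG = env_flag("DEBUG")
# SECRET_KEY = os.environ["SECRET_KEY"]
```

Under this scheme, any non-empty value, including the string 'False', turns debugging on, which is why the variable is set to an empty string rather than to 'False'.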

Future deploys:

$ git push heroku master

After the initial deploy, you can scrape the website and build the search index on Heroku:

$ heroku run '(cd jobboardscraper/scraper/ && scrapy crawl eslcafe)'
$ heroku run python jobboardscraper/manage.py rebuild_index

Future scraping and indexing are handled by daily Celery tasks with a Redis broker.
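
A daily schedule of this kind could be configured roughly as follows. This is a sketch under stated assumptions, not the project's actual configuration: the task paths are hypothetical, and the setting names assume Celery 4 or later loaded from Django settings with the `CELERY` namespace. `REDIS_URL` is the variable that the Heroku Redis add-on sets.

```python
import os
from datetime import timedelta

# Heroku Redis exposes its connection string as REDIS_URL;
# fall back to a local Redis server for development.
CELERY_BROKER_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")

# Celery beat accepts a timedelta as the interval between runs.
CELERY_BEAT_SCHEDULE = {
    "scrape-daily": {
        "task": "jobboardscraper.tasks.scrape",  # hypothetical task path
        "schedule": timedelta(days=1),
    },
    "update-index-daily": {
        "task": "jobboardscraper.tasks.update_index",  # hypothetical task path
        "schedule": timedelta(days=1),
    },
}
```

Two independent entries do not guarantee that indexing runs after scraping finishes; if the order matters, a single task that performs both steps in sequence is the simpler design.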
