This repo is in Python 2.7.8. You will need to:

- Install Python 2.7.8 and add Python to your path (if installing with `apt-get`, `brew`, or an equivalent package manager on Linux and macOS systems, this should happen automatically).
- Install `pip`, the Python package manager.
- Optionally, install virtualenv by running `pip install virtualenv`.
To parse XML articles, you'll need two system packages, libxml2 and libxslt. On Ubuntu, install them with `sudo apt-get install libxml2 libxslt1-dev`.
Then, install the required Python libraries with `pip install -r requirements.txt`.
Windows note: the above command will only partially work and will error on libxml. You must download and install it manually.
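These packages are what let the crawler parse article XML. As an illustration only, here is a minimal sketch of that kind of parsing using the standard-library ElementTree API (lxml, which builds on libxml2/libxslt, exposes a compatible `lxml.etree` interface); the markup below is hypothetical, not an actual feed:

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup -- real feeds will differ.
raw = """
<article>
  <title>Example headline</title>
  <body>Example body text.</body>
</article>
"""

root = ET.fromstring(raw)
title = root.find("title").text
body = root.find("body").text
```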
- Install mongo
- Install genghisapp, which is like phpMyAdmin for MongoDB, with `gem install genghisapp`. genghisapp requires Ruby/RubyGems. You can install `ruby` by following this guide and install `gem` by downloading and installing from here.
Once everything is installed, you can run the crawler with `python main.py configs/simple-config.json`.
This will run the crawler with the simplest possible setup. It will crawl articles from the main CNN RSS feed and write them to a directory as JSON files.
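Conceptually, that simple setup boils down to something like the following sketch: parse the feed's items and write each one out as a JSON file. The feed contents and the output field names here are illustrative assumptions, not the crawler's actual code or schema:

```python
import json
import os
import xml.etree.ElementTree as ET

# Illustrative stand-in for a fetched RSS feed; the real crawler
# downloads this from CNN's feed URL.
feed_xml = """
<rss><channel>
  <item><title>Story one</title><link>http://example.com/1</link></item>
  <item><title>Story two</title><link>http://example.com/2</link></item>
</channel></rss>
"""

out_dir = "articles"
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

channel = ET.fromstring(feed_xml).find("channel")
for i, item in enumerate(channel.findall("item")):
    # Field names here are hypothetical, not the repo's actual schema.
    article = {
        "title": item.find("title").text,
        "url": item.find("link").text,
    }
    with open(os.path.join(out_dir, "%d.json" % i), "w") as f:
        json.dump(article, f)
```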
To configure different behavior, you can specify a different configuration file. There are several pre-built configuration files in the `configs/` directory. If none of them does what you want, consider making a new configuration. See `configs/configuration.md` for more details.
- Install Vagrant
- Start and provision the Vagrant environment with `vagrant up`.
- Enter the Vagrant environment with `vagrant ssh`.
The project directory is mapped to `/vagrant` in the virtual machine by default.