Awesome Web-crawling Robotz
After you clone the repository, enter the directory and perform the following setup.
- Create a Python virtual environment (optional, but recommended):
  $ python -m virtualenv gpython
- Activate the virtual environment (repeat this in every terminal where you want to run GabyBots):
  $ source ./gpython/bin/activate
- Install the required Python packages:
  $ pip install -r requirements.txt
- Create your database:
  $ ./manage.py syncdb
  $ ./manage.py migrate
- Add the default scrapers (you can load 'minimal' instead to create a database with no web sources):
  $ ./manage.py loaddata starter
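For reference, a Django fixture such as `starter` is just a serialized list of model rows. A hypothetical single-entry fixture is sketched below; the app, model, field names, and URL are invented for illustration and are not this project's actual schema:

```json
[
  {
    "model": "scrapers.source",
    "pk": 1,
    "fields": {
      "name": "Google News - World",
      "feed_url": "https://example.com/rss/world"
    }
  }
]
```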
You can now run the starter spider, which scrapes the Google News - World RSS feed. First, change into the gbots directory:
$ cd gbots/
Then run Scrapy:
$ scrapy crawl google-news -a id=1
To also save the scraped articles to the database, pass do_action=yes:
$ scrapy crawl google-news -a id=1 -a do_action=yes
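Scrapy passes each `-a name=value` pair to the spider's `__init__` as a string keyword argument. The minimal sketch below shows how a spider like google-news might read `id` and `do_action`; the class body and attribute names are illustrative assumptions, not the project's actual spider, and scrapy itself is deliberately not imported so the snippet runs standalone:

```python
class GoogleNewsSpider:
    """Illustrative stand-in for the project's google-news spider."""

    name = "google-news"

    def __init__(self, id=None, do_action=None, **kwargs):
        # All -a values arrive as strings: `-a id=1` yields id="1".
        self.scraper_id = int(id) if id is not None else None
        # Only write scraped articles to the database when
        # `-a do_action=yes` was passed on the command line.
        self.do_action = (do_action == "yes")


# Equivalent of: scrapy crawl google-news -a id=1 -a do_action=yes
spider = GoogleNewsSpider(id="1", do_action="yes")
print(spider.scraper_id, spider.do_action)  # 1 True
```

Any extra `-a` arguments would arrive the same way, so a dry run (omitting `do_action`) simply leaves the flag False.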