
# economist-scrapy

Scrapy spider and PostgreSQL pipeline for The Economist.

A diligent spider for EcoArchive.

## nltk

The article summary is generated with nltk; you'll have to install the nltk package and its corpora data (brown, averaged_perceptron_tagger, and punkt) first.
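The frequency-scoring idea behind such extractive summaries can be sketched with the standard library alone. Note this is a hypothetical illustration, not the project's code: the project itself uses nltk's tokenizers and taggers rather than the naive regex splitting below.

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, in original order.

    Each sentence is scored by the total corpus frequency of its words,
    so sentences full of common topic words rank highest.
    """
    # Naive sentence split on terminal punctuation (nltk's punkt does this better).
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        return sum(freq[w] for w in re.findall(r'[a-z]+', sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Re-emit the chosen sentences in their original document order.
    return [s for s in sentences if s in top]
```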

## crawl and storage

PostgreSQL is used as storage, assuming you have a PostgreSQL service running on localhost at the default port. Set your own database name, username, and password in setting.py.
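As an illustration, the connection settings in collector/setting.py might look like the following. The key names and values here are hypothetical; use the ones defined in setting.py.example.

```python
# collector/setting.py -- hypothetical key names, mirror setting.py.example
POSTGRES_HOST = 'localhost'   # the README assumes a local server
POSTGRES_PORT = 5432          # PostgreSQL's default port
POSTGRES_DB = 'economist'     # your database name
POSTGRES_USER = 'scraper'     # your username
POSTGRES_PASSWORD = 'secret'  # your password
```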

Use crontab to schedule a crawl job each week.
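For example, a crontab entry for a weekly crawl might look like this (the path and schedule are placeholders, not part of the project):

```cron
# run the spider every Monday at 03:00
0 3 * * 1 cd /path/to/economist-scrapy && scrapy crawl eco
```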

## install

1. Install PostgreSQL first:

```shell
yum install postgresql-server postgresql-contrib
postgresql-setup initdb
```

2. Install the Python packages:

```shell
pip install -r requirement.txt
```

3. Install the nltk corpora:

```python
# in an ipython (or python) shell
import nltk
nltk.download('punkt')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
```

## crawl

```shell
mv collector/setting.py.example collector/setting.py
```

Change the db user and password in setting.py, then:

```shell
scrapy crawl eco
```
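The shape of the PostgreSQL pipeline that stores crawled items can be sketched as follows. The class name, setting keys, table, and columns are illustrative assumptions, not the project's actual pipeline; the psycopg2 import is deferred so the sketch loads even without the driver installed.

```python
class PostgresPipeline:
    """Sketch of a Scrapy item pipeline that writes articles to PostgreSQL."""

    def __init__(self, db_settings):
        self.db_settings = db_settings
        self.conn = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection parameters from setting.py (hypothetical key names).
        s = crawler.settings
        return cls({
            'dbname': s.get('POSTGRES_DB'),
            'user': s.get('POSTGRES_USER'),
            'password': s.get('POSTGRES_PASSWORD'),
            'host': s.get('POSTGRES_HOST', 'localhost'),
        })

    def open_spider(self, spider):
        import psycopg2  # deferred so the sketch imports without the driver
        self.conn = psycopg2.connect(**self.db_settings)

    def process_item(self, item, spider):
        # Insert one crawled article; table/columns are illustrative.
        with self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO articles (title, url, summary) VALUES (%s, %s, %s)",
                (item['title'], item['url'], item['summary']),
            )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        if self.conn:
            self.conn.close()
```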
