
# economist-scrapy

Scrapy spider and PostgreSQL pipeline for The Economist.

A diligent spider for EcoArchive.

## nltk

The article summary is generated with nltk; you'll have to install the nltk package and its corpora data (brown, averaged_perceptron_tagger, and punkt) first.
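The frequency-scoring idea behind such extractive summaries can be sketched with the standard library alone. Note this is a hypothetical illustration, not the project's code: the project itself uses nltk's tokenizers and taggers rather than the naive regex splitting below.

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, in original order.

    Each sentence is scored by the total corpus frequency of its words,
    so sentences full of common topic words rank highest.
    """
    # Naive sentence split on terminal punctuation (nltk's punkt does this better).
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        return sum(freq[w] for w in re.findall(r'[a-z]+', sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Re-emit the chosen sentences in their original document order.
    return [s for s in sentences if s in top]
```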

## crawl and storage

PostgreSQL is used as storage, assuming you have a PostgreSQL service running on localhost at the default port. Set your own database name, username, and password in setting.py.
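As an illustration, the connection settings in collector/setting.py might look like the following. The key names and values here are hypothetical; use the ones defined in setting.py.example.

```python
# collector/setting.py -- hypothetical key names, mirror setting.py.example
POSTGRES_HOST = 'localhost'   # the README assumes a local server
POSTGRES_PORT = 5432          # PostgreSQL's default port
POSTGRES_DB = 'economist'     # your database name
POSTGRES_USER = 'scraper'     # your username
POSTGRES_PASSWORD = 'secret'  # your password
```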

Use crontab to schedule a crawl job each week.
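For example, a crontab entry for a weekly crawl might look like this (the path and schedule are placeholders, not part of the project):

```cron
# run the spider every Monday at 03:00
0 3 * * 1 cd /path/to/economist-scrapy && scrapy crawl eco
```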

## install

1. Install PostgreSQL first:

```shell
yum install postgresql-server postgresql-contrib
postgresql-setup initdb
```

2. Install the Python packages:

```shell
pip install -r requirement.txt
```

3. Install the nltk corpora:

```python
# in an ipython (or python) shell
import nltk
nltk.download('punkt')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
```

## crawl

```shell
mv collector/setting.py.example collector/setting.py
```

Change the db user and password in setting.py, then:

```shell
scrapy crawl eco
```
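The shape of the PostgreSQL pipeline that stores crawled items can be sketched as follows. The class name, setting keys, table, and columns are illustrative assumptions, not the project's actual pipeline; the psycopg2 import is deferred so the sketch loads even without the driver installed.

```python
class PostgresPipeline:
    """Sketch of a Scrapy item pipeline that writes articles to PostgreSQL."""

    def __init__(self, db_settings):
        self.db_settings = db_settings
        self.conn = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection parameters from setting.py (hypothetical key names).
        s = crawler.settings
        return cls({
            'dbname': s.get('POSTGRES_DB'),
            'user': s.get('POSTGRES_USER'),
            'password': s.get('POSTGRES_PASSWORD'),
            'host': s.get('POSTGRES_HOST', 'localhost'),
        })

    def open_spider(self, spider):
        import psycopg2  # deferred so the sketch imports without the driver
        self.conn = psycopg2.connect(**self.db_settings)

    def process_item(self, item, spider):
        # Insert one crawled article; table/columns are illustrative.
        with self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO articles (title, url, summary) VALUES (%s, %s, %s)",
                (item['title'], item['url'], item['summary']),
            )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        if self.conn:
            self.conn.close()
```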
