What is this?


This software crawls articles published on the front pages of various online news outlets. For every article, it extracts the title, category, content, links, and links to embedded media. The extracted data is stored in a plaintext database, as a series of JSON files.
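As a rough illustration of the "series of JSON files" idea, here is a minimal sketch that writes one article record to its own JSON file and reads it back. The field names, the `save_article` and `load_article` helpers, and the file name are all hypothetical and chosen for the example; they are not taken from the project's actual schema or code.

```python
import json
import os
import tempfile


def save_article(article, directory, filename):
    """Write one article record to its own JSON file and return the path."""
    path = os.path.join(directory, filename)
    with open(path, "w") as f:
        json.dump(article, f, indent=2, sort_keys=True)
    return path


def load_article(path):
    """Read an article record back from a JSON file."""
    with open(path) as f:
        return json.load(f)


# Hypothetical record: these keys only mirror the fields listed above
# (title, category, content, links, links to embedded media).
article = {
    "title": "Example headline",
    "category": ["News", "World"],
    "content": "Body text of the article.",
    "links": ["http://example.com/related-story"],
    "embedded_media_links": ["http://example.com/video/42"],
}

output_dir = tempfile.mkdtemp()
saved_path = save_article(article, output_dir, "article_0001.json")
restored = load_article(saved_path)
```

Keeping one record per file means the stored data can be inspected with any text editor and compared with ordinary diff tools.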

3rd party Dependencies

scrapy's HtmlXPathSelector <http://doc.scrapy.org/en/latest/topics/selectors.html#scrapy.selector.HtmlXPathSelector>: because any BeautifulSoup-based app is a half-assed implementation of XPath anyway.
BeautifulSoup. This project currently uses a mix of versions 3 and 4 of BeautifulSoup. It's not pretty, but porting the old code was not a priority. The two versions use different namespaces, so there is no confusion between them.
nose for unit testing.
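To give a feel for XPath-style extraction, here is a small sketch that pulls a title and the link targets out of a page. It deliberately does not use scrapy: it swaps in the limited XPath subset of the standard library's `xml.etree.ElementTree`, so it only works on well-formed markup, which real news pages often are not. The sample page and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (assumption: real pages need a
# forgiving HTML parser, which ElementTree is not).
PAGE = """<html>
  <body>
    <h1>Example headline</h1>
    <div>
      <p>First paragraph with a <a href="http://example.com/a">link</a>.</p>
      <p>Second paragraph with <a href="http://example.com/b">another link</a>.</p>
      <p>An <a>anchor without a target</a> is skipped.</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# ".//h1" selects the first h1 element anywhere below the root.
title = root.find(".//h1").text

# ".//a[@href]" selects every anchor that carries an href attribute.
links = [a.get("href") for a in root.findall(".//a[@href]")]
```

A full XPath engine supports far more than this subset (axes, functions, text nodes), which is the main reason to prefer a dedicated selector library for real crawling.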

License

This project is licensed under the MIT open-source license. See LICENSE.txt for details.

Notes

This project was tested with Python 2.6 and Python 2.7.
