Crawling and scraping to collect Chinese sentences. This is a preliminary prototype, directed at a single domain.
V. 0.31, 20130407.
- Install the bs4 module from http://www.crummy.com/software/BeautifulSoup/.
- Initialize database:
  sqlite3 crawl_worldjournal.db < create_table.sqlscript
- Create directory for downloads:
  mkdir CRAWLED_PAGES
- Run downloader.py and link_collector.py alternately.
  downloader.py: downloads pages and stores them in CRAWLED_PAGES/, compressed and using the MD5 hash of the content as the core of the file name; the candidate URLs are taken from the database crawl_worldjournal.db. (A sketch of this storage scheme appears after this list.)
  link_collector.py: collects URLs from downloaded pages and stores them in the database crawl_worldjournal.db. (A sketch of the link extraction appears after this list.)
- These two programs can usefully be run in alternation until link_collector.py reports no links successfully added (its output is then a string of all "|" and no "."). Note that downloader.py will always report at least one successful download, namely the top-level page of the site. (A sketch of such a loop appears below.)
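
The storage scheme used by downloader.py can be pictured with a minimal sketch like the following (not the program itself): the page body is hashed with MD5, compressed, and written to CRAWLED_PAGES/ with the hash as the core of the file name. The .html.gz suffix, the fetching call, and the function name are assumptions for illustration; the real script also takes its candidate URLs from crawl_worldjournal.db, which is omitted here.

    import gzip
    import hashlib
    import os
    from urllib.request import urlopen  # Python 3 name; an older script might use urllib2

    CRAWL_DIR = "CRAWLED_PAGES"

    def store_page(url):
        """Fetch one URL and store the compressed body under its MD5 hash (sketch only)."""
        content = urlopen(url).read()                 # raw bytes of the page
        digest = hashlib.md5(content).hexdigest()     # hash of the content, not of the URL
        path = os.path.join(CRAWL_DIR, digest + ".html.gz")  # suffix is an assumption
        if not os.path.exists(path):                  # identical content is stored only once
            with gzip.open(path, "wb") as outfile:
                outfile.write(content)
        return digest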
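
A comparable sketch of the link-collection step: open a stored page, parse it with bs4, and insert any hrefs not already known into the database. The table and column names (urls, url) and the UNIQUE constraint are invented for illustration; the real schema is whatever create_table.sqlscript defines.

    import gzip
    import sqlite3
    from bs4 import BeautifulSoup

    def collect_links(page_path, db_path="crawl_worldjournal.db"):
        """Record every new href found in one stored page; return the count added (sketch only)."""
        with gzip.open(page_path, "rb") as infile:
            soup = BeautifulSoup(infile.read(), "html.parser")
        connection = sqlite3.connect(db_path)
        added = 0
        for anchor in soup.find_all("a", href=True):
            try:
                # Hypothetical schema: a urls table whose url column is UNIQUE.
                connection.execute("INSERT INTO urls (url) VALUES (?)", (anchor["href"],))
                added += 1
            except sqlite3.IntegrityError:
                pass  # URL already recorded; skip it
        connection.commit()
        connection.close()
        return added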
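
The alternation itself could be scripted along these lines; a rough sketch that assumes the progress string ("|" and ".") is the only thing link_collector.py prints, with "." marking a successfully added link as described above.

    import subprocess

    def crawl_until_done():
        """Alternate the two scripts until a pass adds no new links (sketch only)."""
        while True:
            subprocess.call(["python", "downloader.py"])
            output = subprocess.check_output(["python", "link_collector.py"])
            if b"." not in output:   # no link was successfully added this pass
                break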
- Functions are now highly modularized and specialized.
- Reduced number of state attributes.
- Eliminated the old verbose output flag and implemented logging.
- Fixed an error that led to an endless loop when there were no new URLs to download.
- Added a second class, parallel to Downloader but always instantiated within it, that deals with links and nothing else. (A skeleton of this arrangement appears below.)
- Begin working with pytest. (A sample test sketch appears below.)
- Third and fourth phases of this project: collect the Chinese content and store it in a database; then extract distinct sentences from that content and store them in a separate database, which will be the foundation of linguistic study.
- Use an extra table for the start page of the site, so as to keep it separate from all the other pages. But perhaps this is not a good idea, because we would like to be able to look up the URL and other data for each hash in a single table. Adding a second table would complicate that look-up; perhaps a third table, indexing all hashes against the other tables, would become necessary.
- Attempt to generalize crawler to more sites.
- The crawler should eventually use threads or independent processes (launched from the shell?) to conduct continuous crawling; manage this with a queue. (A threading sketch appears below.)
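
The division of labour described in the note about the second class might look roughly like this skeleton. Only the composition (the link-handling object is always created inside the Downloader) is taken from the note; the helper's name and its method are guesses.

    class LinkCollector:
        """Hypothetical name: deals with links and nothing else."""

        def __init__(self, db_path):
            self.db_path = db_path

        def collect_from(self, page_content):
            raise NotImplementedError  # parse the page and record its links


    class Downloader:
        """Downloads pages; always owns its own link-handling helper."""

        def __init__(self, db_path="crawl_worldjournal.db"):
            self.db_path = db_path
            self.link_collector = LinkCollector(db_path)  # instantiated within, never passed in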
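
A first pytest test might look like the following; the extract_hrefs helper is invented here purely so the example is self-contained, and the real tests would exercise the Downloader and link-handling classes instead.

    # test_links.py -- run with: py.test test_links.py
    from bs4 import BeautifulSoup


    def extract_hrefs(html):
        """Hypothetical helper: return every href found in an HTML string."""
        soup = BeautifulSoup(html, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)]


    def test_extract_hrefs_ignores_anchors_without_href():
        html = '<p><a href="/news/1.html">one</a> <a name="x">no href</a></p>'
        assert extract_hrefs(html) == ["/news/1.html"]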
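
For the continuous-crawling idea, here is a sketch using the standard library's threading and queue modules (Python 3 names; Python 2 would import Queue). What each worker actually does with a URL, and the seed URL itself, are assumptions; whether threads or separate processes are the better fit is still the open question stated above.

    import queue
    import threading

    NUM_WORKERS = 4
    url_queue = queue.Queue()

    def worker():
        """Take URLs off the shared queue until a sentinel arrives."""
        while True:
            url = url_queue.get()
            if url is None:              # sentinel: shut this worker down
                url_queue.task_done()
                break
            # download the page, collect its links, and put any new URLs
            # back on the queue (details omitted in this sketch)
            url_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for thread in threads:
        thread.start()

    url_queue.put("http://www.worldjournal.com/")  # assumed start page of the site
    url_queue.join()                               # block until every queued URL is handled
    for _ in threads:
        url_queue.put(None)                        # one sentinel per worker
    for thread in threads:
        thread.join()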
V. 0.3, 20130405.
V. 0.2, 20130402.
V. 0.1, 20130325.
[end]