slithersentence

Crawling and scraping to collect Chinese sentences. This is a preliminary prototype, directed at a single domain.

Version

V. 0.31, 20130407.

To use

  1. Install the bs4 module from http://www.crummy.com/software/BeautifulSoup/.

  2. Initialize the database:

    sqlite3 crawl_worldjournal.db < create_table.sqlscript

  3. Create a directory for downloads:

    mkdir CRAWLED_PAGES

  4. Run downloader.py and link_collector.py alternately.
  • downloader.py: downloads pages and stores them in CRAWLED_PAGES/, compressed and using the MD5 hash of the content as the core of the file name (see the sketch after this list); the candidate URLs are taken from the database crawl_worldjournal.db.
  • link_collector.py: collects URLs from downloaded pages and stores them in the database crawl_worldjournal.db.
  5. These two programs can usefully be run in alternation until link_collector.py reports no links successfully added (its output is a string of all | and no .). Note that downloader.py will always report at least one successful download: the top-level page of the site.
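
As an illustration of the storage convention described in step 4, the sketch below shows one way a page might be written to CRAWLED_PAGES/ gzip-compressed under the MD5 hash of its content. It is only a sketch: the function name save_page and the .gz suffix are assumptions for illustration and are not taken from downloader.py.

    import gzip
    import hashlib
    import os

    def save_page(content, directory="CRAWLED_PAGES"):
        # Illustrative sketch only; naming details are not taken from downloader.py.
        # content is the raw page as bytes.
        digest = hashlib.md5(content).hexdigest()       # MD5 hash of the content
        path = os.path.join(directory, digest + ".gz")  # hash is the core of the file name
        with gzip.open(path, "wb") as f:                # store gzip-compressed
            f.write(content)
        return path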

New as of this version

  1. Functions are now highly modularized and specialized.
  2. Reduced number of state attributes.
  3. Eliminated the old verbose output flag and implemented logging.
  4. Fixed an error that led to an endless loop when there were no new URLs to download.

To do next

  1. Second class, parallel to Downloader but always instantiated within it, for dealing with links and nothing else.
  2. Begin working with pytest.
  3. Third and fourth phases of this project: collect the Chinese content and store it in a database; then extract distinct sentences from the content and store them in a separate database, which will be the foundation of linguistic study.
  4. Use an extra table for the start page of the site, so as to keep it separate from all the other pages. But perhaps this is not a good idea, because we would like to be able to look up the URL and other data for each hash in a single table. Adding a second table would complicate that look-up; perhaps a third table, indexing all hashes against the other tables, would become necessary.
  5. Attempt to generalize the crawler to more sites.
  6. The crawler should eventually use threads or independent processes (from the shell?) to conduct continuous crawling, managed with a queue; a rough sketch of the idea follows this list.
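
Item 6 above is only a plan, but as a rough idea of what a queue-managed, threaded crawl loop might look like, consider the Python 3 outline below. fetch and extract_links are hypothetical stand-ins for the download and link-collection code; nothing here reflects the current scripts.

    import queue
    import threading

    def crawl_worker(url_queue, seen, lock, fetch, extract_links):
        # Hypothetical worker: fetch() and extract_links() stand in for the
        # real download and link-collection code.
        while True:
            url = url_queue.get()
            try:
                page = fetch(url)
                for link in extract_links(page):
                    with lock:
                        if link not in seen:   # enqueue only unseen links
                            seen.add(link)
                            url_queue.put(link)
            finally:
                url_queue.task_done()

    def run_crawler(start_url, fetch, extract_links, num_workers=4):
        url_queue = queue.Queue()
        seen = {start_url}
        lock = threading.Lock()
        url_queue.put(start_url)
        for _ in range(num_workers):
            threading.Thread(target=crawl_worker,
                             args=(url_queue, seen, lock, fetch, extract_links),
                             daemon=True).start()
        url_queue.join()  # returns once every enqueued URL has been processed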

Previous versions

0.3, 20130405.

0.2, 20130402.

0.1, 20130325.

[end]
