Skip to content

tim-moody/scraper-tools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

80 Commits
 
 
 
 
 
 

Repository files navigation

scraper-tools

This is a collection of tools I use for pulling content from various sites.

Status

Still under development

Deployment - Future

will put in PyPI

apt install python3-pip pip3 install CacheControl pip3 install youtube_dl

Components

  • generic/sp_lib.py - importable functions
  • generic/core.py - Class of common functions and variables
  • generic/crawl.py - Class for spidering a target site
  • generic/crunch.py - Class for analyzing crawl data
  • scripts/crawler.py - template for spidering site

BasicSpider Algorithm

  • Get start url from parameter or start domain
  • Verify that it returns html with no redirect
  • Add start url to queue
  • Loop until done or key pressed
    • Get next url from queue
    • Compute file name
    • If already downloaded and not force
      • Read html from file
    • Else
      • Requests.get page html for url (can only be html url)
      • Write html to file
    • If already in site_pages
      • add to page count
      • continue
    • For each link on page
      • If matches ignore list continue
      • Get link info with requests.head
      • Record link in site_urls
      • If html
        • If matches include and not matches exclude
          • Add to queue
    • Record page and children in site_pages
  • Write data structures to files

Links

A page is searched for 'a' and 'link' tags that have an 'href' property and 'audio', 'embed', 'iframe', 'img', 'input', 'script', 'source', 'track', and 'video' tags that have a 'src' property. For each of the resulting links, if the url does not match Ignore Links, requests.head is used to determine the content type of its target and its size.

Filtering

In order to be queued for parsing a url must return html and must not match Ignore Links, must match a Pattern to Include and not match a Pattern to Exclude.

Ignore Links

These are things like 'javascript:void(0)' which will be ignored without any request for attributes. The list is in IGNORE_LINKS, which can be extended.

Url Patterns to Include

These are a list of regular expressions in HTML_INCL_PATTERNS. Any url that returns html is matched against these patterns and if there is a match it is queued for parsing.

Url Patterns to Exclude

These are a list of regular expressions in HTML_EXCL_PATTERNS. Any url that returns html is matched against these patterns and if there is a match it is not queued for parsing.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages