scraper-tools

This is a collection of tools I use for pulling content from various sites.

Status

Still under development

Deployment - Future

Will be published on PyPI.

    apt install python3-pip
    pip3 install CacheControl
    pip3 install youtube_dl

Components

generic/sp_lib.py - importable functions
generic/core.py - Class of common functions and variables
generic/crawl.py - Class for spidering a target site
generic/crunch.py - Class for analyzing crawl data
scripts/crawler.py - template for spidering a site

BasicSpider Algorithm

Get start url from parameter or start domain
Verify that it returns html with no redirect
Add start url to queue
Loop until done or key pressed
- Get next url from queue
- Compute file name
- If already downloaded and not force
  - Read html from file
- Else
  - Get page html for url with requests.get (url must return html)
  - Write html to file
- If already in site_pages
  - Add to page count
  - continue
- For each link on page
  - If matches ignore list continue
  - Get link info with requests.head
  - Record link in site_urls
  - If html
    - If matches include and not matches exclude
      - Add to queue
- Record page and children in site_pages
Write data structures to files
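
The loop above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the code in generic/crawl.py: the function name crawl and all of its hook parameters (fetch_html, get_links, head, should_queue, is_ignored) are hypothetical. File caching is assumed to live inside fetch_html, the key-press check is left out, and max_pages is an invented stand-in for the "until done" condition. Writing the data structures to files is also omitted.

```python
from collections import deque


def crawl(start_url, fetch_html, get_links, head, should_queue,
          is_ignored=lambda url: False, max_pages=100):
    """Queue-driven crawl that follows the steps listed above.

    Every callable is a hypothetical hook, not a name from this repo:
      fetch_html(url)   -> html text (from a cached file or requests.get)
      get_links(html)   -> iterable of urls found on the page
      head(url)         -> dict such as {"content-type": "text/html"}
      should_queue(url) -> True if url matches include and not exclude
      is_ignored(url)   -> True if url matches the ignore list
    """
    queue = deque([start_url])
    site_pages = {}  # url -> {"count": times seen, "children": [urls]}
    site_urls = {}   # url -> info returned by head()

    while queue and len(site_pages) < max_pages:
        url = queue.popleft()
        html = fetch_html(url)

        if url in site_pages:
            # Page was parsed before: only add to its count.
            site_pages[url]["count"] += 1
            continue

        children = []
        for link in get_links(html):
            if is_ignored(link):
                continue
            # Assumption: reuse earlier head() results instead of asking again.
            if link not in site_urls:
                site_urls[link] = head(link)
            children.append(link)
            content_type = site_urls[link].get("content-type", "")
            if content_type.startswith("text/html") and should_queue(link):
                queue.append(link)

        site_pages[url] = {"count": 1, "children": children}

    return site_pages, site_urls
```

Passing the network and parsing steps in as functions keeps the loop itself easy to exercise against a small in-memory site before pointing it at a real one.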

Links

Each page is searched for 'a' and 'link' tags that have an 'href' attribute, and for 'audio', 'embed', 'iframe', 'img', 'input', 'script', 'source', 'track', and 'video' tags that have a 'src' attribute. For each resulting link whose url does not match Ignore Links, requests.head is used to determine the content type and size of its target.
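
As an illustration of the tag and attribute rules above, here is a small sketch using only the Python standard library. The names HREF_TAGS, SRC_TAGS, LinkCollector and extract_links are hypothetical, and the repo may well use a different parser. Resolving relative links against a base url is also an assumption. The requests.head lookup is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags whose link lives in 'href' and tags whose link lives in 'src'.
HREF_TAGS = {"a", "link"}
SRC_TAGS = {"audio", "embed", "iframe", "img", "input",
            "script", "source", "track", "video"}


class LinkCollector(HTMLParser):
    """Collects link targets from the tags listed above."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in HREF_TAGS:
            value = attr_map.get("href")
        elif tag in SRC_TAGS:
            value = attr_map.get("src")
        else:
            value = None
        if value:
            # Assumption: relative links are resolved against the page url.
            self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the links found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A call such as extract_links(page_html, page_url) returns a plain list, so the ignore check and the requests.head step can be applied to each entry afterwards.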

Filtering

To be queued for parsing, a url must return html, must not match Ignore Links, must match a Pattern to Include, and must not match a Pattern to Exclude.

Ignore Links

These are links such as 'javascript:void(0)', which are skipped without any request for their attributes. The list is in IGNORE_LINKS, which can be extended.

Url Patterns to Include

This is a list of regular expressions in HTML_INCL_PATTERNS. Any url that returns html is matched against these patterns; if any pattern matches, the url is queued for parsing, provided it does not also match a Pattern to Exclude.

Url Patterns to Exclude

This is a list of regular expressions in HTML_EXCL_PATTERNS. Any url that returns html is matched against these patterns; if any pattern matches, the url is not queued for parsing.
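
The three lists described above might combine along these lines. The list names come from this README, but everything else is an assumption made for illustration: the example values (apart from 'javascript:void(0)'), the helper names is_ignored and should_queue, prefix matching for ignored links, and the use of re.search for the patterns.

```python
import re

# Example values only; the real lists live in the repo's code.
IGNORE_LINKS = ["javascript:void(0)", "#", "mailto:"]
HTML_INCL_PATTERNS = [r"^https://example\.com/docs/"]
HTML_EXCL_PATTERNS = [r"/print/", r"\?share="]


def is_ignored(url):
    """Assumption: a link is ignored if it starts with an IGNORE_LINKS entry."""
    return any(url.startswith(prefix) for prefix in IGNORE_LINKS)


def should_queue(url, content_type):
    """Return True if url passes all four conditions for being parsed."""
    if is_ignored(url):
        return False
    if not content_type.startswith("text/html"):
        return False
    if not any(re.search(pattern, url) for pattern in HTML_INCL_PATTERNS):
        return False
    # Exclude patterns win over include patterns.
    return not any(re.search(pattern, url) for pattern in HTML_EXCL_PATTERNS)
```

Under these assumptions, extending the ignore list is a single append, for example IGNORE_LINKS.append("tel:").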
