ScreamingCrawl

An SEO crawler meant to act as a freeware alternative to ScreamingFrog. There is still a lot of work to be done.

usage: run.py [-h] [-t THREADS] [-a AGENT] [-p PROXY] [-o TIMEOUT] [-r ROBOTS]
              [-m MAX_URLS] [-d DATA_FORMAT]
              url

positional arguments:
  url                   url to start the crawl from

optional arguments:
  -h, --help            show this help message and exit
  -t THREADS, --threads THREADS
                        number of threads - scale with caution
  -a AGENT, --agent AGENT
                        user-agent
  -p PROXY, --proxy PROXY
                        proxy to use with crawler
  -o TIMEOUT, --timeout TIMEOUT
                        time to stop crawl after no new urls are found
  -r ROBOTS, --robots ROBOTS
                        whether you obey robots.txt rules
  -m MAX_URLS, --max_urls MAX_URLS
                        stop crawling after data collected from a list of urls
  -d DATA_FORMAT, --data_format DATA_FORMAT
                        data format, either csv or sql

TODO

  • Work out the most efficient way to write to SQLite when SQL output is set; it is currently too slow (see the sketch after this list).
  • Add support for MongoDB.
  • Package as a command-line executable.
  • Extend the SEO parser to capture redirect history and other useful information.
  • Add proper logging.
  • Parse arguments from the command line.
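
For the SQLite item above, a common fix is to buffer crawl results and insert them in batches inside a single transaction instead of committing per URL. A minimal sketch, assuming a hypothetical pages table and (url, status, title) rows rather than the project's actual schema:

  import sqlite3

  def write_batch(db_path, rows):
      """Insert a batch of (url, status, title) tuples in one transaction."""
      conn = sqlite3.connect(db_path)
      try:
          conn.execute(
              "CREATE TABLE IF NOT EXISTS pages (url TEXT, status INTEGER, title TEXT)"
          )
          # executemany inside a single transaction avoids a commit (and fsync)
          # per row, which is the usual cause of slow per-URL inserts.
          with conn:
              conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)
      finally:
          conn.close()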
