
Basic Python-based recursive web scraper

A simple Python utility that scrapes an HTTP or FTP location and recursively syncs remote files and directories.
For HTML pages, it traverses subdirectory hyperlinks. The server's modified time is checked with a HEAD request, and only new files are downloaded.
Local modification times are set to match the remote system, and downloads are spread across multiple threads to speed up the process.
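
The check-then-download flow described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in this repository: the function names (`parse_http_date`, `remote_mtime`, `needs_download`, `sync_file`, `sync_all`) are hypothetical, it covers HTTP only (no FTP and no link traversal), and it assumes the server sends a `Last-Modified` header.

```python
import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Convert an HTTP date header (e.g. Last-Modified) to a Unix timestamp."""
    return parsedate_to_datetime(value).timestamp()


def remote_mtime(url):
    """Send a HEAD request and return the server's modified time, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        header = response.headers.get("Last-Modified")
    return parse_http_date(header) if header else None


def needs_download(local_path, remote_ts):
    """Return True if the local copy is missing or older than the remote file."""
    if not os.path.exists(local_path):
        return True
    if remote_ts is None:
        # Assumption: with no timestamp from the server, fetch to be safe.
        return True
    return os.path.getmtime(local_path) < remote_ts


def sync_file(url, local_path):
    """Download url to local_path if needed; return True if a download happened."""
    remote_ts = remote_mtime(url)
    if not needs_download(local_path, remote_ts):
        return False
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)
    if remote_ts is not None:
        # Match the local modified time to the remote one so the next run skips it.
        os.utime(local_path, (remote_ts, remote_ts))
    return True


def sync_all(pairs, workers=4):
    """Sync (url, local_path) pairs across a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pair: sync_file(*pair), pairs))
```

Calling `sync_all([(url, local_path), ...])` with real URLs would return one boolean per pair, indicating which files were actually downloaded. Threads suit this job because the work is dominated by waiting on the network rather than by computation.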

Local files are not removed if they no longer exist on the remote server, so syncing remote rolling archives (e.g. realtime NOMADS) is easy.

This could probably be replicated with a single wget command, but this version adds concurrency and can be called as a function to trigger other actions.

Folders and files:
configs
tests
web_scraper
.coveragerc
.gitignore
.gitlab-ci.yml
README.md
requirements.test.txt
setup.py

abrammer/pywebscraper
