Wpull is a Wget-compatible web downloader and crawler. It can also be described as a Wget remake, clone, replacement, or alternative.
Features:
- Written in Python: lightweight & robust
- Familiar Wget options and behavior
- Graceful stopping and resuming
- Python & Lua scripting support
- Modular, extensible, & asynchronous API
- PhantomJS integration
Wpull is currently beta quality: some features are not implemented yet, and the API is not considered stable.
Requires:
- Python 2.6, 2.7, 3.2, 3.3 (or newer)
- Tornado
- Toro
- lxml
- chardet
- BeautifulSoup4
- SQLAlchemy
- Lunatic Python (bastibe version; optional, for Lua support)
- PhantomJS (optional)
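Before installing Wpull, it can help to check which of the required Python packages are already importable. The following is a minimal sketch, not part of Wpull: the module names (`tornado`, `toro`, `lxml`, `chardet`, `bs4`, `sqlalchemy`) are assumed to be the usual import names for the packages listed above, `missing_modules` is a hypothetical helper, and `importlib.util.find_spec` needs a newer Python 3 than the minimum versions listed.

```python
import importlib.util

# Assumption: these are the usual import names for the required packages;
# the requirements list itself only gives the project names.
REQUIRED = {
    "Tornado": "tornado",
    "Toro": "toro",
    "lxml": "lxml",
    "chardet": "chardet",
    "BeautifulSoup4": "bs4",
    "SQLAlchemy": "sqlalchemy",
}


def missing_modules(modules):
    """Return the sorted display names whose module cannot be located."""
    missing = []
    for name, module in modules.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(name)
    return sorted(missing)


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```

The optional requirements (Lunatic Python and PhantomJS) are left out of the check on purpose, since Wpull is described as working without them.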
Once you install the requirements, install Wpull from PyPI using pip:
pip3 install wpull
For detailed installation instructions, please see http://wpull.readthedocs.org/en/master/install.html.
To download the About page of Google.com:
wpull google.com/about
To archive a website:
wpull billy.blogsite.example --warc-file blogsite-billy \
--no-check-certificate \
--no-robots --user-agent "InconspicuousWebBrowser/1.0" \
--wait 0.5 --random-wait --waitretry 600 \
--page-requisites --recursive --level inf \
--span-hosts --domains blogsitecdn.example,cloudspeeder.example \
--hostnames billy.blogsite.example \
--reject-regex "/login\.php" \
--tries inf --retry-connrefused --retry-dns-error \
--delete-after --database blogsite-billy.db \
--quiet --output-file blogsite-billy.log
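When archiving several sites with the same settings, the command above could be assembled from a script instead of being retyped. This is only a sketch under stated assumptions: `build_archive_command` and its parameters are hypothetical names, not part of Wpull, and the options are copied unchanged from the example above.

```python
import shlex


def build_archive_command(host, name, cdn_domains, user_agent):
    """Build the archive command from the example as an argument list.

    Hypothetical helper: only the option names come from the README example.
    """
    return [
        "wpull", host,
        "--warc-file", name,
        "--no-check-certificate",
        "--no-robots", "--user-agent", user_agent,
        "--wait", "0.5", "--random-wait", "--waitretry", "600",
        "--page-requisites", "--recursive", "--level", "inf",
        "--span-hosts", "--domains", ",".join(cdn_domains),
        "--hostnames", host,
        "--reject-regex", r"/login\.php",
        "--tries", "inf", "--retry-connrefused", "--retry-dns-error",
        "--delete-after", "--database", name + ".db",
        "--quiet", "--output-file", name + ".log",
    ]


if __name__ == "__main__":
    cmd = build_archive_command(
        "billy.blogsite.example",
        "blogsite-billy",
        ["blogsitecdn.example", "cloudspeeder.example"],
        "ExampleBrowser/1.0",  # placeholder user agent for illustration
    )
    # Print a shell-quoted form for review; this sketch does not run wpull.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the arguments in a list avoids shell quoting problems with values such as the reject regex, and the list can be handed to `subprocess.call` as-is once the printed command looks right.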
To see all options:
wpull --help
Documentation is located at http://wpull.readthedocs.org/.
Need help? Please see our Help page, which contains frequently asked questions and support information.
The issue tracker is located at https://github.com/chfoo/wpull/issues.
Contributions and feedback are greatly appreciated.
Copyright 2013-2014 by Christopher Foo. Licensed under GPL v3.
This project contains third-party source code licensed under different terms:
- backport
- wpull.backport.argparse
- wpull.backport.collections
- wpull.backport.functools
- wpull.backport.tempfile
- wpull.backport.urlparse
- wpull.thirdparty.robotexclusionrulesparser
We would like to acknowledge the authors of GNU Wget, as Wpull uses algorithms from Wget.