Grab is a Python site scraping framework. It provides many helpful methods for scraping web sites and working with the scraped content:
- Automatic cookies (session) support
- HTTP and SOCKS proxy with and without authorization
- Keep-Alive support
- IDN support
- Tools to work with web forms
- Easy multipart file uploading
- Flexible customization of HTTP requests
- Automatic charset detection
- Powerful API for extracting data from HTML documents with XPath queries
- Asynchronous API for making thousands of simultaneous queries. This part of the library is called Spider, and it is too big to even list its features in this README.
- Python 3 ready
- And much, much more
- Grab was written by someone who has been doing site scraping since 2005
Check out the docs: https://github.com/lorien/grab/tree/master/docs2/source
I am working hard now (Sep 2013) to complete the documentation in English.
Example of Grab usage:
from grab import Grab
g = Grab()
g.go('https://github.com/login')
g.set_input('login', 'lorien')
g.set_input('password', '***')
g.submit()
for elem in g.doc.select('//ul[@id="repo_listing"]/li/a'):
    print('%s: %s' % (elem.text(), elem.attr('href')))
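The XPath query used above can be tried offline, without Grab or a network connection. The following is only a sketch using the standard library's `xml.etree.ElementTree` as a stand-in for `g.doc.select`. The markup, the `list_repos` helper, and the link targets are made up for illustration, and ElementTree handles only well-formed markup and a subset of XPath, so it is not a replacement for Grab's selector API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a fetched page.
HTML = """
<html>
  <body>
    <ul id="repo_listing">
      <li><a href="/lorien/grab">grab</a></li>
      <li><a href="/lorien/example">example</a></li>
    </ul>
    <ul id="other">
      <li><a href="/ignored">ignored</a></li>
    </ul>
  </body>
</html>
"""


def list_repos(markup):
    """Return (text, href) pairs for links inside the repo_listing list."""
    root = ET.fromstring(markup)
    # ElementTree expects a relative path, hence the leading '.'
    links = root.findall('.//ul[@id="repo_listing"]/li/a')
    return [(a.text, a.get('href')) for a in links]


for text, href in list_repos(HTML):
    print('%s: %s' % (text, href))
```

Only the links under the list with `id="repo_listing"` are returned; the link in the second list is skipped by the attribute predicate.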
Example of Grab::Spider usage:
from grab.spider import Spider, Task
import logging
class ExampleSpider(Spider):
    def task_generator(self):
        for lang in ('python', 'ruby', 'perl'):
            url = 'https://www.google.com/search?q=%s' % lang
            yield Task('search', url=url)

    def task_search(self, grab, task):
        print(grab.doc.select('//div[@class="s"]//cite').text())
logging.basicConfig(level=logging.DEBUG)
bot = ExampleSpider()
bot.run()
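In the Spider example, a task named 'search' is handled by the method `task_search`. The naming pattern can be illustrated with a toy, synchronous dispatcher. This is a sketch under stated assumptions, not Grab's implementation: `ToySpider` is a hypothetical class, it performs no network requests, and its handler receives only the URL rather than a downloaded document.

```python
class ToySpider(object):
    """Synchronous toy: routes each (name, url) task to task_<name>."""

    def task_generator(self):
        for lang in ('python', 'ruby', 'perl'):
            yield ('search', 'https://www.google.com/search?q=%s' % lang)

    def task_search(self, url):
        # A real handler would receive the downloaded page;
        # this toy only gets the URL.
        return 'handled %s' % url

    def run(self):
        results = []
        for name, url in self.task_generator():
            # Look up the handler by task name, e.g. 'search' -> task_search.
            handler = getattr(self, 'task_%s' % name)
            results.append(handler(url))
        return results


if __name__ == '__main__':
    for line in ToySpider().run():
        print(line)
```

The real Spider processes tasks asynchronously, which this toy does not attempt to model.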
Pip is the recommended way to install Grab and its dependencies:
$ pip install lxml
$ pip install pycurl
$ pip install grab
See details here: https://github.com/lorien/grab/blob/master/docs2/source/grab_installation.rst
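After installing, a quick check that the three packages can be located may save some debugging. The snippet below is a minimal sketch using only the standard library (Python 3.4+). The `check_modules` helper is a hypothetical name, and the check only confirms that each module can be found, not that it imports cleanly (for example, pycurl can still fail at import time if libcurl is missing).

```python
import importlib.util


def check_modules(names=('lxml', 'pycurl', 'grab')):
    """Map each module name to True if it can be located, else False."""
    return {name: importlib.util.find_spec(name) is not None
            for name in names}


if __name__ == '__main__':
    for name, found in check_modules().items():
        print('%-8s %s' % (name, 'ok' if found else 'MISSING'))
```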
Russian docs: http://docs.grablib.org
English docs in progress: https://github.com/lorien/grab/tree/master/docs2/source
Mailing List (Ru/En languages): http://groups.google.com/group/python-grab/
If you have found a bug or would like a new feature, please open a new issue on GitHub: