GitHub convenience fork of http://code.google.com/p/webscraping/
The webscraping library aims to make web scraping easier.
All code is pure Python and has been run on Linux servers, Windows machines, and Google App Engine.
    >>> from webscraping import common
    >>> common.remove_tags('hello <b>world</b>!')
    'hello world!'
    >>> common.extract_domain('http://www.google.com.au/tos.html')
    'google.com.au'
    >>> common.unescape('&lt;hello &amp; world&gt;')
    '<hello & world>'
    >>> common.extract_emails('hello richard AT sitescraper DOT net world')
    ['richard@sitescraper.net']
    >>> import urllib2
    >>> cj = common.firefox_cookie()
    >>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    >>> html = opener.open(url).read()  # use current Firefox cookies to access url
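For readers without the package installed, the behaviour of three of the helpers above can be approximated with the Python 3 standard library. This is only a rough sketch, not the library's implementation: the regex-based tag stripping and the `extract_host` name (which simply drops a leading `www.`) are assumptions made for illustration.

```python
import html
import re
from urllib.parse import urlsplit


def remove_tags(text):
    """Strip anything that looks like an HTML tag (regex based, not a full parser)."""
    return re.sub(r'<[^>]+>', '', text)


def unescape(text):
    """Convert HTML entities such as &amp; back to plain characters."""
    return html.unescape(text)


def extract_host(url):
    """Hypothetical helper: return the host of a URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ''
    return host[4:] if host.startswith('www.') else host


print(remove_tags('hello <b>world</b>!'))
print(unescape('&lt;hello &amp; world&gt;'))
print(extract_host('http://www.google.com.au/tos.html'))
```

A real implementation would likely need an HTML parser for malformed markup and a public-suffix list to find the registered domain reliably.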
    >>> from webscraping import download
    >>> D = download.Download()
    >>> # crawl given domain
    >>> domain = ...
    >>> for url in D.crawl(domain):
    ...     html = D.cache[url]
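The crawl loop above yields URLs and leaves each page's HTML in a cache. One way such a generator could work is a breadth-first walk that stays on the starting host. The sketch below is an assumption about the general technique, not the library's code: the `crawl` signature, the `fetch` callable, and the in-memory `SITE` dictionary (standing in for real downloads) are all hypothetical.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlsplit

# Hypothetical in-memory "site" standing in for real HTTP downloads.
SITE = {
    'http://example.com/': '<a href="/a.html">a</a> <a href="http://other.org/">x</a>',
    'http://example.com/a.html': '<a href="/">home</a> <a href="b.html">b</a>',
    'http://example.com/b.html': 'no links here',
}


def crawl(start_url, fetch, cache):
    """Yield each same-host URL reachable from start_url, breadth first."""
    host = urlsplit(start_url).hostname
    queue, seen = deque([start_url]), {start_url}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched, skip it
            continue
        cache[url] = html  # keep the HTML so callers can read it after the yield
        yield url
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)  # resolve relative links against the page URL
            if urlsplit(link).hostname == host and link not in seen:
                seen.add(link)
                queue.append(link)


cache = {}
visited = list(crawl('http://example.com/', SITE.get, cache))
print(visited)
```

Passing `fetch` as a parameter keeps the traversal logic separate from the download step, so the same loop can be exercised without network access.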
    >>> from webscraping import pdict
    >>> cache = pdict.PersistentDict(CACHE_FILE)
    >>> cache['a'] = range(5)  # pickle stored in sqlite database
    >>> 'a' in cache
    True
    >>> cache['a']
    [0, 1, 2, 3, 4]
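The comment above notes that values are pickled and stored in an sqlite database. A minimal sketch of that idea, using only the standard `sqlite3` and `pickle` modules, might look like the following. The `SqliteDict` class, its table layout, and the use of an in-memory database are assumptions for illustration and are not taken from `pdict`.

```python
import pickle
import sqlite3


class SqliteDict:
    """Hypothetical minimal dict-like store: string keys, pickled values."""

    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS store (key TEXT PRIMARY KEY, value BLOB)')
        self.conn.commit()

    def __setitem__(self, key, value):
        # Serialize the value and overwrite any existing row for this key.
        blob = sqlite3.Binary(pickle.dumps(value))
        self.conn.execute(
            'INSERT OR REPLACE INTO store (key, value) VALUES (?, ?)', (key, blob))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            'SELECT value FROM store WHERE key = ?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def __contains__(self, key):
        row = self.conn.execute(
            'SELECT 1 FROM store WHERE key = ?', (key,)).fetchone()
        return row is not None


cache = SqliteDict(':memory:')  # pass a file path instead to persist across runs
cache['a'] = list(range(5))
print('a' in cache)
print(cache['a'])
```

Because pickle can execute arbitrary code when loading, a store like this should only read database files from a trusted source.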
    >>> import urllib2
    >>> from webscraping import xpath
    >>> html = urllib2.urlopen(url).read()
    >>> xpath.parse(html, '/html/body/ul[2]/li[@class="info"]/div[1]')
    ['div content']
    >>> xpath.parse(html, '/html/body/ul[2]/li[@class="info"]/a/@href')
    ['url1', 'url2', 'url3']
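To see what the two queries above select, similar paths can be tried on a small, well-formed document with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. This is only an illustration under stated assumptions: the sample `PAGE` is hypothetical, ElementTree needs well-formed markup (unlike real-world HTML), and it has no `/@href` step, so attributes are read with `.get()` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this illustration.
PAGE = """
<html><body>
  <ul><li>skip</li></ul>
  <ul>
    <li class="info"><div>div content</div><a href="url1">one</a></li>
    <li class="other"><a href="ignored">x</a></li>
    <li class="info"><a href="url2">two</a><a href="url3">three</a></li>
  </ul>
</body></html>
"""

root = ET.fromstring(PAGE.strip())  # root is the <html> element

# First <div> inside each <li class="info"> of the second <ul>.
divs = [d.text for d in root.findall("./body/ul[2]/li[@class='info']/div[1]")]

# href attribute of every <a> inside those same <li> elements.
hrefs = [a.get('href') for a in root.findall("./body/ul[2]/li[@class='info']/a")]

print(divs)
print(hrefs)
```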
There are several ways to install the webscraping package:
Clone the repository: hg clone https://code.google.com/p/webscraping/
Install with pip: sudo pip install -e hg+https://code.google.com/p/webscraping/#egg=webscraping
Download zip: http://webscraping.googlecode.com/files/webscraping.zip
This code is licensed under the LGPL.