
WebCrawler

A web crawler written in Python using requests and BeautifulSoup

Installing

  • Clone this repo
$ git clone git@github.com:ishankhare07/scrapper.git && cd scrapper
  • Create a virtual environment (assuming Python 3)
$ python3 -m venv venv
$ source venv/bin/activate
  • Install the requirements with pip (again assuming pip3 for Python 3)
$ pip3 install -r requirements.txt

Using the API

  • Assuming python3 again
>>> from main import Scrapper
>>> s = Scrapper("http://news.ycombinator.com/", "hacker_news")     #url, filename to store data
>>> s.start_scrapping()
  • We can also pass a maximum recursion depth and a maximum number of URLs to scan
>>> from main import Scrapper
>>> s = Scrapper("http://news.ycombinator.com/",            #url
                "hacker_news",                              #filename to store data
                20,                                         #max-recursion depth
                30)                                         #max-urls to scan
>>> s.start_scrapping()
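The depth and URL limits above can be sketched as a bounded breadth-first traversal. This is an illustrative sketch, not the repository's actual implementation; `crawl` and the injected `get_links` callback are hypothetical names (injecting the link extractor keeps the traversal testable without network access):

```python
from collections import deque

def crawl(start_url, get_links, max_depth=20, max_urls=30):
    """Breadth-first crawl bounded by depth and by total URLs visited.

    get_links(url) should return the outgoing links of a page; in a real
    crawler it would fetch the page with requests and parse it with
    BeautifulSoup.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])   # (url, depth) pairs
    visited = []
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:        # don't expand past the depth limit
            continue
        for link in get_links(url):
            if link not in seen:      # skip already-queued pages
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With a fake link graph, `crawl("a", ...)` visits pages breadth-first until either limit is hit, which mirrors the `max-recursion depth` and `max-urls` arguments shown above.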
  • Viewing the data
>>> import shelve
>>> from pprint import pprint
>>> db = shelve.open('hacker_news')
>>> pprint(list(db.items()))
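For context, the data file is a shelve database: a persistent, dict-like mapping of string keys to pickled Python values. A minimal round-trip (the key and value below are illustrative, not the repo's actual storage schema):

```python
import os
import shelve
import tempfile

# Open (or create) a shelf; it behaves like a dict backed by a file.
path = os.path.join(tempfile.mkdtemp(), "demo_db")
with shelve.open(path) as db:
    db["http://example.com/"] = {"title": "Example Domain"}

# Reopening the shelf later retrieves the same data.
with shelve.open(path) as db:
    assert db["http://example.com/"] == {"title": "Example Domain"}
```

Because the shelf pickles its values, arbitrary Python objects (lists, dicts, parsed page data) can be stored and read back with plain dict syntax, which is why `db.items()` above works.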
