
WebCrawler

A web crawler written in Python using requests and BeautifulSoup

Installing

  • Clone this repo
$ git clone git@github.com:ishankhare07/scrapper.git && cd scrapper
  • Create a virtual environment (assuming Python 3)
$ python3 -m venv venv
$ source venv/bin/activate
  • Install the requirements with pip (again assuming pip3 for Python 3)
$ pip3 install -r requirements.txt

Using the API

  • Assuming python3 again
>>> from main import Scrapper
>>> s = Scrapper("http://news.ycombinator.com/", "hacker_news")     #url, filename to store data
>>> s.start_scrapping()
  • We can also pass a maximum recursion depth and a maximum number of URLs to scan
>>> from main import Scrapper
>>> s = Scrapper("http://news.ycombinator.com/",            #url
                "hacker_news",                              #filename to store data
                20,                                         #max-recursion depth
                30)                                         #max-urls to scan
>>> s.start_scrapping()
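The depth and URL limits above can be sketched as a bounded breadth-first traversal. This is an illustrative sketch, not the repository's actual implementation; `crawl` and the injected `get_links` callback are hypothetical names (injecting the link extractor keeps the traversal testable without network access):

```python
from collections import deque

def crawl(start_url, get_links, max_depth=20, max_urls=30):
    """Breadth-first crawl bounded by depth and by total URLs visited.

    get_links(url) should return the outgoing links of a page; in a real
    crawler it would fetch the page with requests and parse it with
    BeautifulSoup.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])   # (url, depth) pairs
    visited = []
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:        # don't expand past the depth limit
            continue
        for link in get_links(url):
            if link not in seen:      # skip already-queued pages
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With a fake link graph, `crawl("a", ...)` visits pages breadth-first until either limit is hit, which mirrors the `max-recursion depth` and `max-urls` arguments shown above.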
  • Viewing the data
>>> import shelve
>>> from pprint import pprint
>>> db = shelve.open('hacker_news')
>>> pprint(list(db.items()))
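For context, the data file is a shelve database: a persistent, dict-like mapping of string keys to pickled Python values. A minimal round-trip (the key and value below are illustrative, not the repo's actual storage schema):

```python
import os
import shelve
import tempfile

# Open (or create) a shelf; it behaves like a dict backed by a file.
path = os.path.join(tempfile.mkdtemp(), "demo_db")
with shelve.open(path) as db:
    db["http://example.com/"] = {"title": "Example Domain"}

# Reopening the shelf later retrieves the same data.
with shelve.open(path) as db:
    assert db["http://example.com/"] == {"title": "Example Domain"}
```

Because the shelf pickles its values, arbitrary Python objects (lists, dicts, parsed page data) can be stored and read back with plain dict syntax, which is why `db.items()` above works.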
