Web_Crawler

Web_crawler bot spider

Multi-threaded website crawler written in Python

This tool gathers all the hyperlinks on a web page. It can be adapted for web analytics and web scraping.

main.py

It contains the main multithreading functions, such as creating the job queue and allocating jobs to the individual threads (spiders).
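The job-queue pattern described above can be sketched roughly as follows. This is an illustrative outline, not the code in main.py: the names `run_workers`, `handle_url`, and `NUMBER_OF_THREADS` are assumptions made for this example, and only the standard-library `threading` and `queue` modules are used.

```python
import threading
from queue import Queue

NUMBER_OF_THREADS = 4  # assumed value; tune for the target site


def run_workers(urls, handle_url, number_of_threads=NUMBER_OF_THREADS):
    """Put every URL on a job queue and let worker threads process them."""
    jobs = Queue()

    def work():
        # Each worker (spider thread) pulls the next job until the program exits.
        while True:
            url = jobs.get()
            try:
                handle_url(threading.current_thread().name, url)
            finally:
                # Always mark the job done so jobs.join() can return.
                jobs.task_done()

    # Daemon threads stop automatically when the main thread exits.
    for _ in range(number_of_threads):
        threading.Thread(target=work, daemon=True).start()

    for url in urls:
        jobs.put(url)

    # Block until every queued job has been processed.
    jobs.join()
```

A caller would pass the function that crawls a single page as `handle_url`, for example `run_workers(queued_links, crawl_one_page)`, where `crawl_one_page` is a hypothetical callback taking the thread name and a URL.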

spider.py

It contains the Spider class, which holds all the required attributes, such as the crawled links, the queued links, the base URL, and the page URL. It decodes a web page into a string and calls the link finder, which returns all the hyperlinks on that page. It then updates all the files.
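A minimal sketch of such a class is shown below, assuming the shared state lives in class attributes so that every spider thread sees the same queue and crawled sets. The method names `fetch_html` and `crawl_page` and the `find_links` callback are hypothetical, and writing the sets back to disk is left out to keep the example short.

```python
from urllib.request import urlopen


class Spider:
    # Class attributes are shared by every spider thread.
    base_url = ""
    queue = set()
    crawled = set()

    @staticmethod
    def fetch_html(page_url):
        """Download a page and decode its bytes into a string."""
        response = urlopen(page_url)
        if "text/html" not in response.headers.get("Content-Type", ""):
            return ""  # skip images, PDFs and other non-HTML content
        return response.read().decode("utf-8", errors="replace")

    @classmethod
    def crawl_page(cls, page_url, find_links):
        """Crawl one page and move it from the queue to the crawled set."""
        if page_url in cls.crawled:
            return
        html = cls.fetch_html(page_url)
        for link in find_links(html):
            if link not in cls.crawled:
                cls.queue.add(link)
        # Discard after adding links, so a page that links to itself is not re-queued.
        cls.queue.discard(page_url)
        cls.crawled.add(page_url)
```

Here `find_links` stands in for whatever the link finder exposes: any callable that takes an HTML string and returns an iterable of URLs will do.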

link_finder.py

It checks for anchor ('a') tags with an 'href' attribute in the HTML string fed to it by a spider thread, and stores all the links that contain the domain name in a txt file.
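One common way to do this kind of anchor-tag scan is to subclass the standard-library `html.parser.HTMLParser`. The sketch below assumes that approach; the class name `LinkFinder` and its `links` attribute are illustrative, and filtering by domain and writing to a file are left to the caller.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkFinder(HTMLParser):
    """Collect the href of every <a> tag in the HTML fed to the parser."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser calls this once per opening tag, with attrs as (name, value) pairs.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "/about" against the base URL.
                self.links.add(urljoin(self.base_url, value))
```

Usage would look like `finder = LinkFinder("https://example.com/")`, then `finder.feed(html_string)`, after which `finder.links` holds the absolute URLs found.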

domain.py

This is used to get the domain name from a url.
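As a rough illustration, the domain can be derived with the standard-library `urllib.parse.urlparse`. The function name `get_domain_name` is an assumption, and keeping only the last two host labels is a deliberate simplification that gives the wrong answer for suffixes such as `.co.uk`.

```python
from urllib.parse import urlparse


def get_domain_name(url):
    """Return the last two labels of the host, e.g. 'example.com'."""
    host = urlparse(url).hostname or ""  # hostname is None when the URL has no host
    labels = host.split(".")
    # Naive rule: keep only the final two labels (assumes a one-part suffix).
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


print(get_domain_name("https://mail.example.com/inbox?x=1"))  # example.com
```

A spider can then keep only the links whose domain matches that of the base URL, which stops the crawl from wandering onto other sites.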

utility.py

This contains all the utility functions, such as creating a new txt file, writing to files, and converting a file to a set and a set to a file.
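The helpers listed above might look something like the following. The function names and the one-link-per-line file layout are assumptions made for this sketch, not necessarily what utility.py uses.

```python
import os


def create_file(path, data=""):
    """Create the file with initial contents, unless it already exists."""
    if not os.path.isfile(path):
        with open(path, "w", encoding="utf-8") as f:
            f.write(data)


def append_to_file(path, line):
    """Add one link to the end of an existing file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def file_to_set(path):
    """Read a file with one link per line into a set, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def set_to_file(links, path):
    """Overwrite the file with the links in the set, one per line, sorted."""
    with open(path, "w", encoding="utf-8") as f:
        for link in sorted(links):
            f.write(link + "\n")
```

Keeping the working data in sets avoids duplicate links, while the files let a crawl resume after the program is stopped.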

prvnsingh/Web_Crawler
