Install the dependencies first:

pip install -r requirements.txt
This is a primitive focused crawler in Python that crawls web pages on a particular topic. Given a query (a set of keywords) and a number n provided by the user, the crawler contacts a Google search engine API and retrieves the top 10 results for the query, called the starting pages. It then crawls outward from the starting pages using a focused strategy until a total of n pages have been collected, with most of these pages being relevant to the query/topic. Each page is crawled only once and stored in a file.
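As a minimal sketch of the focused strategy described above (assuming the starting pages have already been obtained from the search API), the crawl can be driven by a priority queue ordered by a simple keyword-overlap relevance score. All names here (relevance, fetch, crawl) are illustrative, not necessarily those used in crawler.py:

import heapq
import re
import urllib.parse
import urllib.request

def relevance(text, query_words):
    # Fraction of query keywords found in the page text (a deliberately simple score).
    lowered = text.lower()
    return sum(1 for w in query_words if w in lowered) / max(len(query_words), 1)

def fetch(url):
    # Download a page and extract its outgoing links (greatly simplified).
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    links = [urllib.parse.urljoin(url, href)
             for href in re.findall(r'href="(http[^"]+)"', html)]
    return html, links

def crawl(start_urls, query, n):
    query_words = [w.lower() for w in query.split()]
    visited, collected = set(), []
    # Max-heap via negated scores: the most promising page is expanded first.
    # Starting pages get the best possible score and depth 0.
    frontier = [(-1.0, 0, u) for u in start_urls]
    heapq.heapify(frontier)
    while frontier and len(collected) < n:
        neg_score, depth, url = heapq.heappop(frontier)
        if url in visited:
            continue                      # each page is crawled only once
        visited.add(url)
        try:
            html, links = fetch(url)
        except Exception:
            continue                      # skip pages that fail to download
        score = relevance(html, query_words)
        collected.append((url, len(html), depth, score))
        for link in links:
            if link not in visited:
                # A child inherits its parent's score as a crawl priority.
                heapq.heappush(frontier, (-score, depth + 1, link))
    return collected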
Run the crawler.py file in a terminal and enter the three parameters at the prompts:
Enter search query:
Set total number of webpages to be crawled:
Set limits on how many webpages be crawled from single site:
The first prompt, “Enter search query:”, asks for the query you want to search for. The second is the total number of webpages you want to crawl. The third sets a per-site limit, to avoid crawling too many webpages from a single site.
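For illustration, the three prompts could be read with input(), and the per-site cap enforced with a counter keyed by hostname; the variable names here are assumptions, not necessarily those in crawler.py:

from collections import Counter
from urllib.parse import urlparse

query = input("Enter search query: ")
n_pages = int(input("Set total number of webpages to be crawled: "))
site_limit = int(input("Set limits on how many webpages be crawled from single site: "))

pages_per_site = Counter()

def allowed(url):
    # Returns False once a site has used up its quota of crawled pages.
    host = urlparse(url).netloc
    if pages_per_site[host] >= site_limit:
        return False
    pages_per_site[host] += 1
    return True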
The crawler outputs a list of all visited URLs, in the order they were visited, to a file, together with information such as the size of each page, the depth of each page (its distance from the starting pages), and whether the page was relevant.
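Continuing the sketch above, writing that output file could look like the following; the tab-separated layout, file name, and relevance threshold are assumptions, and the actual format produced by crawler.py may differ:

RELEVANCE_THRESHOLD = 0.5    # assumed cutoff for marking a page as relevant

with open("crawled_pages.txt", "w", encoding="utf-8") as log:
    log.write("url\tsize_bytes\tdepth\trelevant\n")
    # 'collected' is the list of (url, size, depth, score) tuples returned by
    # crawl() in the sketch above, in the order the pages were visited.
    for url, size, depth, score in collected:
        log.write(f"{url}\t{size}\t{depth}\t{score >= RELEVANCE_THRESHOLD}\n")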