jamesjohnson92/mm-crawl

Markov Models for Focused Web Crawling

The amount of content on the World Wide Web has been growing exponentially. Systems have been built to cope with this volume of data, either by categorising it or by indexing it to make search possible.

With the rise of user-generated content in recent years, efforts have been made not only to classify the web but also to understand the data available. Data mining typically involves crawling websites and analysing the collected data with machine learning techniques.

However, crawling the web consumes a significant amount of resources. In particular, crawlers typically open several data streams to download pages simultaneously, and many of the downloaded pages carry no useful data, resulting in wasted bandwidth. Focused crawling aims to reduce these redundant I/O costs by fetching only pages deemed relevant to the requested topic.
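To make the idea concrete, here is a minimal sketch of a best-first focused crawler: candidate URLs wait in a priority queue ordered by a relevance score, and anything scoring below a threshold is discarded without ever being downloaded. All names and data here (`focused_crawl`, `relevance`, the toy graph) are illustrative assumptions, not code from this repository.

```python
import heapq
import itertools

def focused_crawl(seeds, fetch, extract_links, relevance,
                  threshold=0.5, budget=100):
    """Best-first focused crawl: always expand the most promising
    frontier URL, and never fetch pages scoring below `threshold`."""
    counter = itertools.count()  # tie-breaker for equal scores
    frontier = [(-relevance(url), next(counter), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < budget:
        neg_score, _, url = heapq.heappop(frontier)
        if -neg_score < threshold:
            continue  # deemed irrelevant: skip the download entirely
        page = fetch(url)  # the only I/O the crawler pays for
        collected.append(url)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier,
                               (-relevance(link), next(counter), link))
    return collected

# Toy run over an in-memory link graph (all data here is made up).
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
SCORES = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7}
print(focused_crawl(["a"],
                    fetch=lambda url: url,
                    extract_links=lambda page: GRAPH[page],
                    relevance=lambda url: SCORES[url]))
# -> ['a', 'b', 'd']  ("c" scores 0.1 and is never fetched)
```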

Focused crawling requires the crawler to discern wanted pages from unwanted ones, but by immediately discarding unwanted pages the crawler may also miss wanted pages that the discarded pages link to. Some element of planning therefore has to be built into the design of a focused crawler to reduce the loss of relevant pages.

The proposed project investigates how Markov Decision Processes (MDPs) can be applied to focused crawling, and how viable the approach is compared to alternatives.
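As a rough illustration of why an MDP formulation helps with the planning problem above, the sketch below treats pages as states, followed links as actions, and the relevance of the landed page as the reward, then solves the resulting deterministic MDP with value iteration. The graph, relevance scores, and discount factor are made-up assumptions for the example, not the method implemented in this repository.

```python
def value_iteration(graph, relevance, gamma=0.9, sweeps=50):
    """Score each page by its own relevance plus the discounted value of
    the best page it links to: a deterministic MDP (state = page,
    action = follow a link, reward = relevance) solved by value iteration."""
    value = {page: 0.0 for page in graph}
    for _ in range(sweeps):
        value = {
            page: relevance[page]
            + gamma * max((value[nxt] for nxt in graph[page]), default=0.0)
            for page in graph
        }
    return value

# Made-up graph: "hub" is irrelevant itself but links to two relevant pages.
GRAPH = {"hub": ["good1", "good2"], "good1": [], "good2": [], "dead": []}
RELEVANCE = {"hub": 0.0, "good1": 1.0, "good2": 1.0, "dead": 0.0}
print(value_iteration(GRAPH, RELEVANCE))
# -> {'hub': 0.9, 'good1': 1.0, 'good2': 1.0, 'dead': 0.0}
```

A greedy, threshold-based crawler like the one sketched earlier would discard the hub outright; ranking the frontier by these long-term values instead lets the crawler pay a short-term cost to reach the relevant pages behind it.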
