croawl

Crawler to predict the availability of a full text at a given URL

Goal

The Bielefeld Academic Search Engine (BASE) covers almost 100 million metadata records harvested from open repositories. For most of these records, we do not know whether they link to a full text or not. For instance, this item does not seem to be associated with a freely downloadable full text, whereas this one is.

The goal of this software is to perform this classification (not necessarily archiving the full texts when they are available, just classifying the URLs into open / closed).

How this crawler works

This crawler visits the URLs present in the OAI records stored by BASE. If one of the URLs of a record points to a PDF file, we happily mark the record as free to read. Otherwise, we expect to land on an HTML page describing the paper, such as a HAL landing page. We look for meta tags with metadata about the paper: sometimes they contain a link to the full text (as required by the Google Scholar inclusion guidelines). If this link actually leads to a PDF, we mark the paper as free to read. In all other cases, we consider the paper unavailable.
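The decision procedure above can be sketched roughly as follows. This is an illustrative sketch, not the code from this repository: `classify_record`, `PdfLinkFinder`, and the `fetch` callable (which maps a URL to a content type and a body) are hypothetical names, and the sketch assumes the full-text link is exposed through the `citation_pdf_url` meta tag described in the Google Scholar inclusion guidelines.

```python
from html.parser import HTMLParser


class PdfLinkFinder(HTMLParser):
    """Collects the content of <meta name="citation_pdf_url"> tags."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name == "citation_pdf_url" and attrs.get("content"):
            self.pdf_urls.append(attrs["content"])


def is_pdf(content_type):
    """True if a Content-Type header value denotes a PDF file."""
    return content_type.split(";")[0].strip().lower() == "application/pdf"


def classify_record(urls, fetch):
    """Return "open" if any URL leads to a PDF, directly or via a meta tag.

    `fetch` is a hypothetical callable: url -> (content_type, body_text).
    """
    for url in urls:
        content_type, body = fetch(url)
        if is_pdf(content_type):
            return "open"
        if "html" in content_type.lower():
            finder = PdfLinkFinder()
            finder.feed(body)
            # Only trust the advertised link if it really serves a PDF.
            for pdf_url in finder.pdf_urls:
                linked_type, _ = fetch(pdf_url)
                if is_pdf(linked_type):
                    return "open"
    return "closed"
```

Passing `fetch` in as a parameter keeps the classification logic separate from the HTTP layer, so it can be exercised with canned responses instead of live requests.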

Crawling 100 million URLs takes a lot of time, so we try to reduce the number of requests we need to make. This is achieved by learning URL patterns to predict the nature of a page without crawling it. For instance, we can learn that https://hal.archives-ouvertes.org/hal-[0-9]*/document returns a PDF file most of the time, so when we see such a URL we assume it is a PDF file without actually downloading it. This classifier is implemented in the urltheory module.
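To illustrate the idea of learning URL patterns, here is a deliberately simplified sketch. It does not reproduce how the urltheory module works: `generalize`, `UrlPatternModel`, and the two thresholds are hypothetical, and the only generalization assumed here is replacing runs of digits with `[0-9]*`.

```python
import re

DIGITS = re.compile(r"[0-9]+")


def generalize(url):
    """Turn a concrete URL into a pattern by abstracting digit runs."""
    return DIGITS.sub("[0-9]*", url)


class UrlPatternModel:
    """Counts crawl outcomes per URL pattern and predicts from them."""

    def __init__(self, min_samples=3, min_confidence=0.9):
        self.min_samples = min_samples        # assumed threshold
        self.min_confidence = min_confidence  # assumed threshold
        self.counts = {}  # pattern -> [pdf_count, total_count]

    def observe(self, url, is_pdf):
        """Record the outcome of an actual crawl of `url`."""
        entry = self.counts.setdefault(generalize(url), [0, 0])
        entry[0] += int(is_pdf)
        entry[1] += 1

    def predict(self, url):
        """Return True (PDF), False (not PDF), or None (must crawl)."""
        entry = self.counts.get(generalize(url))
        if entry is None or entry[1] < self.min_samples:
            return None
        ratio = entry[0] / entry[1]
        if ratio >= self.min_confidence:
            return True
        if 1 - ratio >= self.min_confidence:
            return False
        return None


model = UrlPatternModel()
for number in ("01234567", "00000042", "07654321"):
    model.observe(f"https://hal.archives-ouvertes.org/hal-{number}/document", True)
```

After these three observations, a new URL of the same shape is predicted to be a PDF without any request, while a URL matching no known pattern returns `None` and still has to be crawled.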

