Skip to content

A BlogSpot crawler designed to index all the blogs hosted on blogspot and their language.

Notifications You must be signed in to change notification settings

gidim/BSCrawler

Repository files navigation

BSCrawler

A BlogSpot crawler designed to index all the blogs hosted on blogspot and their language.

During my work at Columbia University speech lab I was requested to harvest the web for blog data in low resource languages (Kazakh, Telugu, Tok Pisin, Lithuanian, Kurmanji). Unfortunately there are not many blogs in those languages and Google/Bing does not support filtering results by all of these languages. This tool is designed to crawl the entire blogspot blog chain and index each blog and its language using Google's Compact Language Detection library. It's designed in a multi-threaded architecture with possible support for running in a cluster.

This is still work in progress so no build/install instructions yet.

#####Dependencies:

About

A BlogSpot crawler designed to index all the blogs hosted on blogspot and their language.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published