IMPORTANT NOTE: this version is no longer maintained. The new version can be found here: https://github.com/trungkak/ezcrawl

Auto deep listing web crawler

This is a web content extraction module written in Python and built heavily on lxml. It works best on deep websites (sites that return information based on what you enter), such as Amazon, StackOverflow, and eBay.

Given a front-page URL, it extracts all product or article links from that page, including those reached through pagination. It can also extract user comments about those products or articles.
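As a rough illustration of the idea (not this module's actual algorithm or API), the sketch below splits the links on a listing page into item links and pagination links. It uses only Python's standard library rather than lxml so that it runs without extra dependencies. The names `LinkCollector` and `split_links`, the `/item/` URL marker, and the "next" label heuristic are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def split_links(html, base_url, item_marker="/item/", next_labels=("next", ">", "»")):
    """Return (item_urls, pagination_urls) found in a listing page.

    Hypothetical heuristic: an href containing `item_marker` is an item,
    and an anchor whose text is one of `next_labels` is pagination.
    """
    parser = LinkCollector()
    parser.feed(html)
    items, pages = [], []
    for href, text in parser.links:
        url = urljoin(base_url, href)  # resolve relative links
        if item_marker in href:
            if url not in items:
                items.append(url)
        elif text.lower() in next_labels:
            pages.append(url)
    return items, pages


# Made-up listing page used only for this example.
SAMPLE = """
<ul>
  <li><a href="/item/1">First product</a></li>
  <li><a href="/item/2">Second product</a></li>
  <li><a href="/about">About</a></li>
</ul>
<a href="/list?page=2">Next</a>
"""

items, pages = split_links(SAMPLE, "https://example.com/list")
print(items)
print(pages)
```

A real crawler would fetch each pagination URL in turn and repeat the split until no further "next" link is found. The fixed URL marker is a stand-in here, since detecting item regions automatically is the hard part that the papers listed under Documents address.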

Documents

Bottom-Up Region Extractor for Semi-Structured Web Pages - Wachirawut Thamviset, Sartra Wongthanavasu: PDF
Demo video: YouTube
Extracting Informative Textual Parts from Web Pages Containing User-Generated Content: PDF

