An automatic crawler for deep listing websites like StackOverflow, Amazon, TedTalks, etc., with no XPath to specify

trungkak/ez-crawl

IMPORTANT NOTE: this version is no longer maintained. The new version can be found here: https://github.com/trungkak/ezcrawl

Auto deep listing web crawler

This is a web content extraction module written in Python, heavily based on the lxml library. It works best on deep websites (websites that return information based on what you enter), such as Amazon, StackOverflow, and eBay.

Given a front page URL, it extracts all product/article links from that page, including those reached through pagination. It can also extract user comments about the products/articles.
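
To give a rough feel for how listing links can be found without a hand-written XPath, here is a minimal sketch. It is not this module's actual algorithm: it uses a simplified heuristic (group links by the tag path leading to them, then take the largest group) and the standard-library html.parser instead of lxml so that it runs without extra dependencies. The names LinkPathCollector and guess_listing_links are hypothetical and do not come from this repository.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkPathCollector(HTMLParser):
    """Hypothetical helper: records each <a href> with the tag path leading to it."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stack = []                   # currently open tags, outermost first
        self.groups = defaultdict(list)   # tag path -> absolute URLs found there

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        href = dict(attrs).get("href")
        if tag == "a" and href:
            path = "/".join(self.stack)
            self.groups[path].append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def guess_listing_links(html, base_url):
    """Return the links that share the most frequent tag path.

    Assumption: on a listing page, product/article links repeat inside the
    same structure, so the largest group is likely the listing itself.
    """
    collector = LinkPathCollector(base_url)
    collector.feed(html)
    if not collector.groups:
        return []
    return max(collector.groups.values(), key=len)
```

Calling guess_listing_links(page_html, page_url) on a fetched listing page returns absolute URLs for the dominant repeated link structure. A real crawler would need more than this, for example separating pagination links from item links and handling pages where navigation menus outnumber the listed items.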

Documents

  • Bottom-Up Region Extractor for Semi-Structured Web Pages, by Wachirawut Thamviset and Sartra Wongthanavasu: PDF
  • Demo video: YouTube
  • Extracting Informative Textual Parts from Web Pages Containing User-Generated Content: PDF