Crawl a site and cluster the pages based on layout.
This has three parts:
- Crawling
- Parsing and feature extraction
- Clustering based on similarity score.
Open-source options for crawling:
- Apache Nutch
- Scrapy
- Simple crawler using BeautifulSoup or Selenium WebDriver.
Parsing, feature extraction and clustering steps:
- Parse the HTML and generate a DOM tree
- Parse the CSS and generate the set of CSS classes
- Use tree edit distance to measure similarity between DOM trees
- Use cosine / Jaccard similarity to compare the sets of CSS classes
- Cluster pages based on the similarity score.
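
The feature extraction and clustering steps above could be sketched roughly as follows, using only the Python standard library. Several choices here are assumptions for illustration and not part of the notes: the tree distance is a simplified top-down variant (whole subtrees are inserted or deleted at a cost equal to their size) and not a full Zhang-Shasha tree edit distance, CSS classes are read from the `class` attributes in the HTML instead of from parsed stylesheets, the two scores are averaged with equal weight, and pages are grouped greedily against a hypothetical `threshold` of 0.7. The names `LayoutParser`, `extract`, `similarity` and `cluster` are made up for this sketch.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LayoutParser(HTMLParser):
    """Builds a (tag, children) tree and collects CSS class names."""

    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def extract(html):
    """Return (dom_tree, css_class_set) for one page."""
    parser = LayoutParser()
    parser.feed(html)
    parser.close()
    return parser.root, parser.classes


def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(child) for child in node[1])


def tree_distance(a, b):
    """Simplified top-down tree edit distance (assumption: not Zhang-Shasha).

    Relabeling a node costs 1; inserting or deleting a subtree costs its size.
    """
    relabel = 0 if a[0] == b[0] else 1
    kids_a, kids_b = a[1], b[1]
    prev = [0]
    for y in kids_b:
        prev.append(prev[-1] + size(y))
    for x in kids_a:
        cur = [prev[0] + size(x)]
        for j, y in enumerate(kids_b, 1):
            cur.append(min(prev[j] + size(x),                  # delete subtree x
                           cur[j - 1] + size(y),               # insert subtree y
                           prev[j - 1] + tree_distance(x, y)))  # match x with y
        prev = cur
    return relabel + prev[-1]


def jaccard(s, t):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not s and not t:
        return 1.0
    return len(s & t) / len(s | t)


def similarity(feats_a, feats_b):
    """Equal-weight average of DOM similarity and CSS class similarity."""
    (tree_a, cls_a), (tree_b, cls_b) = feats_a, feats_b
    tree_sim = 1 - tree_distance(tree_a, tree_b) / (size(tree_a) + size(tree_b))
    return 0.5 * tree_sim + 0.5 * jaccard(cls_a, cls_b)


def cluster(pages, threshold=0.7):
    """Greedy clustering: pages maps URL -> HTML; returns lists of URLs."""
    clusters = []  # each entry: (features of first member, [urls])
    for url, html in pages.items():
        feats = extract(html)
        for representative, members in clusters:
            if similarity(representative, feats) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((feats, [url]))
    return [members for _, members in clusters]
```

Calling `cluster({url: html, ...})` with the HTML collected by the crawler returns one list of URLs per layout group. Comparing each page only to the first member of a cluster keeps the sketch short; a hierarchical or density-based method over the full pairwise similarity matrix would be a reasonable alternative for larger sites.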