Find duplicated content on the web
Phase 1
- apply for a Heroku account and build a new project
- add node_modules: http-server and ngrok
- install heroku-cli (https://devcenter.heroku.com/articles/getting-started-with-nodejs#set-up)
- set the Heroku app config vars
- combine the Redux toolkit with the Heroku app
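Heroku config vars are exposed to the running app as environment variables, so the parser can read them directly. A minimal sketch; the `TARGET_URL` variable name and the localhost fallback are assumptions, not the project's actual config:

```python
import os

# Heroku config vars (set with `heroku config:set NAME=value`) appear as
# environment variables at runtime. TARGET_URL is a hypothetical var name;
# the localhost default is only a local-development fallback.
target_url = os.environ.get("TARGET_URL", "http://localhost:8080")
print(target_url)
```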
Phase 2
- choose a parser tool
- use Node.js to get all the links
- change to Python
- use the scrapy or BeautifulSoup lib
- use BeautifulSoup to find duplicated links
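The "get all links, then find duplicates" step above can be sketched with the standard library alone. The project uses BeautifulSoup; this sketch substitutes the stdlib `html.parser` plus `collections.Counter` so it runs without extra dependencies, and the sample page is made up:

```python
from collections import Counter
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect every href found in <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def find_duplicated_links(html):
    """Return {href: count} for every link that appears more than once."""
    collector = LinkCollector()
    collector.feed(html)
    counts = Counter(collector.links)
    return {link: n for link, n in counts.items() if n > 1}

# Illustrative page: "/a" appears twice, "/b" once.
page = '<a href="/a">1</a><a href="/b">2</a><a href="/a">3</a>'
print(find_duplicated_links(page))  # → {'/a': 2}
```

With BeautifulSoup the collector class collapses to `[a.get("href") for a in soup.find_all("a")]`; the duplicate-counting step is the same.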
Phase 3
- build an auto shell script
- use cron or pm2 to run the parser hourly
- build an auto crawler that visits different websites
- let the user change the URL link
- use Scrapy to scrape second- or third-depth links of the homepage
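The second/third-depth crawl above amounts to a breadth-first walk with a depth cap. A minimal sketch, not the project's actual Scrapy spider: the network layer is injected as a `fetch_links(url)` callable so it can be backed by urllib, Scrapy, or (as here) a fake in-memory site:

```python
from collections import deque
from urllib.parse import urljoin

def crawl(start_url, fetch_links, max_depth=2):
    """Breadth-first crawl from the homepage, following links up to
    max_depth hops away. fetch_links(url) must return the hrefs found
    on that page; it is injected so the fetching backend is swappable."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # stop expanding past the depth cap
        for href in fetch_links(url):
            nxt = urljoin(url, href)  # resolve relative links
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited

# Usage with a hypothetical in-memory site instead of real HTTP:
site = {
    "http://ex.com/": ["/a", "/b"],
    "http://ex.com/a": ["/c"],
    "http://ex.com/b": [],
    "http://ex.com/c": [],
}
print(crawl("http://ex.com/", lambda u: site.get(u, []), max_depth=2))
# → ['http://ex.com/', 'http://ex.com/a', 'http://ex.com/b', 'http://ex.com/c']
```

For the hourly run, a crontab entry pointing at the auto script (e.g. `0 * * * * sh autoPublish.sh`) or `pm2` with a cron restart schedule covers the scheduling bullet.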
// get duplicated links and draw them on the homepage
$ python viki.py
// auto-run viki, then git commit and push the result
$ sh autoPublish.sh
We get a viki homepage with the duplicated links colored. To make duplicated contents easy to spot, the same color and number are applied to each matching component.
The result is automatically produced at /src/resource/viki_20161220_00000.html.
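The same-color-same-number marking described above can be sketched as follows. This is an assumption about how the report might be generated, not the project's actual rendering code; the function names and the color palette are made up:

```python
from itertools import cycle

def colorize_duplicates(duplicated_links, palette=("#e74c3c", "#3498db", "#2ecc71")):
    """Assign each duplicated link a (color, number) pair so matching
    components share the same marker. Palette values are illustrative."""
    colors = cycle(palette)
    styled = {}
    for number, link in enumerate(sorted(duplicated_links), start=1):
        styled[link] = (next(colors), number)
    return styled

def render_report(styled):
    """Render a minimal HTML page with each duplicated link colored
    and numbered, in the spirit of the generated viki homepage."""
    rows = [
        f'<a href="{link}" style="color:{color}">[{num}] {link}</a>'
        for link, (color, num) in styled.items()
    ]
    return "<html><body>" + "<br>".join(rows) + "</body></html>"

styled = colorize_duplicates({"/a", "/b"})
print(render_report(styled))
```

Every occurrence of a duplicated link looks up the same `(color, number)` pair, so repeats are visually grouped on the page.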
https://github.com/YanlongLai/python_web_lint