ziy212/HeadlessBrowser

Steps to run the service (a scripted sketch follows this list):
1. start MongoDB and create the db: webcontents
2. prepare.sh               # install prerequisites (not a complete list)
3. nodemon db_server.js     # MongoDB db service (port 4040)
4. python phantom_manager.py log_dir phantom_worker.js_path
                            # the manager starts a server (port 8082) that
                            # receives web-content crawl/execution tasks
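A minimal startup sketch under stated assumptions: mongod is installed
locally, and "logs/" and "phantom_worker.js" are placeholders for your
actual log directory and worker-script path (neither name comes from the
repo):

    mongod --fork --logpath logs/mongod.log      # 1. start MongoDB
    ./prepare.sh                                 # 2. install prerequisites
    nodemon db_server.js &                       # 3. db service on port 4040
    screen -dmS manager \
      python phantom_manager.py logs/ phantom_worker.js
                                                 # 4. task manager on port 8082,
                                                 #    in a detached screen
                                                 #    session named "manager"

The webcontents db itself is created on first write, e.g. by the
createIndex command listed under "Some commands" below.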


Steps to train a website (see the scripted sketch after this list):
1. modify spider.py to give a domain name
2. scrapy runspider spider.py > evaluation/tmp/domain.txt
3. python processURL.py evaluation/tmp/domain.txt evaluation/urls/
4. python send_get_req.py evaluation/urls/domain_train.txt
   python send_get_req.py evaluation/urls/domain_test.txt
5. python processScript.py evaluation/urls/domain_train.txt # extract scripts
   python processScript.py evaluation/urls/domain_test.txt  # extract scripts
6. train: mkdir tmp/domain; python template.py evaluation/urls/domain_train.txt tmp/domain
7. test:  python handler.py domain evaluation/urls/domain_test.txt
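A hedged, end-to-end sketch of the pipeline above for a single site.
DOMAIN is a placeholder that must match the name you hard-coded into
spider.py in step 1 (nothing in the repo suggests spider.py takes the
domain on the command line):

    DOMAIN=example
    scrapy runspider spider.py > evaluation/tmp/$DOMAIN.txt
    python processURL.py evaluation/tmp/$DOMAIN.txt evaluation/urls/
    python send_get_req.py evaluation/urls/${DOMAIN}_train.txt
    python send_get_req.py evaluation/urls/${DOMAIN}_test.txt
    python processScript.py evaluation/urls/${DOMAIN}_train.txt
    python processScript.py evaluation/urls/${DOMAIN}_test.txt
    mkdir -p tmp/$DOMAIN
    python template.py evaluation/urls/${DOMAIN}_train.txt tmp/$DOMAIN
    python handler.py $DOMAIN evaluation/urls/${DOMAIN}_test.txt

Note that send_get_req.py only queues rendering work with phantom_manager;
wait for it to finish (see the CHECK SCRIPT notes) before running
processScript.py.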

=======CHECK SCRIPT================
1. modify spider.py to give a domain name
2. scrapy runspider spider.py > evaluation/tmp/`domain`.txt	# crawl URLs
3. python processURL.py evaluation/tmp/`domain`.txt evaluation/urls/	# process the URL list
4. python send_get_req.py evaluation/urls/`domain`_train.txt	# render those URLs in the headless browser
   # the requests are sent to phantom_manager to execute; wait for it to finish
   # screen -r manager	# reattach to check whether tasks are still running;
   #                    # "no working process ..." means the manager is idle
   # Ctrl+A, then D, detaches from the screen session again
5. python processScript.py evaluation/urls/`domain`_train.txt	# extract scripts from the pages
6. mkdir ./tmp/`domain`; python template.py evaluation/urls/`domain`_train.txt ./tmp/`domain`
7. read the contents of ./tmp/`domain`/debug
   An example is at ./tmp/pawprintpets/debug and its analysis is at ./tmp/pawprintpets/analysis
   format (see the sketch after this list):
     [number: key
       --EXAMPLE-- example]
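A small hypothetical sketch for skimming a debug file, assuming each
entry opens with a "[number: key" line exactly as in the format above:

    # print the key of every entry in the example debug file
    sed -n 's/^[[:space:]]*\[\([0-9][0-9]*\): *\(.*\)/\2/p' ./tmp/pawprintpets/debug | head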
	



Some commands:
# kill all running python processes (use with care)
ps aux | grep "python" | grep -v "grep" | awk '{print $2}' | xargs kill -9
# run the crawler
scrapy runspider spider.py
# mongo shell: enforce one tree per (domain, key) pair
db.trees.createIndex({domain:1, key:1}, {unique:true})
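The index can also be created non-interactively, assuming a local mongod
and the webcontents db from the setup steps:

    mongo webcontents --eval 'db.trees.createIndex({domain:1, key:1}, {unique:true})'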
