ziy212/HeadlessBrowser

Steps to run the service (a scripted sketch follows this list):
1. start MongoDB and create the db: webcontents
2. prepare.sh               # install prerequisites (not a complete list)
3. nodemon db_server.js     # MongoDB db service (port 4040)
4. python phantom_manager.py log_dir phantom_worker.js_path
                            # the manager starts a server (port 8082) that
                            # receives web-content crawl/execution tasks
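A minimal startup sketch under stated assumptions: mongod is installed
locally, and "logs/" and "phantom_worker.js" are placeholders for your
actual log directory and worker-script path (neither name comes from the
repo):

    mongod --fork --logpath logs/mongod.log      # 1. start MongoDB
    ./prepare.sh                                 # 2. install prerequisites
    nodemon db_server.js &                       # 3. db service on port 4040
    screen -dmS manager \
      python phantom_manager.py logs/ phantom_worker.js
                                                 # 4. task manager on port 8082,
                                                 #    in a detached screen
                                                 #    session named "manager"

The webcontents db itself is created on first write, e.g. by the
createIndex command listed under "Some commands" below.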


Steps to train a website (see the scripted sketch after this list):
1. modify spider.py to give a domain name
2. scrapy runspider spider.py > evaluation/tmp/domain.txt
3. python processURL.py evaluation/tmp/domain.txt evaluation/urls/
4. python send_get_req.py evaluation/urls/domain_train.txt
   python send_get_req.py evaluation/urls/domain_test.txt
5. python processScript.py evaluation/urls/domain_train.txt # extract scripts
   python processScript.py evaluation/urls/domain_test.txt  # extract scripts
6. train: mkdir tmp/domain; python template.py evaluation/urls/domain_train.txt tmp/domain
7. test:  python handler.py domain evaluation/urls/domain_test.txt
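A hedged, end-to-end sketch of the pipeline above for a single site.
DOMAIN is a placeholder that must match the name you hard-coded into
spider.py in step 1 (nothing in the repo suggests spider.py takes the
domain on the command line):

    DOMAIN=example
    scrapy runspider spider.py > evaluation/tmp/$DOMAIN.txt
    python processURL.py evaluation/tmp/$DOMAIN.txt evaluation/urls/
    python send_get_req.py evaluation/urls/${DOMAIN}_train.txt
    python send_get_req.py evaluation/urls/${DOMAIN}_test.txt
    python processScript.py evaluation/urls/${DOMAIN}_train.txt
    python processScript.py evaluation/urls/${DOMAIN}_test.txt
    mkdir -p tmp/$DOMAIN
    python template.py evaluation/urls/${DOMAIN}_train.txt tmp/$DOMAIN
    python handler.py $DOMAIN evaluation/urls/${DOMAIN}_test.txt

Note that send_get_req.py only queues rendering work with phantom_manager;
wait for it to finish (see the CHECK SCRIPT notes) before running
processScript.py.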

=======CHECK SCRIPT================
1. modify spider.py to give a domain name
2. scrapy runspider spider.py > evaluation/tmp/`domain`.txt	# crawl URLs
3. python processURL.py evaluation/tmp/`domain`.txt evaluation/urls/	# process the URL list
4. python send_get_req.py evaluation/urls/`domain`_train.txt	# render those URLs in the headless browser
   # the requests are sent to phantom_manager to execute; wait for it to finish
   # screen -r manager	# reattach to check whether tasks are still running;
   #                    # "no working process ..." means the manager is idle
   # Ctrl+A, then D, detaches from the screen session again
5. python processScript.py evaluation/urls/`domain`_train.txt	# extract scripts from the pages
6. mkdir ./tmp/`domain`; python template.py evaluation/urls/`domain`_train.txt ./tmp/`domain`
7. read the contents of ./tmp/`domain`/debug
   An example is at ./tmp/pawprintpets/debug and its analysis is at ./tmp/pawprintpets/analysis
   format (see the sketch after this list):
     [number: key
       --EXAMPLE-- example]
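A small hypothetical sketch for skimming a debug file, assuming each
entry opens with a "[number: key" line exactly as in the format above:

    # print the key of every entry in the example debug file
    sed -n 's/^[[:space:]]*\[\([0-9][0-9]*\): *\(.*\)/\2/p' ./tmp/pawprintpets/debug | head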
	



Some commands:
# kill all running python processes (use with care)
ps aux | grep "python" | grep -v "grep" | awk '{print $2}' | xargs kill -9
# run the crawler
scrapy runspider spider.py
# mongo shell: enforce one tree per (domain, key) pair
db.trees.createIndex({domain:1, key:1}, {unique:true})
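The index can also be created non-interactively, assuming a local mongod
and the webcontents db from the setup steps:

    mongo webcontents --eval 'db.trees.createIndex({domain:1, key:1}, {unique:true})'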
