Skip to content

eason001/adBot

Repository files navigation

adBot v 1.3.2

Input: a list of urls Each url should follow the format xxx.domain.xxx in order to be parsed correctly by the ibot.py script. The input file name should be urls.txt.

Output: screenshots and html source code ibot stores each url's screenshot and source code base on its domain name in "domain.png" and "domain.txt" format respectively. They are stored under the output path specified in the argument. Images are located under <output_path>/data/img and texts are under <output_path>/data/src. If domain name is repeated, then it adds a numeric value in the end as domainX.png and domainX.txt (where X is the auto increment numeric value).

How to run: python ibot.py <output_path> <number of jobs (optional)> ibot.py grabs a list of urls and save screenshots and source codes in the local disk, under /img and /src respectively. The default timeout for loading a page is 30 sec. The maximum number of jobs is equal to the total number of available CPU cores. If no argument is specified, then ibot.py runs in single processor mode by default.

About

Scikit image, Computer Vision, pySpark, Selenium, parallel computing, distributed computing

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published