Don't forget to add a new guesser for the spider in the scraper-ui repository. The guesser class is found at application/models/guesser.php.
The spiders use Scrapy.
A limited number of spiders have unittest tests in the brightcorp/test/spiders/ directory. For example:
python -m unittest brightcorp.test.spiders.test_autocrawl
The robotparser determines whether we can scrape a site according to its robots.txt. This code is also translated to JavaScript in the Robots Tester Chrome extension and in Dharma. Please propagate any changes to those repos.
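As a rough illustration of the kind of check a robots.txt parser performs, here is a minimal sketch using Python's standard-library urllib.robotparser. This is not the project's own robotparser, whose rules may differ. The function name can_scrape, the sample robots.txt, and the example.com URLs are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_scrape(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt allows user_agent to fetch url.

    Hypothetical helper: it wraps the standard library parser and is
    not the project's robotparser.
    """
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines,
    # so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(can_scrape(EXAMPLE_ROBOTS_TXT, "https://example.com/jobs/"))
    print(can_scrape(EXAMPLE_ROBOTS_TXT, "https://example.com/private/page"))
```

A small set of fixed inputs like this can also serve as a shared fixture when checking that the Python and JavaScript versions still agree after a change.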
See go/foragerdeploy.
ansible scraper -m shell -a df -i /scraper-ui/tmp/scraper.hosts
for ip in $(cat /scraper-ui/tmp/scraper.hosts); do ssh -t "$ip" "sudo rm -rf /var/log/scrapyd/brightcorp/*"; done