deepak64/scraper-test

Feedwizard Scrapers

After creating a scraper

Don't forget to add a new guesser for the spider in the scraper-ui repository. The guesser class is found at application/models/guesser.php.

The spiders use Scrapy.

Running tests

A limited number of spiders have unittest tests in the brightcorp/test/spiders/ directory. For example:

python -m unittest brightcorp.test.spiders.test_autocrawl
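For orientation, a test module in this style might look roughly like the sketch below. It is not taken from the repository: `extract_titles` and `TestExtractTitles` are hypothetical names, and the regex-based function only stands in for a spider's parse step so that the example runs without Scrapy installed.

```python
import re
import unittest


def extract_titles(html):
    """Hypothetical stand-in for a spider's parse step.

    Returns the stripped text of every <h2> element in the page.
    """
    matches = re.findall(r"<h2[^>]*>(.*?)</h2>", html, flags=re.S)
    return [m.strip() for m in matches]


class TestExtractTitles(unittest.TestCase):
    """Hypothetical test case; real tests live in brightcorp/test/spiders/."""

    def test_finds_all_titles(self):
        html = "<h2>Engineer</h2><p>x</p><h2> Analyst </h2>"
        self.assertEqual(extract_titles(html), ["Engineer", "Analyst"])

    def test_empty_page(self):
        self.assertEqual(extract_titles("<p>no jobs</p>"), [])
```

Saved as a module under a package, a file like this would be run the same way as the command above, by passing its dotted module path to `python -m unittest`.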

Robots.txt

The robotparser determines whether we can scrape a site according to its robots.txt. This code is translated to JavaScript in the Robots Tester Chrome extension and in Dharma. Please propagate any changes to those repos.
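As an illustration of the kind of check involved, here is a minimal sketch that uses Python's standard-library `urllib.robotparser` as a stand-in. The project's own robotparser may behave differently; the helper name `can_scrape`, the sample rules, and the example.com URLs are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def can_scrape(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper built on the standard library parser,
    not the project's own robotparser.
    """
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules: everything is allowed except paths under /private/
rules = """User-agent: *
Disallow: /private/
"""

print(can_scrape(rules, "https://example.com/jobs/1"))
print(can_scrape(rules, "https://example.com/private/x"))
```

With these sample rules the first URL is permitted and the second is blocked. Any rule-matching change made in the Python code would need the same treatment in the JavaScript ports.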

Deployment

See go/foragerdeploy.

To check disk space of mining nodes

ansible scraper -m shell -a df -i /scraper-ui/tmp/scraper.hosts

Delete logs

for ip in $(cat /scraper-ui/tmp/scraper.hosts); do ssh -t "$ip" "sudo rm -rf /var/log/scrapyd/brightcorp/*"; done
