nikitautiu/barrel

barrel

General purpose email/keyword regex crawler for non-illicit purposes.

How to run

Simply install the requirements. The project adds a new scrapy command, barrel, which can be used to perform broad crawls. It takes the same options as scrapy crawl and adds a couple of additional arguments.

To do a run:

  • Install the requirements
  • Create a local_settings.py file at the root of the project to override the default keywords.
from barrel.settings.settings import *

# this checks if the letters a and b appear on the page
KEYWORD_ITEMS = {
    'a': r'a',
    'b': r'b'
}

# this collects all numbers inside paragraphs
COLLECT_ITEMS = {
    'numbers': {'regex': r'[0-9]+', 'css': 'p'} 
}

For more info on the syntax, check barrel.extractor.

  • Run with these settings:
SCRAPY_SETTINGS_MODULE=local_settings scrapy barrel http://someurl.com/

Note: All settings can be overridden with the -s option, just like with scrapy crawl.
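
To get a feel for what the KEYWORD_ITEMS and COLLECT_ITEMS settings from local_settings.py describe, here is a minimal standalone sketch using only the Python standard library. It is not the code in barrel.extractor: the names TagText, check_keywords and collect are hypothetical, and the 'css' value is assumed to be a bare tag name such as 'p', which is much narrower than a real CSS selector.

```python
import re
from html.parser import HTMLParser

KEYWORD_ITEMS = {'a': r'a', 'b': r'b'}
COLLECT_ITEMS = {'numbers': {'regex': r'[0-9]+', 'css': 'p'}}


class TagText(HTMLParser):
    """Collects text inside one tag name (hypothetical stand-in for a CSS selector)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # how many matching tags are currently open
        self.chunks = []  # text fragments found inside matching tags

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def check_keywords(text, keywords):
    """Map each keyword name to True if its regex matches anywhere in text."""
    return {name: re.search(pattern, text) is not None
            for name, pattern in keywords.items()}


def collect(html, items):
    """Map each item name to all regex matches found inside the selected tag."""
    results = {}
    for name, spec in items.items():
        parser = TagText(spec['css'])
        parser.feed(html)
        parser.close()
        results[name] = re.findall(spec['regex'], ' '.join(parser.chunks))
    return results


page = '<h1>Order 99</h1><p>abc 555 and 1234</p>'
print(check_keywords(page, KEYWORD_ITEMS))
print(collect(page, COLLECT_ITEMS))
```

In this sketch the 99 in the heading is ignored because it sits outside a paragraph, which mirrors the "numbers inside paragraphs" comment in the settings file.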

EXAMPLE: To do a domain-wide, JavaScript-enabled crawl that exports the results to a jsonlines file, run the following command:

SCRAPY_SETTINGS_MODULE=local_settings scrapy barrel -j -o a.json -t jsonlines -d 0 http://url.com
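
Since the jsonlines format stores one JSON object per line, the exported file can be read back with a few lines of Python. The sketch below is an illustration only: read_jsonlines is a hypothetical helper name, and nothing is assumed about which fields each exported item contains.

```python
import json


def read_jsonlines(path):
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For the command above, iterating over read_jsonlines('a.json') would yield one dictionary per exported item.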
