A data processing pipeline that schedules and runs content harvesters, normalizes their data, and outputs that normalized data to a variety of output streams. Data collected can be explored at https://osf.io/share/, and viewed at https://osf.io/api/v1/share/search/. Developer docs can be viewed at https://osf.io/wur56/wiki

scrapi

master build status: Build Status

develop build status: Build Status


Getting started

  • To run absolutely everything, you will need to:
    • Install requirements.
    • Install Elasticsearch
    • Install Cassandra
    • Install harvesters
    • Install rabbitmq (optional)
  • To run only the harvesters locally, you do not need to install rabbitmq

Requirements

  • Create and enter a virtual environment for scrapi (see the sketch at the end of this section if you haven't done this before), and go to the top-level project directory. From there, run
$ pip install -r requirements.txt

Or, if you'd like some nicer testing and debugging utilities in addition to the core requirements, run

$ pip install -r dev-requirements.txt

This will also install the core requirements like normal.
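
If you haven't set up a virtual environment before, the steps look roughly like this (a sketch using the virtualenv tool; the environment name scrapi-env is only an example):

# Create and activate an isolated environment for scrapi
$ pip install virtualenv
$ virtualenv scrapi-env
$ source scrapi-env/bin/activate

# Then, from the top-level project directory, install the requirements as above
$ pip install -r requirements.txt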

Installing Cassandra and Elasticsearch

note: JDK 7 must be installed for Cassandra and Elasticsearch to run

note: As long as you don't specify Cassandra or Elasticsearch in your local.py and set RECORD_HTTP_TRANSACTIONS to False, you shouldn't need either of them installed to get at least basic functionality working

Mac OSX

$ brew install cassandra
$ brew install elasticsearch

Ubuntu

Install Cassandra
  1. Check which version of Java is installed by running the following command:

    $ java -version

    Use the latest version of Oracle Java 7 on all nodes.

  2. Add the DataStax Community repository to /etc/apt/sources.list.d/cassandra.sources.list.

    $ echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
  3. Add the DataStax repository key to your aptitude trusted keys.

    $ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -
  4. Install the package.

    $ sudo apt-get update
    $ sudo apt-get install dsc20=2.0.11-1 cassandra=2.0.11
Install Elasticsearch
  1. Download and install the Public Signing Key.

    $ wget -qO - https://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -
  2. Add the Elasticsearch repository to your /etc/apt/sources.list.

    $ sudo add-apt-repository "deb http://packages.elasticsearch.org/elasticsearch/1.4/debian stable main"
  3. Install the package.

    $ sudo apt-get update
    $ sudo apt-get install elasticsearch
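
On Ubuntu, the packages installed above typically register init scripts as well, so you can alternatively manage both services with the service command (assuming a stock package install):

$ sudo service cassandra start
$ sudo service elasticsearch start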


Now, just run

$ cassandra
$ elasticsearch

Or, if you'd like Cassandra to run in the foreground, bound to your current terminal session, run:

$ cassandra -f

and you should be good to go.
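
To check that both services came up (assuming default ports and a stock configuration), you can do something like:

# Elasticsearch answers on port 9200 by default; a small JSON status document means it's up
$ curl http://localhost:9200

# Cassandra ships with nodetool; a node listed as UN (Up/Normal) means it's running
$ nodetool status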

(Note: if you're developing locally, you do not have to run rabbitmq!)

Rabbitmq (optional)

Mac OSX

$ brew install rabbitmq

Ubuntu

$ sudo apt-get install rabbitmq-server
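
Once installed, start the broker. These commands assume the default Homebrew and Ubuntu package layouts:

# Mac OSX (runs in the foreground; assumes Homebrew's sbin directory is on your PATH)
$ rabbitmq-server

# Ubuntu (the package installs rabbitmq-server as a service)
$ sudo service rabbitmq-server start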

Settings

You will need a local copy of the settings. Copy local-dist.py into your own local.py:

cp scrapi/settings/local-dist.py scrapi/settings/local.py

If you installed Cassandra and Elasticsearch earlier, you will want to add the following configuration to your local.py:

RECORD_HTTP_TRANSACTIONS = True  # Only if cassandra is installed

NORMALIZED_PROCESSING = ['cassandra', 'elasticsearch']
RAW_PROCESSING = ['cassandra']

Otherwise, you will want to make sure your local.py has the following configuration:

RECORD_HTTP_TRANSACTIONS = False

NORMALIZED_PROCESSING = ['storage']
RAW_PROCESSING = ['storage']

This will save all harvested/normalized files to the directory archive/<source>/<document identifier>.

note: Be careful with this; if you harvest too many documents with the storage module enabled, you could start running into inode errors
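
Purely for illustration, a run with the storage module enabled produces a layout along these lines (the names below are placeholders, not real output):

archive/
└── <source>/
    └── <document identifier>/
        ├── <raw document>
        └── <normalized document>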

If you'd like to be able to run all harvesters, you'll need to register for a PLOS API key.

Add the following line to your local.py file:

PLOS_API_KEY = 'your-api-key-here'

Running the scheduler (optional)

  • From the top-level project directory, run
$ invoke beat

to start the scheduler, and

$ invoke worker

to start the worker.
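
In practice that means two terminal sessions, both started from the top-level project directory:

# Terminal 1: the beat process schedules the harvester runs
$ invoke beat

# Terminal 2: the worker picks up and executes the scheduled tasks
$ invoke worker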

Harvesters

Run all harvesters with

$ invoke harvesters

or, just one with

$ invoke harvester harvester-name

Note: harvester-name is the same as the defined harvester "short name".

Invoke a harvester a certain number of days back with the --days argument. For example, to run a harvester 5 days in the past, run:

$ invoke harvester harvester-name --days=5

Working with the OSF

  • To run on the OSF, type

$ inv provider_map 

Testing

  • To run the tests for the project, just type
$ invoke test

and all of the tests in the 'tests/' directory will be run.
