Client interface for Scrapinghub API

Requirements

Python 2.6 or above
Requests library

Usage

The scrapinghub module is a Python library for communicating with the Scrapinghub API.

First, you connect to Scrapinghub:

>>> from scrapinghub import Connection
>>> conn = Connection('APIKEY')
>>> conn
Connection('APIKEY')

You can list the projects available to your account:

>>> conn.project_ids()
[u'123', u'456']

And select a particular project to work with:

>>> project = conn['123']
>>> project
Project(Connection('APIKEY'), '123')
>>> project.name
'123'

To schedule a spider run (it returns the job id):

>>> project.schedule('myspider', arg1='val1')
u'4ca37770a1a3a24c45000005'

To get the list of spiders in the project:

>>> project.spiders()
[u'spider1', u'spider2']

To get all finished jobs:

>>> jobs = project.jobs(state='finished')

jobs is a JobSet. JobSet objects are iterable and, when iterated, return an iterable of Job objects, so you typically use it like this:

>>> for job in jobs:
...     # do something with job

Or, if you just want to get the job ids:

>>> [x.id for x in jobs]
[u'4c916f80e8bd6f68c2000000', u'4c9170fae8bd6f6cac000000', u'4c9190bde8bd6f761c000000']

To select a specific job:

>>> job = project.job('4cdacfe7a1a3a27d7a000000')
>>> job.id
'4cdacfe7a1a3a27d7a000000'

To retrieve all scraped items from a job:

>>> for item in job.items():
...     # do something with item (it's just a dict)

To retrieve all log entries from a job:

>>> for logitem in job.log():
...     # logitem is a dict with logLevel, message, time

To get job info:

>>> job.info['spider']
'myspider'
>>> job.info['started_time']
'2010-09-28T15:09:57.629000'
>>> job.info['tags']
[]
>>> job.info['fields_count]['description']
1253

To mark a job with tag consumed:

>>> job.update(add_tag='consumed')

To mark several jobs with tag consumed (JobSet also supports the update() method):

>>> project.jobs(state='finished').update(add_tag='consumed')

To delete a job:

>>> job.delete()

To delete several jobs (JobSet also supports the update() method):

>>> project.jobs(state='finished').delete()

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.travis.yml		.travis.yml
NEWS		NEWS
README.rst		README.rst
requirements.txt		requirements.txt
scrapinghub.py		scrapinghub.py
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.travis.yml

.travis.yml

NEWS

NEWS

README.rst

README.rst

requirements.txt

requirements.txt

scrapinghub.py

scrapinghub.py

setup.py

setup.py

Repository files navigation

Client interface for Scrapinghub API

Requirements

Usage

About

Releases

Packages

alepharchives/python-scrapinghub

Folders and files

Latest commit

History

Repository files navigation

Client interface for Scrapinghub API

Requirements

Usage

About

Resources

Stars

Watchers

Forks