scrapy-streamitem

Overview

Scrapy support for working with streamcorpus StreamItems.

Includes the following:

StreamItem: Scrapy item definition for stream items. streamitem.items.StreamItem
StreamItemLoader: Scrapy ItemLoader for StreamItem. streamitem.loaders.StreamItemLoader
StreamItemExporter: Scrapy ItemExporter that writes .sc files. streamitem.exporters.StreamItemExporter
StreamItemFileFeedStorage: Scrapy FileFeedStorage that handles .sc files. streamitem.storages.StreamItemFileFeedStorage

Stream Items

A Scrapy StreamItem is populated from the response with the following fields:

url: A string containing the URL of the response.
body: A string containing the body of the response.
source_url: If the response has been redirected, a string containing the URL of the original page. Defaults to None.
redirect_urls: If the response has been redirected, a list containing the URLs of all the redirected pages, including the current one. Defaults to None.
http_status: An integer representing the HTTP status of the response. Example: 200, 404.
content_type: A string containing the Content-Type HTTP header of the response.
response_size: An integer representing the response body size in bytes.
metadata: A dict containing arbitrary metadata for this page.
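
To make the field list concrete, here is a minimal sketch of the values a populated item might carry. It uses a plain dict instead of the real StreamItem class. The helper name example_fields and its redirect_chain argument are hypothetical and only illustrate the shape of the data, not how StreamItemLoader actually derives each value.

```python
def example_fields(body, redirect_chain, http_status, content_type):
    """Build a dict shaped like the fields listed above (illustrative only).

    redirect_chain is the list of URLs visited, ending with the final URL.
    body is expected to be a byte string, so len(body) is a size in bytes.
    """
    redirected = len(redirect_chain) > 1
    return {
        "url": redirect_chain[-1],
        "body": body,
        # Both redirect fields default to None when no redirect happened.
        "source_url": redirect_chain[0] if redirected else None,
        "redirect_urls": list(redirect_chain) if redirected else None,
        "http_status": http_status,
        "content_type": content_type,
        "response_size": len(body),
        "metadata": {},
    }
```

Calling example_fields(b"<html>ok</html>", ["http://example.com/old", "http://example.com/new"], 200, "text/html") yields an item whose url is the final page and whose source_url is the page originally requested.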

When items are exported, each one produces a streamcorpus StreamItem with the following fields filled in:

abs_url: item.url
source_url: item.source_url
body.raw: item.body
body.media_type: item.content_type
body.language.code: item.metadata.language_code
body.language.name: item.metadata.language_name
source_metadata['redirect_urls']: item.redirect_urls
source_metadata['response_size']: item.response_size
source_metadata: also receives every field in item.metadata
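
The mapping above can be sketched as a small function. This is an illustration under stated assumptions, not the exporter's actual code: a nested dict stands in for the streamcorpus StreamItem, the Scrapy item is treated as a plain dict, and the function name to_streamcorpus_fields is hypothetical.

```python
def to_streamcorpus_fields(item):
    """Map a Scrapy-side item (a dict here) to the nested layout listed above."""
    metadata = dict(item.get("metadata") or {})
    # Start from every metadata entry, then add the two response-derived keys.
    source_metadata = dict(metadata)
    source_metadata["redirect_urls"] = item.get("redirect_urls")
    source_metadata["response_size"] = item.get("response_size")
    return {
        "abs_url": item.get("url"),
        "source_url": item.get("source_url"),
        "body": {
            "raw": item.get("body"),
            "media_type": item.get("content_type"),
            # Language information is read from the metadata dict.
            "language": {
                "code": metadata.get("language_code"),
                "name": metadata.get("language_name"),
            },
        },
        "source_metadata": source_metadata,
    }
```

Language values missing from metadata simply come out as None in this sketch; the real exporter may handle that case differently.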

How to use it

An example of use from a spider:

from streamitem.items import StreamItem
from streamitem.loaders import StreamItemLoader

def parse_page(self, response):
    loader = StreamItemLoader(item=StreamItem(), response=response)
    return loader.load_item()

Settings for exporting:

FEED_URI = ".exports/streamitems.sc"
FEED_FORMAT = "streamcorpus"
FEED_EXPORTERS = {
    'streamcorpus': 'streamitem.exporters.StreamItemExporter',
}
FEED_STORAGES = {
    '': 'streamitem.storages.StreamItemFileFeedStorage',
}

You can also attach additional information to an item through the metadata field. For example, from an item pipeline:

def process_item(self, item, spider):
    item['metadata']['my_custom_field'] = 'whatever'
    return item
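
A slightly fuller sketch of such a pipeline follows. The class name LanguageMetadataPipeline and its fixed language values are hypothetical. It sets the language_code and language_name metadata keys, which the export mapping listed earlier reads for body.language, and it creates the metadata dict when the item does not have one yet.

```python
class LanguageMetadataPipeline(object):
    """Hypothetical pipeline that tags every item with a fixed language."""

    def __init__(self, code="en", name="English"):
        self.code = code
        self.name = name

    def process_item(self, item, spider):
        # Create the metadata dict if the item does not carry one yet.
        metadata = item.get("metadata") or {}
        metadata["language_code"] = self.code
        metadata["language_name"] = self.name
        item["metadata"] = metadata
        return item
```

As with any Scrapy item pipeline, the class would need to be enabled through the ITEM_PIPELINES setting of the project before it runs.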

Requirements

Scrapy >= 0.22.0
streamcorpus

Install

Using PyPI:

pip install scrapy-streamitem
