Django Scraper is obsolete

Django Scraper was written for scrapy 0.7. Since then, scrapy 0.8 has come out with many improvements, several of which make Django Scraper architecturally obsolete. I abandoned this project a while ago.

If you’re looking for this kind of functionality, I would recommend looking into celery in combination with the latest version of scrapy. This would give you a scalable implementation of task-based scraping.

Django Scraper app – djangoscraper

Introduction

Django Scraper app is an integration of the Django web framework and the Scrapy web crawling framework. It was created to simplify
scraping of large websites that contain a variety of data that needs to be extracted in different ways.

As I began working with scrapy, I found it difficult to manage the complexity of the website I was trying to scrape,
because the scrapy architecture requires one spider per domain. This constraint made it difficult for me to structure the
code in a clear and modular way, because the code for all of the scraping tasks had to live in the same spider.

I prefer to think of spiders as having tasks. This makes it easier for me to work on specific spider functionality without
involving all of the other spider tasks.

To support this way of working, I introduced the concept of a spider Task. A spider task is a unit of work that a spider has to do;
it produces either items or further spider tasks.
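
As a purely illustrative sketch (none of the names below come from djangoscraper's actual API), you can picture a task-driven spider as a dispatcher that routes each task to a handler, where each handler yields either scraped items or follow-up tasks:

    # Hypothetical sketch of task dispatch; the handler names and the
    # dict-based task shape are assumptions for illustration only.
    def handle_category(task):
        # A category task discovers product pages and yields follow-up tasks.
        for url in task["start_urls"]:
            yield {"kind": "product", "start_urls": [url + "?page=detail"]}

    def handle_product(task):
        # A product task yields scraped items.
        for url in task["start_urls"]:
            yield {"url": url, "title": "..."}

    HANDLERS = {"category": handle_category, "product": handle_product}

    def run_task(task):
        # Route a task to its handler; results are items or further tasks.
        for result in HANDLERS[task["kind"]](task):
            yield result

Keeping each handler focused on one kind of task is what makes it possible to work on a specific piece of spider functionality in isolation.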

Tasks are stored in the Django database and can be manipulated through the Django admin interface. The admin lets you
add new tasks, view the status of existing tasks, and filter them.

Tasks have properties similar to scrapy Requests, except that they accept multiple URLs via the start_urls property.
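
For illustration only, here is a minimal sketch of what such a task model could look like; the field names below (name, start_urls, status) are assumptions and not djangoscraper's actual schema:

    # Illustrative sketch of a task model, not djangoscraper's real one.
    from django.db import models

    class ExampleTask(models.Model):
        STATUS_CHOICES = (
            ('pending', 'pending'),
            ('running', 'running'),
            ('done', 'done'),
        )
        name = models.CharField(max_length=255)
        # One URL per line, analogous to a spider's start_urls
        start_urls = models.TextField()
        status = models.CharField(max_length=16, choices=STATUS_CHOICES,
                                  default='pending')

        def url_list(self):
            return [u.strip() for u in self.start_urls.splitlines() if u.strip()]

Registering a model along these lines with admin.site.register is what gives you the add/view/filter behaviour described above.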

Installation

Django Scraper App functions like a standard Django application, so if your project already follows a Django code layout you
install djangoscraper as you would any other Django application. The sections below cover the two common starting points.

New Django Install

  1. Create project structure
    
            django-admin startproject example
            scrapy-ctl.py startproject scraper
            mv scraper/* example
            rm -R scraper
            
  2. Add ‘djangoscraper’ to INSTALLED_APPS in django’s settings.py
  3. Add ‘djangoscraper.commands’ to COMMANDS_MODULE in scrapy’s settings.py
  4. To access scrapy from django, add the following code somewhere in django’s settings.py
    
            import os
            os.environ.setdefault('SCRAPYSETTINGS_MODULE', 'scraper.settings')
            
  5. To access django from scrapy, add the following code somewhere in scrapy’s settings.py (a combined sketch of both settings files follows this list)
    
            import os
            os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'settings')
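
For reference, here is a rough sketch of how both settings files could end up looking after the steps above; the exact contents of INSTALLED_APPS will vary with your project, and the module path scraper.settings follows the layout created in step 1:

    # settings.py (django), excerpt
    import os
    os.environ.setdefault('SCRAPYSETTINGS_MODULE', 'scraper.settings')

    INSTALLED_APPS = (
        'django.contrib.auth',
        'django.contrib.contenttypes',
        'django.contrib.sessions',
        'django.contrib.admin',
        # ...
        'djangoscraper',
    )

    # scraper/settings.py (scrapy), excerpt
    import os
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'settings')

    COMMANDS_MODULE = 'djangoscraper.commands'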
            

Add djangoscraper to an existing scrapy project

  1. Create django project
    
            django-admin startproject {django_project_name}
            
  2. Move the scraper project into the django project
    
            mv {scraper_project_dir}/* {django_project_name}
            
  3. Add ‘djangoscraper’ to INSTALLED_APPS in django’s settings.py
  4. Add ‘djangoscraper.commands’ to COMMANDS_MODULE in scrapy’s settings.py
  5. To access scrapy from django, add the following code somewhere in django’s settings.py
    
            import os
            os.environ.setdefault('SCRAPYSETTINGS_MODULE', '{scraper_project_dir}.settings')
            
  6. To access django from scrapy, add the following code somewhere in scrapy’s settings.py
    
            import os
            os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'settings')
            

Configuration

Creating a spider

Usage
