
huokedu/social_scraper

 
 


Social scraper

Retrieves user profiles from social networks simultaneously. Send spiders to the web and gather the social content therein!

Install

  • python setup.py install
  • install celery
  • install redis
  • edit social_scraper/settings.py and add your Facebook and Twitter auth tokens

Test

  • python run_tests.py

Run

  • start_scraper

The server listens on port 8080 by default.

Celery

Be sure to start a Celery worker before you start the scraper:

celery -A social_scraper.webapi.celery worker

Enjoy

curl -i http://localhost:8080/api/v0.1/users/twitter/sikorskiradek
curl -i http://localhost:8080/api/v0.1/users/facebook/barackobama

You can also access the user profile endpoints from a JavaScript client or a web browser.
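The same requests can be made from Python. The following is a minimal sketch using only the standard library; it assumes the server is running locally on the default port, and the helper names `profile_url` and `fetch_profile` are hypothetical, not part of this project. The response format is not documented here, so the body is returned as plain text.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Default address taken from the curl examples in this README.
BASE_URL = "http://localhost:8080/api/v0.1"


def profile_url(network, username, base_url=BASE_URL):
    """Build the profile endpoint URL for a network such as 'twitter'."""
    # quote() escapes characters that are not safe inside a URL path segment.
    return "{}/users/{}/{}".format(
        base_url, quote(network, safe=""), quote(username, safe="")
    )


def fetch_profile(network, username, timeout=30):
    """Send a GET request and return the raw response body as text."""
    with urlopen(profile_url(network, username), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires the scraper server and a Celery worker to be running.
    print(fetch_profile("twitter", "sikorskiradek"))
```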

To run a single spider on its own, type:

  • scrapy runspider twitter -A <username>
  • scrapy runspider facebook -A <username>

Deploy

Scrapyd lets you deploy spiders and start and stop them through a JSON web service.

  • pip install scrapyd
  • scrapyd-deploy -p social_scraper
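Once the project is deployed, a spider run can be scheduled through Scrapyd's `schedule.json` endpoint. The sketch below assumes Scrapyd is listening on its default address (`http://localhost:6800`) and that the spiders accept an argument named `username`; that argument name and the helper functions are assumptions made for illustration, not something this README confirms.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Scrapyd's default address; change it if your instance runs elsewhere.
SCRAPYD_URL = "http://localhost:6800"


def build_schedule_request(spider, username, project="social_scraper"):
    """Build a POST request for Scrapyd's schedule.json endpoint."""
    # Form fields other than 'project' and 'spider' are passed to the spider
    # as arguments. The name 'username' is an assumption about this project.
    form = {"project": project, "spider": spider, "username": username}
    data = urlencode(form).encode("ascii")
    return Request(SCRAPYD_URL + "/schedule.json", data=data, method="POST")


def schedule(spider, username):
    """Send the request and return Scrapyd's response body as text."""
    with urlopen(build_schedule_request(spider, username), timeout=30) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires a running Scrapyd instance with the project deployed.
    print(schedule("twitter", "sikorskiradek"))
```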

Architecture overview


Job requests (spider runs) are initiated from the web server through Celery and sent to the Scrapy ecosystem.

Scrapy is written with Twisted, a popular event-driven networking framework for Python, so it uses non-blocking (asynchronous) code for concurrency.
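To illustrate what non-blocking concurrency buys, here is a small sketch that uses the standard library's `asyncio` as a stand-in for Twisted's event loop. It is an analogy, not how this project or Scrapy is implemented: `fake_fetch` is a hypothetical placeholder that sleeps instead of making a real request.

```python
import asyncio
import time


async def fake_fetch(network, delay):
    """Pretend to fetch a profile; the sleep stands in for network latency."""
    # While this coroutine waits, the event loop is free to run the other one.
    await asyncio.sleep(delay)
    return network + " profile"


async def gather_profiles(delay):
    """Start both fetches together and wait until both have finished."""
    return await asyncio.gather(
        fake_fetch("twitter", delay),
        fake_fetch("facebook", delay),
    )


def run_demo(delay=0.3):
    """Return the fetched results and the wall-clock time they took."""
    started = time.monotonic()
    profiles = asyncio.run(gather_profiles(delay))
    return profiles, time.monotonic() - started


if __name__ == "__main__":
    profiles, elapsed = run_demo()
    print(profiles, "in", round(elapsed, 2), "seconds")
```

Because the two waits overlap, the total time is close to one delay rather than the sum of both, which is the property that lets a single process keep many requests in flight.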

Todo

  • Linkedin spider
