dsdinter/spookystuff

The latest documentation has moved to:

http://tribbloid.github.io/spookystuff/

SpookyStuff

Join the chat at https://gitter.im/tribbloid/spookystuff

... is a scalable query engine for web scraping, data mashup, and acceptance QA. The goal is to allow the Web to be queried and ETL'ed like a relational database.

SpookyStuff is the fastest big data collection engine in history, with a recorded speed of 330,404 dynamic pages queried per hour on 300 cores.
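
To put the quoted figure in per-core terms, the arithmetic can be sketched as below. This is only a back-of-the-envelope calculation, not part of SpookyStuff: the function name `per_core_rates` is hypothetical, and it assumes the load was spread evenly across all 300 cores, which the README does not state.

```python
# Figures quoted in the README; everything else here is illustrative.
PAGES_PER_HOUR = 330_404
CORES = 300


def per_core_rates(pages_per_hour: int, cores: int) -> dict:
    """Break an aggregate hourly page count into per-second and per-core rates.

    Assumes an even distribution of work across cores (an assumption,
    not something the benchmark description confirms).
    """
    pages_per_second = pages_per_hour / 3600
    pages_per_core_hour = pages_per_hour / cores
    seconds_per_page_per_core = 3600 / pages_per_core_hour
    return {
        "pages_per_second": pages_per_second,
        "pages_per_core_hour": pages_per_core_hour,
        "seconds_per_page_per_core": seconds_per_page_per_core,
    }


rates = per_core_rates(PAGES_PER_HOUR, CORES)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Under that assumption, the cluster handles roughly 92 pages per second overall, and each core completes one dynamic page about every 3.3 seconds.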

Powered by

  • Apache Spark
  • Selenium
  • JSoup
  • Apache Tika
  • (build) Apache Maven
  • (browser integration) PhantomJS/GhostDriver
  • (drone integration) MAVLink

The current implementation is influenced by Spark SQL and Mahout Sparkbinding.

License

Copyright © 2014 by Peng Cheng @tribbloid, Sandeep Singh @techaddict, Terry Lin @ithinkicancode, Long Yao @l2yao and contributors.

Published under the ASF License; see LICENSE.