mback2k/ArchiveBot

 
 

1. ArchiveBot

    <SketchCow> Coders, I have a question.
    <SketchCow> Or, a request, etc.
    <SketchCow> I spent some time with xmc discussing something we could
                do to make things easier around here.
    <SketchCow> What we came up with is a trigger for a bot, which can
                be triggered by people with ops.
    <SketchCow> You tell it a website. It crawls it. WARC. Uploads it to
                archive.org. Boom.
    <SketchCow> I can supply machine as needed.
    <SketchCow> Obviously there's some sanitation issues, and it is root
                all the way down or nothing.
    <SketchCow> I think that would help a lot for smaller sites
    <SketchCow> Sites where it's 100 pages or 1000 pages even, pretty
                simple.
    <SketchCow> And just being able to go "bot, get a sanity dump"

2. More info

For the user's guide, read the COMMANDS file.
For a half-assed installation and operation guide, read INSTALL.
For a polished installation guide, submit a pull request.

3. License

Copyright 2013 David Yip; made available under the MIT license.  See
LICENSE for details.

4. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting to
GNU Wget.  Wget+lua was the first web crawler used by ArchiveBot.

Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web
crawler.

Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and
tracking down performance problems at scale.

Other thanks go to the following projects:

* Celluloid <http://celluloid.io/>
* Cinch <https://github.com/cinchrb/cinch/>
* CouchDB <http://couchdb.apache.org/>
* Ember.js <http://emberjs.com/>
* Redis <http://redis.io/>
* Seesaw <https://github.com/ArchiveTeam/seesaw-kit>

5. Special thanks

Dragonette, Barnaby Bright, Vienna Teng, NONONO.

The memory hole of the Web has gone too far.
Don't look down, never look away; ArchiveBot's like the wind.

 vim:ts=2:sw=2:tw=72:et

ArchiveBot, an IRC bot for archiving websites
