Introduction
============

This is an implementation of a possible snapshot.debian.org service.
It is not yet finished; it is more of a prototype/proof of concept to
show and learn what we want and can provide.  So far it seems to
actually work.

The goal of snapshot.d.o is to collect a history of how our archives
(the main debian.org archive, security, volatile, etc.) looked at any
point in time.  It should allow users to, say, get the vim package that
was in testing on the 13th of January 2009.

Storage
-------

The way snapshot currently stores data is actually quite simple:
 - The actual files' content is stored in the "farm".  Here each file
   is stored under a name derived from its content: its sha1 hashsum.
   For reasons that weasel is no longer sure still apply nowadays, the
   filesystem tree that houses those files is itself laid out by hash
   prefix (see the sketch at the end of this section).

   A typical farm looks like this:
	.
	./00
	./00/00
	./00/00/00000dd4ba37c55ce75c8cad71f03e85ee09a370
	./00/03
	./00/03/000354b82181fbb7de26f2060bda3d96bbff28d5
	./00/03/00037ab88f67808c40f741bdc678254ce560df24
	./00/06
	./00/06/000648707e514afd7b7d4912f4062c13062012ba
  (not very well filled yet)


 - The rest is stored in a postgresql database.  This database has a
   concept of an _archive_ ("debian", "debian-security", "backports.org",
   etc.).  Each archive has any number of _mirrorrun_ entries.  These
   correspond to one import of that mirror into the database.

   An actual represented filesystem tree consists of _node_s.  Each node
   corresponds to an item in an imported archive tree.
   A node is either a _directory_, a regular _file_, or a _symlink_. No
   other filetypes are currently supported - and we probably don't need
   much else.

   A directory node stores the full path from the archive root.  A
   regular file knows its original (base)name, its size, and its sha1
   hashsum (so we can find it in the farm).  And a symlink knows its
   original (base)name and its target (i.e. what it points to).

   Each of those nodes has a reference to its parent directory in
   order to reconstruct the original filesystem structure.  A node
   also contains pointers to the first and the last mirrorrun it appeared
   in.  Some files, like Packages.gz files, will probably only exist
   for a single mirrorrun while other files like actual packages or
   the root directory (/) exist for many, or even all of the mirrorruns
   for a given archive.
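
   A minimal Python sketch of the farm path scheme shown in the
   listing above (the function name and farm root are made up for
   illustration; the real implementation may differ):

       import hashlib
       import os

       def farm_path(farm_root, content):
           # Files are named after the sha1 of their content and
           # sharded into two levels of two-hex-digit directories,
           # as in the example farm tree above.
           digest = hashlib.sha1(content).hexdigest()
           return os.path.join(farm_root, digest[0:2], digest[2:4], digest)

       # farm_path("/srv/farm", b"") ->
       #   "/srv/farm/da/39/da39a3ee5e6b4b0d3255bfef95601890afd80709"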


Access
------

Currently - apart from directly querying the SQL database - there's a
Python implementation of a snapshotfs FUSE file system.  It's not really
built to scale well, but for now it's one way to access the data.

There's also a pylons web application; find it in ./web/app.
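
For example, querying the database directly for the farm hash of a file
as it existed at a given date could look like this (a sketch using
python-psycopg2; the table and column names are taken from the query in
the Bugs section below, while the connection string is an assumption):

    import psycopg2

    # Hypothetical connection settings; adjust to your setup.
    conn = psycopg2.connect("dbname=snapshot user=snapshot")
    cur = conn.cursor()

    # Find the farm hash of a file under a known parent directory
    # as the archive looked at the given timestamp.
    cur.execute("""
        SELECT hash
          FROM file JOIN node_with_ts ON file.node_id = node_with_ts.node_id
         WHERE first_run <= %(ts)s
           AND last_run  >= %(ts)s
           AND name = %(name)s
           AND parent = %(parent)s
    """, {'ts': '2009-09-15 21:32:24',
          'name': 'openssl_0.9.8g.orig.tar.gz',
          'parent': 320})
    print(cur.fetchone())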

Indexing
--------

The base of the snapshot implementation is pretty agnostic as to what
kind of data we take snapshots of.  It could just as well be your home
directory or the bugs database.

On top of that, there is a concept of packages.  To be more precise
there are source packages, each with a version, and there are binary
packages with versions that belong to a source package.

Source packages have files associated with them (by file hash), and
so do binary packages.  In the relation for binary packages we also
keep track of which architecture each binary package is for.

This _indexer_ works either by parsing the /indices/package-file.map.bz2
file in an FTP tree or, if that doesn't exist, by looking at each .dsc
and each .deb/.udeb file.
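
For the fallback case, extracting a source package's file list from a
.dsc file takes only a few lines of Python (a rough sketch of the
RFC822-style field parsing; real code, e.g. using python-debian, would
handle more edge cases):

    import re

    def dsc_sha1_files(dsc_text):
        # Collect (sha1, size, filename) tuples from the
        # Checksums-Sha1 field of a .dsc file.
        files = []
        in_field = False
        for line in dsc_text.splitlines():
            if line.startswith('Checksums-Sha1:'):
                in_field = True
                continue
            if in_field:
                m = re.match(r'^ ([0-9a-f]{40}) (\d+) (\S+)$', line)
                if m:
                    files.append((m.group(1), int(m.group(2)), m.group(3)))
                else:
                    in_field = False   # continuation lines ended
        return files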


TODO
====

 o We might need to handle the case where packages are removed from
   the archive because we aren't allowed to redistribute them.  Then we
   could set all files belonging to a package to
   locked/unreadable/whatever.

 - Another thing would be to teach the system about releases so that one
   can get "sarge r2" out of it.  Again, this should be possible to
   build on top of what is already there.

 o A mirrorrun import should write out a "transaction log"-style flat
   file that lists all the nodes added in this mirrorrun, plus data
   identifying the mirrorrun (id, date, archive).  This can then be used
   to inject this run's metadata into a mirror.

 o When adding files to the farm, copy them to a .temp-hash filename,
   and only link them in place once they are there (a sketch follows
   after this list).

 . farm fsck:
   o check for all the files whether their hashsum matches their name.
   - check if all of them are referenced in the database
     (new ones might not be there yet, instead being added in that
      transaction)
   o find list of hashes referenced in the DB but not in the farm.
   - integrate parts into one coherent tool

 o Write a nice web frontend.
   o handle symlinks (in the database stat() function)
   o make breadcrumbs useful links
   o packages
   o handle XSS
     x either verify that mako escapes all the input,
     x check if it does in 0.9.7 (there is some filter config option)
     x or add some layer that copies stuff to the template context doing that
     * pylons 0.9.7's mako config automatically escapes any special
       chars in template variables.
   o handle 'bobby tables' as datestring in urls

 o Write some scripts to help with setting up a mirror.  This should be easy
   given that our farm is easy to rsync and we (will) have a way to inject
   mirrorruns without having to dump/reload the entire DB.
   o keep a journal of recently added farm entries and use this for farm
     syncing.  This will make syncing much faster than rsyncing the whole
     tree, which still can (and should) be done regularly as a safety net.

 - snapshotfs (the fuse filesystem) doesn't handle the case where
   a source package has more than one version of a binary package of a given
   name (such as with binNMUs).  e.g. fuse/packages/g/gtk2-engines/1:2.16.1-2

 - Come up with a testsuite that checks at least the most basic things still
   work, such as importing a tree.

 [ o .. done; - to do; . partially done ]
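
The .temp-hash idea from the farm TODO above could look roughly like
this (a sketch under the assumptions of the farm layout shown earlier,
not the actual implementation; error handling is minimal):

    import hashlib
    import os
    import shutil

    def add_to_farm(farm_root, source_path):
        # Hash the file first so we know its final farm name.
        sha1 = hashlib.sha1()
        with open(source_path, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                sha1.update(chunk)
        digest = sha1.hexdigest()

        target_dir = os.path.join(farm_root, digest[0:2], digest[2:4])
        target = os.path.join(target_dir, digest)
        if os.path.exists(target):
            return target               # content already in the farm

        os.makedirs(target_dir, exist_ok=True)
        # Copy to a temporary name first, then move it into place,
        # so readers never see a half-written farm entry.
        tmp = os.path.join(target_dir, '.temp-' + digest)
        shutil.copyfile(source_path, tmp)
        os.rename(tmp, target)
        return target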


Installation
============

Dependencies
------------

Depends: ruby libdbd-pg-ruby1.8 libbz2-ruby1.8 python-yaml python-psycopg2 python-lockfile fuse-utils python-fuse uuid-runtime
DB-Depends: postgresql-plperl-8.4 postgresql-8.4-debversion
fsck-Depends: python-dev gcc
FUSE-Depends: python-fuse
Web-Depends: python-pylons (that is {python-pylons,python-routes,python-nose,python-paste,python-pastedeploy,python-pastescript,python-webob,python-weberror,python-beaker,python-mako,python-formencode,python-webhelpers,python-decorator,python-simplejson}/lenny-backports )
Web-Recommends: libapache2-mod-wsgi apache2
Apache config: required modules: expires headers

To index binary and source packages, we currently need https://github.com/brianmario/bzip2-ruby.

Basic setup
-----------

# as the PostgreSQL superuser::

  createuser -DRS snapshot
  dropdb snapshot                 # Drop the one played with before
  createdb -O snapshot snapshot   # Create the DB
  psql -f db/db-init.sql snapshot # Initialize DB (Access rights and languages)

# as snapshot user::
  ARCHIVE_NAME=<ARCHIVE> ARCHIVE_PATH=<PATH TO ARCHIVE>
  cd db
  psql -f db-create.sql snapshot
  ./upgrade ../snapshot.conf      # Apply necessary DB development "upgrades"
  cd ..
  ./snapshot -c snapshot.conf -v -a $ARCHIVE_NAME add-archive # Initialize the archive
  ./snapshot -s -c snapshot.conf -p $ARCHIVE_PATH -v -a $ARCHIVE_NAME import


Source Code
===========

snapshot
  The snapshotting script
db/
  SQL and Python scripts to initialize and upgrade the DB backend
etc/
  Configuration files for Apache + cronjob configuration files
  for the master and mirror repositories
fsck/
  Tools to verify the integrity of the system
fuse/
  FUSE module
lib/
  Base components used by various scripts
master/
  Tools to be used with the master repository (e.g. import-run)
mirror/
  Tools to support operation of the mirrors
web/
  Pylons Web frontend

Bugs
====
    node_with_ts is horribly slow:

        SELECT hash
          FROM file JOIN node_with_ts ON file.node_id = node_with_ts.node_id
         WHERE first_run <= '2009-09-15 21:32:24'
           AND last_run  >= '2009-09-15 21:32:24'
           AND name = 'openssl_0.9.8g.orig.tar.gz'
           AND parent = 320

    takes about 10 times as long as the equivalent:

        SELECT hash
          FROM file JOIN node ON file.node_id = node.node_id
         WHERE (SELECT run FROM mirrorrun WHERE mirrorrun_id = first) <= '2009-09-15 21:32:24'
           AND (SELECT run FROM mirrorrun WHERE mirrorrun_id = last)  >= '2009-09-15 21:32:24'
           AND name = 'openssl_0.9.8g.orig.tar.gz'
           AND parent = 320
