This is a Python package for merging taxonomies to continue the releases in the OTT series of taxonomies. This package might end up just being a sub-package of peyotl, or die a young death.
The code is based on parts of the reference-taxonomy repository, which was written primarily by Jonathan A. Rees as a part of the NSF-funded Open Tree of Life project.
virtualenv is always recommended for experimental Python packages.
Taxalotl depends on:
- requests
- beautifulsoup4
- some utilities that are only on the taxalotl branch of peyotl
So, the easiest way to install right now is:
./install.sh
which will:
- create a virtualenv called venv,
- install prerequisites in it,
- install the correct version of peyotl using the "develop" command to pip (to make a symlink), and
- install the taxalotl package (also using the "develop" command).
The script ends with some comments about the actions you need to take to configure the package.
See peyotl docs for info about the config files of peyotl. These affect the logging message handling of Taxalotl.
The taxalotl-cli.py script provides the command-line interface, which is broken up into several commands.
The command-level documentation is below.
See the Tutorial.md for an overview of usage.
The syntax ${x} in the documentation below refers to the value of some variable (x in this case), which should be one of the configuration variables specified in the taxalotl.conf file.
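As a rough illustration of how the ${x} notation reads, the sketch below expands such placeholders with Python's standard string.Template class. The directory values and the expand helper are hypothetical and exist only for this example; they are not taken from Taxalotl's code or from any real configuration file.

```python
from string import Template

# Hypothetical settings; real values would come from the configuration file.
config = {
    "raw": "/data/taxalotl/raw",
    "normalized": "/data/taxalotl/normalized",
}


def expand(path_template, settings):
    """Replace each ${name} placeholder with the matching setting.

    Raises KeyError if the template names a variable that is not configured.
    """
    return Template(path_template).substitute(settings)


raw_ncbi = expand("${raw}/ncbi", config)
normalized_ncbi = expand("${normalized}/ncbi", config)
print(raw_ncbi)
print(normalized_ncbi)
```

So when the text below mentions ${raw}/ID, read it as the configured raw directory followed by the resource ID.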
taxalotl-cli.py status reports on the status of each "resource".
taxalotl-cli.py status ncbi reports just on the status of the ncbi resource.
taxalotl-cli.py download ID downloads the archive for the ID resource into the ${raw} directory if that archive is not present.
taxalotl-cli.py unpack ID unpacks the archive for the ID resource to ${raw}/ID. Downloads the archive if necessary.
taxalotl-cli.py normalize ID converts the unpacked raw archive for ID from ${raw}/ID into the OTT Interim Taxonomy format in ${normalized}/ID. Unpacks the raw archive if necessary.
An "extra" details.json file may also be written with more information about the normalization process. This information is "extra" in the sense that it was not emitted by the reference-taxonomy repo's version of the code.
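The download, unpack, and normalize commands form a chain in which each step triggers the previous one when its input is missing. The sketch below mimics that behaviour with placeholder files in a temporary directory. Every function here is a hypothetical stand-in, not Taxalotl's implementation; in particular, skipping normalization when the output directory already exists is an assumption of this example.

```python
import tempfile
from pathlib import Path


def download(res_id, raw_dir, log):
    """Stand-in for fetching an archive: writes a placeholder file once."""
    archive = raw_dir / f"{res_id}.archive"
    if not archive.exists():
        raw_dir.mkdir(parents=True, exist_ok=True)
        archive.write_text("placeholder archive")
        log.append(("download", res_id))
    return archive


def unpack(res_id, raw_dir, log):
    """Stand-in for unpacking into raw_dir/res_id; downloads first if needed."""
    target = raw_dir / res_id
    if not target.exists():
        download(res_id, raw_dir, log)
        target.mkdir(parents=True)
        log.append(("unpack", res_id))
    return target


def normalize(res_id, raw_dir, normalized_dir, log):
    """Stand-in for normalizing; unpacks first if needed.

    Assumption: an existing output directory means the work is already done.
    """
    target = normalized_dir / res_id
    if not target.exists():
        unpack(res_id, raw_dir, log)
        target.mkdir(parents=True)
        log.append(("normalize", res_id))
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    log = []
    normalize("ncbi", root / "raw", root / "normalized", log)
    # A second call finds everything in place and records no new steps.
    normalize("ncbi", root / "raw", root / "normalized", log)
    steps = [name for name, _ in log]

print(steps)
```

Asking only for the last step is enough in this sketch: the earlier steps run once, in order, and are not repeated.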
The resources directory should hold descriptions of the taxonomies that are sources of information. The file .merged.json in that directory is automatically generated as the union of the fields in all of the other files in the directory, so you should not edit that file by hand.
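One way to picture how a file like .merged.json could be produced is sketched below. It assumes that each JSON file is an object mapping a resource ID to a dictionary of fields, and that "union" means combining the fields found for each ID; both are assumptions of this example, as are the merge_resource_files helper and the example_field key.

```python
import json
import tempfile
from pathlib import Path


def merge_resource_files(directory):
    """Union the fields found for each resource id across all *.json files."""
    merged = {}
    for path in sorted(Path(directory).glob("*.json")):
        if path.name == ".merged.json":
            continue  # never feed the generated file back into itself
        with path.open(encoding="utf-8") as handle:
            content = json.load(handle)
        for res_id, fields in content.items():
            merged.setdefault(res_id, {}).update(fields)
    return merged


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Two hand-edited files that both describe the "ncbi" resource.
    (folder / "resources.json").write_text(
        json.dumps({"ncbi": {"resource_type": "external taxonomy"}})
    )
    (folder / "ncbi.json").write_text(
        json.dumps({"ncbi": {"example_field": "example value"}})
    )
    merged = merge_resource_files(folder)
    (folder / ".merged.json").write_text(json.dumps(merged, indent=2))
    # Running the merge again ignores the generated file.
    merged_again = merge_resource_files(folder)

print(merged)
```

Because the generated file is rebuilt from the other files, manual changes to it would be overwritten on the next merge.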
Most info should probably just go in resources.json, but you can also put info in a file with the name <resource-id>.json to make the resources list more manageable.
Tags used to describe the resources are still in flux. Every resource should have either:
- a resource_type property (with value "external taxonomy", "open tree taxonomy idlist", or "open tree taxonomy"), or
- an inherits_from property with the value corresponding to the ID of another resource.
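The rule above can be checked mechanically. The following sketch is a hypothetical validator, not part of Taxalotl; it assumes the resource descriptions are available as a dictionary mapping each resource ID to its properties, and the "ncbi-snapshot" and "broken" entries are made up for the example.

```python
# The three resource_type values listed in the text above.
RESOURCE_TYPES = {
    "external taxonomy",
    "open tree taxonomy idlist",
    "open tree taxonomy",
}


def check_resource(res_id, fields, all_ids):
    """Return a list of problems found in one resource description."""
    problems = []
    res_type = fields.get("resource_type")
    parent = fields.get("inherits_from")
    if res_type is None and parent is None:
        problems.append(f"{res_id}: needs resource_type or inherits_from")
    if res_type is not None and res_type not in RESOURCE_TYPES:
        problems.append(f"{res_id}: unknown resource_type {res_type!r}")
    if parent is not None and parent not in all_ids:
        problems.append(f"{res_id}: inherits_from unknown resource {parent!r}")
    return problems


# Hypothetical descriptions used only to exercise the check.
resources = {
    "ncbi": {"resource_type": "external taxonomy"},
    "ncbi-snapshot": {"inherits_from": "ncbi"},
    "broken": {},
}
problems = [
    problem
    for res_id, fields in resources.items()
    for problem in check_resource(res_id, fields, set(resources))
]
print(problems)
```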
So, there is a resource with id="ncbi" that is intended to hold info that applies to any version of the NCBI Taxonomy. Specific dated snapshots of NCBI inherit from either that "base" resource or the previous snapshot. This will create a directed linear graph.
If you use the base id (ncbi) in a command that operates on a concrete resource, taxalotl will assume that you mean the latest version of that resource.
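Resolving a base id to its latest version can be thought of as walking the inherits_from links to the end of the chain. The sketch below assumes a strictly linear chain in which each snapshot inherits from the one before it; the snapshot ids and the latest_version function are hypothetical, and Taxalotl's own resolution logic may differ.

```python
def latest_version(base_id, resources):
    """Follow inherits_from links away from base_id to the end of the chain."""
    child_of = {}
    for res_id, fields in resources.items():
        parent = fields.get("inherits_from")
        if parent is None:
            continue
        if parent in child_of:
            # Assumption of this sketch: every resource has at most one child.
            raise ValueError(f"{parent} has more than one child")
        child_of[parent] = res_id

    current = base_id
    visited = {current}
    while current in child_of:
        current = child_of[current]
        if current in visited:
            raise ValueError("cycle in inherits_from links")
        visited.add(current)
    return current


# Hypothetical ids: a base resource followed by two snapshots.
resources = {
    "ncbi": {"resource_type": "external taxonomy"},
    "ncbi-snapshot-1": {"inherits_from": "ncbi"},
    "ncbi-snapshot-2": {"inherits_from": "ncbi-snapshot-1"},
}
latest = latest_version("ncbi", resources)
print(latest)
```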
If you source completion.sh in your bash session and you have the top-level directory on your PATH, then you'll have some cool tab completion of commands and options.
The package is named after Ambystoma mexicanum ... and the fact that there are a lot o' taxa... and the whole Open Tree of Life thing...
It might become Taxolotl instead of Taxalotl.