berenslab/ne-spectrum

This repository holds the code for the paper “Attraction-Repulsion Spectrum in Neighbor Embeddings” (https://www.jmlr.org/papers/v23/21-0055.html).

If you use the work herein, we’d appreciate the following citation:

@article{boehm2022attraction,
  author  = {Jan Niklas Böhm and Philipp Berens and Dmitry Kobak},
  title   = {Attraction-Repulsion Spectrum in Neighbor Embeddings},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {95},
  pages   = {1--32},
  url     = {http://jmlr.org/papers/v23/21-0055.html}
}

Structure/Installation

After all instructions in this section have been completed, the code can be installed via

git clone https://github.com/berenslab/ne-spectrum
cd ne-spectrum
pip install --user -r requirements.txt
python setup.py build
mv bh*.so jnb_msc/transformer/
pip install --user -e .

The commands above will probably fail to compile the Cython extensions. To fix that, you need to install/compile openTSNE manually (clone the repo and install it in the same way as above). This project has a build-time dependency on an openTSNE build artifact (the file quad_tree.pxd) that is not installed along with openTSNE by default.

After installing openTSNE this way, adapt the two lines in setup.py that point to the locally installed openTSNE folder so that the missing file can be found during the build.

Furthermore, you need a patched version of forceatlas2 from https://github.com/jnboehm/forceatlas2, where degree repulsion has been added to fa2. Install it as follows:

git clone https://github.com/jnboehm/forceatlas2
cd forceatlas2
rm fa2/fa2util.c
python setup.py build
pip install --user -e .

There is also a requirements.txt file to install the dependencies. The code has been run in a conda environment with python 3.8.

The preprocessing script for the treutlein dataset resides in static/.

Running the code

To create a figure, you can simply redo one of the files in media/. For example, after installing redo, you can run redo -j6 media/ar-spectrum.pdf. This makes sure that the data is present and up to date, and then generates the figure. The instructions are written in the file media/ar-spectrum.pdf.do. That file calls out to redo again (redo.redo_ifchange(datafiles + [plotter.labelname, plotter.rc]), l. 268 in media/ar-spectrum.pdf.do), which recurses until all dependencies have been satisfied and afterwards creates the figure. This particular .do file is written in Python, but .do files are language agnostic: the language is set by the shebang (#!) in the first line of the file.

To see which parameters have been set, look at the filenames generated by the script (that is, at what is supplied to jnb_msc.redo.redo_ifchange(...)). They show which parameters deviate from the defaults set in the class definition.

Code structure

The classes in the project are all derived from a single base class. It expects every subclass to implement four methods:

  1. get_datadeps()
  2. load()
  3. transform()
  4. save()

The first method lets you query the object for the files it needs; redo uses this to track the dependencies properly. The remaining methods should be more or less self-explanatory. It is of course also possible to use the algorithms manually. For that, the .data field needs to be populated with suitable data and, depending on the algorithm at hand, possibly the .init field as well.
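The four-method contract can be sketched roughly as follows. This is a minimal illustration, not the project's actual base class: the names Stage and Center, the .result field, and the JSON file layout are all assumptions made for the example, and plain Python lists stand in for the arrays used in the real code.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path


class Stage(ABC):
    """Hypothetical base class; the real one in jnb_msc differs in detail."""

    def __init__(self, path):
        self.path = Path(path)  # directory this stage writes into
        self.data = None        # input, set by load() or by hand
        self.result = None      # output, set by transform() (assumed field)

    @abstractmethod
    def get_datadeps(self):
        """Return the files this stage needs, so redo can track them."""

    @abstractmethod
    def load(self):
        """Read the dependencies into self.data."""

    @abstractmethod
    def transform(self):
        """Compute self.result from self.data."""

    @abstractmethod
    def save(self):
        """Write self.result to disk."""


class Center(Stage):
    """Toy stage: subtract the column means from a list of rows."""

    def get_datadeps(self):
        # Assumed layout: the input lives one directory above this stage.
        return [self.path.parent / "data.json"]

    def load(self):
        self.data = json.loads(self.get_datadeps()[0].read_text())

    def transform(self):
        n = len(self.data)
        means = [sum(col) / n for col in zip(*self.data)]
        self.result = [[x - m for x, m in zip(row, means)] for row in self.data]
        return self.result

    def save(self):
        self.path.mkdir(parents=True, exist_ok=True)
        (self.path / "data.json").write_text(json.dumps(self.result))


# Manual use without redo: populate .data directly and call transform().
stage = Center("unused")
stage.data = [[1.0, 2.0], [3.0, 6.0]]
centered = stage.transform()
```

Used this way, load() and save() are never called, which mirrors the manual route described above: only .data has to be filled in before transform() runs.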

There are four major types of classes:

  1. GenStage
  2. NDStage
  3. NNStage
  4. SimStage

GenStage is the root class for the classes that generate a dataset. This can mean simulating data or simply taking an existing dataset and putting it in the correct place (again, for redo and this project structure). NDStage takes an NxD matrix and reduces its dimensionality; PCA is one example. NNStage can take the same input as NDStage (but usually takes the output of, e.g., PCA) and turns it into an NxN affinity/adjacency matrix. This can then, in turn, be fed into the last type, SimStage. These classes take both an NxN matrix and an NxD (D=2) array that serves as the initial layout.
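To make the data flow between the four types concrete, here is a toy chain that only tracks the matrix shapes. The function names are hypothetical and none of this is the project's code: the dimensionality reduction simply keeps the first columns instead of running PCA, and the layout step uses attraction along edges only, which is not one of the algorithms from the paper.

```python
import math
import random


def generate(n=6, d=4, seed=0):
    """GenStage role: produce an N x D dataset (here: Gaussian noise)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]


def reduce_dims(X, k=3):
    """NDStage role: N x D -> N x k. Keeps the first k columns; NOT PCA."""
    return [row[:k] for row in X]


def knn_adjacency(X, k=2):
    """NNStage role: N x D -> symmetric N x N kNN adjacency matrix."""
    n = len(X)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted((math.dist(X[i], X[j]), j) for j in range(n) if j != i)
        for _, j in nearest[:k]:
            A[i][j] = A[j][i] = 1
    return A


def layout(A, init, steps=20, lr=0.1):
    """SimStage role: N x N matrix plus N x 2 initial layout -> N x 2 embedding.

    Toy update with attraction along edges only, no repulsion.
    """
    Y = [list(p) for p in init]
    for _ in range(steps):
        for i, row in enumerate(A):
            for j, connected in enumerate(row):
                if connected:
                    Y[i][0] += lr * (Y[j][0] - Y[i][0])
                    Y[i][1] += lr * (Y[j][1] - Y[i][1])
    return Y


X = generate()                 # 6 x 4
Z = reduce_dims(X)             # 6 x 3
A = knn_adjacency(Z)           # 6 x 6
init = [row[:2] for row in Z]  # 6 x 2 initial layout
Y = layout(A, init)            # 6 x 2 embedding
```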

There are further minor classes, for example simple classes that rescale the input to have a predefined std or maximum scale (code in jnb_msc/transformer/scale.py).

If anything is unclear, please let me know.

What are all those .do files?

This repository uses redo to essentially “cache” the computations that are carried out by the experiments. It works similarly to `make` in that it tries to determine which files have changed and which parts need to be rebuilt. I chose this approach so that I wouldn’t have to either recompute everything every time or manually change the code to either load a (possibly stale) file or recompute it and save it.
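The basic idea of a staleness check can be sketched with a make-style comparison of modification times. This is only an illustration of the concept: the helper is_stale is hypothetical, and redo's own bookkeeping of dependencies is more involved than a plain timestamp comparison.

```python
import os
import tempfile


def is_stale(target, deps):
    """Return True if target is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in deps)


with tempfile.TemporaryDirectory() as tmp:
    dep = os.path.join(tmp, "data.npy")
    target = os.path.join(tmp, "data.png")

    open(dep, "w").close()
    missing = is_stale(target, [dep])   # target does not exist yet

    open(target, "w").close()
    os.utime(dep, (1000, 1000))         # dependency older than target
    os.utime(target, (2000, 2000))
    fresh = is_stale(target, [dep])

    os.utime(dep, (3000, 3000))         # dependency changed after the target was built
    stale = is_stale(target, [dep])
```

Only in the first and last case would the target have to be rebuilt; in the middle case the cached file can be reused.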

For more information, the (rough) notes on the original design are here.

Unfortunately, the implementation I am using is written in Python 2 and hence needs to be installed separately. It is not strictly necessary to install it, but all the code that generates the figures uses it to check the presence (and staleness) of the files. Furthermore, the load() and save() functions are written with redo in mind.

For example, to get an image of t-SNE on MNIST, one could write in the root of the repository:

redo 'data/mnist/pca/affinity/stdscale;f:1e-4/tsne/data.png'

This will “generate” the dataset MNIST, then reduce it with PCA to 50 dimensions, the default here. Afterwards it will calculate the pairwise affinities from that. Then the std will be set to the value given and finally tsne will be run with the scaled dense NxD matrix and the NxN matrix for its affinities. After the optimization, the embedding (named data.npy) will be used to create a scatter plot, which will in turn be saved as data.png. This file can then be viewed.

The prefix data/ is not mandatory; it can be omitted or structured in any way. The other folder names are resolved to classes; their “effect” is shown in jnb_msc/util.py. Arguments are appended to a folder name after a semicolon as colon-separated name:value pairs, and further pairs are separated by semicolons. For example, stdscale;f:1e-4 calls stdscale with f=1e-4.
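How such a target path decomposes can be sketched with a small parser. The functions parse_component and parse_target are hypothetical and written only for illustration; the actual resolution of names to classes happens in jnb_msc/util.py and may differ. The sketch keeps all values as strings and does not treat the data/ prefix specially.

```python
def parse_component(component):
    """Split one folder name such as 'stdscale;f:1e-4' into a name and its arguments."""
    name, *pairs = component.split(";")
    params = {}
    for pair in pairs:
        key, value = pair.split(":", 1)
        params[key] = value  # kept as a string; type conversion is left out
    return name, params


def parse_target(target):
    """Split a target path into its (name, params) components and the final filename."""
    *components, filename = target.split("/")
    return [parse_component(c) for c in components], filename


stages, filename = parse_target(
    "data/mnist/pca/affinity/stdscale;f:1e-4/tsne/data.png"
)
```

Here stages lists the folder names from left to right, with stdscale carrying the argument f, and filename is data.png.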

prepped/

The folder prepped/ is used to dump all the files produced by the algorithms. There are two reasons for this. Firstly, it prevents clutter in the main directories. Secondly, this way the files can actually be tracked via redo, since it does not support multiple output files from one run. For more information on that, see also the documentation (the heading “Virtual targets, side effects, and multiple outputs”).
