This package contains all the code for Moritz Klammler's master's thesis Aesthetic Value of Graph Layouts: Investigation of Statistical Syndromes for Automatic Quantification and the subsequent paper Aesthetic Discrimination of Graph Layouts by Moritz Klammler, Tamara Mchedlidze and Alexey Pak. It is self-contained: it can reproduce all experiments from scratch and even typeset the written thesis, presentation slides and conference paper, automatically incorporating the most recent results.
You may cite this work (the source code repository) like so:
Klammler, M. et al.: Source code for aesthetic discrimination of graph layouts, https://github.com/5gon12eder/msc-graphstudy/
The following BibTeX entry may come in handy:
@Misc{GitHubRepo,
  author = "Klammler, Moritz and others",
  title  = "Source Code for Aesthetic Discrimination of Graph Layouts",
  url    = "https://github.com/5gon12eder/msc-graphstudy/",
}
Here is a list of publications about this work:
- Klammler, M.: Aesthetic value of graph layouts: Investigation of statistical syndromes for automatic quantification. Master's thesis, Karlsruhe Institute of Technology (2018), http://klammler.eu/msc/
- Klammler, M., Mchedlidze, T., Pak, A.: Aesthetic Discrimination of Graph Layouts. 2018; http://arxiv.org/abs/1809.01017
The second item is about to appear in the Proceedings of the 26th International Symposium on Graph Drawing, Barcelona, Spain, 2018.
Quick Help: All executables in this project (including, in particular, all utility scripts mentioned in this `README` document) support a `--help` option that will provide you with a quick usage summary. This document does not necessarily repeat all the information from that help message, so please check it out.
Convention for Paths: When this document refers to a file in the repository, paths starting with a “`.`” are understood to be relative to the root of the version-controlled directory tree. Paths ending with a slash refer to directories. Paths starting with a “`/`” are absolute paths on the host system. For paths that are just a simple file name, it will be understood from the context to what directory (if any) they are relative. Globbing expressions (e.g. `example-*.c`) will be used to refer to a set of (zero or more) files.
Convention for Placeholders: This document uses “shell expansion” syntax for placeholders. For example, we might say “The object identified by `${id}` will be accessible through the URL `http://localhost:${port}/objects/${id}/`” where `${id}` was just introduced and you're assumed to understand from the context that `${port}` refers to whatever port the web server is listening on.
Convention for Interactive Shell Sessions: This document shows some illustrative examples of shell interaction. In those, lines starting with a “`$`” sign introduce commands a user (i.e. you) would enter. If a line is too long to fit, it may be continued on the next line, with the previous line ending with a backslash character. Comments may be inserted using the usual syntax. Lines which are neither comments nor commands are output expected to be generated by the commands.
Here is an example showing a command, a comment and some example output:
$ date -R # RFC 5322 format
Wed, 16 Jul 1969 13:32:00 +0000
Here is another example showing line continuation and shortened command output.
$ wget --no-verbose -O - 'https://raw.githubusercontent.com/5gon12eder/msc-graphstudy/master/README.md' \
| grep -Eo '(http[s]|ftp)://[-a-zA-Z0-9@:%_+.~#?&//=]+' \
| sort -u
http://arxiv.org/abs/1809.01017
http://klammler.eu/msc/
https://github.com/5gon12eder/msc-graphstudy/
https://raw.githubusercontent.com/5gon12eder/msc-graphstudy/master/README.md
...
If an example shows no output, this does not necessarily imply that the shown command is not expected to produce any. It might also just be omitted because it is not relevant to the example. Output may also be shown in a shortened form.
The primary author of this work is Moritz Klammler, who wrote the code and some prose during the preparation of his master's thesis and subsequent employment. He owns the copyright for a large fraction of the work in this repository. Some utility programs included in the repository were written by Moritz Klammler in the past for different purposes. For the part of the work that (1) was written upon the request of his later employer and (2) is software, the Karlsruhe Institute of Technology, namely the Algorithmics Working Group 1 at the Institute of Theoretical Informatics (Postfach 6980, 76128 Karlsruhe, https://i11www.iti.kit.edu/), is the copyright holder. Since most files were written partially under employment and partially in free time, they have mixed copyright ownership. The copyright notice in the comment at the top of each file aims to describe the situation of the individual file as faithfully as possible. Finally, the paper submitted to GD'18 and included in this repository was co-authored by Tamara Mchedlidze and Alexey Pak together with Moritz Klammler, who collectively own the copyright of the prose files.
Unless mentioned otherwise, all files in this repository are provided under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the license, or (at your option) any later version. You can find the full text of the license in a file called `./COPYING_GPL3.txt` or `./LICENSE` as well as online.
Some source files were added to this repository for convenience but were not originally written for this project. They are usually provided under a more permissive free software license known as the “MIT” or “X11” license. If so, the comment at the top of the source file will mention that, including a full reproduction of the (very short) license text. The text of this license can also be found in the file `./COPYING_MIT.txt` as well as online.
A number of small auxiliary files are provided under an even less restrictive “all-permissive” license. The files to which this applies bear a comment at the top which says: “Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without any warranty.” That notice is the entire license.
The prose files (but none of the functional code) in the `./report/`, `./paper/` and `./slides_*` directories are licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, the text of which can be found in a file called `./COPYING_CC4-BY-NC-ND.txt` as well as online. The files to which this restrictive license applies all mention this in a comment at the top of the file.
Finally, this `README` document is published under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License” and can also be found in a file called `./COPYING_GFDL.txt` as well as online.
Tip: Running the script `./maintainer/copyright` with no arguments from the top-level source directory (of a Git clone) will print the license used for each file in the repository.
This project aims to be portable but has, so far, only been tested on Arch Linux and Parabola GNU/Linux systems. We will try to address portability issues on GNU/Linux systems once we are made aware of them. Portability to Microsoft Windows, Apple Macintosh or other proprietary and non-POSIX systems is desirable but not a priority this project will pursue at all costs. Where full portability is not possible, the code should degrade gracefully, with the core features remaining available.
In any case, this project relies on an up-to-date software environment. If your toolchain is outdated, the code will not work and this won't be “fixed” except by you updating your computer. We might code around some issues that affect the people working directly on this project but won't go to great lengths hacking around compiler bugs, missing library features or deficits in tools if we know that they are already fixed upstream. Recent versions of all relied-upon tools are readily available as Free Software.
Any accidental dependency on non-free software is considered a bug and will be fixed. Please report any such issues.
Building this project and running all experiments will consume a significant amount of computation and storage resources. You should set aside a few dozen gigabytes of disk storage and a couple of days of computing time when using the configuration found in the `./config-heavy/` directory (see below).
Since the experiments create hundreds of thousands of fairly small files, be sure to have a file system that can cope with that. (The files will be stored in a directory tree of adequate depth so you need not worry about file systems that cannot handle a huge number of files in a single directory.) It might also be a good idea to use a fast hard drive or SSD for this purpose.
Depending on the size of the graphs you wish to process, you might also need a significant amount of RAM. This should not be an issue for the provided configurations, which deliberately restrict the database to only medium-sized graphs. To give you an idea: the most demanding computations process n × n matrices of real numbers for a graph with n vertices. This means that for a graph with 50k nodes, the program will consume RAM in excess of 18 GiB. Since the algorithm has a complexity of O(n³), you probably also don't want to wait for it.
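As a sanity check on that figure, the footprint of one such matrix can be computed directly (assuming each real number is stored as an 8-byte double, which is consistent with the quoted number):

```shell
# One n-by-n matrix of 8-byte doubles for n = 50,000 vertices, in GiB:
python3 -c 'n = 50_000; print(round(n * n * 8 / 2**30, 1))'   # prints 18.6
```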
The driver will start a separate process for each graph / layout. (This is slow, but this project is research, not industry.) To prevent the graph-processing tools from consuming an inordinate amount of system resources while still allowing the driver which orchestrates them to run for extended periods of time, all command line tools in the `./src/` directory honor some environment variables that control their resource limits. They all have the form `MSC_LIMIT_${RES}` and will be explained later in this document.
The project uses a CMake build system. You need CMake 3.9.1 or newer in order to build anything. There are no install targets; everything is built locally because the artifacts are not really meant to be used stand-alone.
The heart of the graph algorithms is implemented in C++ using the OGDF library, which will be automatically downloaded and (after verifying the integrity of the downloaded file) built as part of building this project. It is therefore not considered an external dependency.
The project uses (at the time of this writing) fairly cutting-edge C++17 technology so a modern C++ compiler is required. GCC 8.1.0 was tested and proven to work.
Besides, the Boost C++ libraries are required in version 1.66 or newer. Apart from the header-only components, the following compiled Boost libraries are required: Filesystem, Iostreams, Program Options and System.
If you don't have a recent enough Boost version installed, you might find the script `./maintainer/get-boost` handy, which will download, build and install the required Boost libraries in a number of configurations for you. The integrity of the downloads is checked. The libraries will be installed locally for the current user, so no root privileges are required and existing system libraries are not messed with.
The framework is glued together with a non-trivial amount of Python code, referred to as the driver. It requires Python 3.6.2 or newer.
The following Python packages are required:
The following Python packages are optional:
The `keras` package will only function if you have TensorFlow installed, which has to be done via the usual procedures applicable to your operating system.
The web front-end and other presentation targets will only work if Gnuplot is installed.
Further optional dependencies are Doxygen to generate API documentation for the C++ code, and a TeX toolchain (in particular, `lualatex`, `biber` or `bibtex`, and `makeindex`) for typesetting the written thesis and presentation slides. The TeXLive 2018 distribution was tested and is known to work.
In order to typeset the slides, the KIT's beamer theme must be available. KIT members can obtain it from the KIT's intranet. Unfortunately, this service is not available to the general public. Please consult the section titled “Typesetting Documents” for a discussion of some workarounds.
The easiest way to obtain the software is to clone the GitHub repository.
$ git clone 'https://github.com/5gon12eder/msc-graphstudy.git' # clone into directory 'msc-graphstudy'
Alternatively (for example, if you don't have Git), you may use the option provided by GitHub to download the current `master` branch as a single ZIP archive. This won't give you any version control information and is therefore a smaller download.
$ wget 'https://github.com/5gon12eder/msc-graphstudy/archive/master.zip'
$ unzip master.zip # extract into directory 'msc-graphstudy-master'
Once all dependencies are available and the source code has been downloaded and (if necessary) unpacked, the project can be built using the usual CMake commands. In the simplest case, running
$ cmake .
$ cmake --build .
in the top-level directory should be sufficient. It might be a better idea (and is highly recommended for any serious hacking) to use out-of-tree builds, though. Once the project is configured and built, the tests can be run via
$ ctest -T Test
with additional CTest flags added as you see fit.
If you want to hack on the project, you might find it convenient to build different configurations, in which case the `./maintainer/configure` and `./maintainer/build` scripts could be useful (please consult their `--help` output).
During the build process, a number of files will be downloaded from the internet, so your computer will have to be connected to the internet while building. There is currently no support for specifying any proxy settings.
The following resources will be downloaded:
- The OGDF, which will be compiled and used in executed code.
- The LaTeX file `llncs.cls` and the BibTeX file `splncs04.bst` for Springer's “Lecture Notes in Computer Science”, which will be used for typesetting the GD'18 paper (only if the `paper` target is built).
- Various non-executable data such as example graphs or pictures.
For each downloaded file, the SHA256 checksum is verified and the file is only used if the checksums match. Apart from the OGDF (which is downloaded by CMake via its `ExternalProject_Add` feature), all downloads are performed by the script `./utils/download.py` (which you might find useful, too).
Download Cache: If the environment variable `MSC_CACHE_DIR` is set (to the absolute path of an existing directory) then the download script will use it as a download cache. If asked to download a file with expected `${algo}` hex-encoded checksum `${hash}`, it will first check for a file `${MSC_CACHE_DIR}/downloads/${algo}-${hash}.oct`. If that file exists and has the correct checksum, it will be used and no download will be attempted. Using a download cache is recommended in particular if you build multiple configurations next to each other, as it will reduce the number of downloads and allow for offline builds once the cache is populated.
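The cache layout described above can be sketched as follows. The `${algo}-${hash}.oct` naming scheme is taken from this document; whether the download script accepts a manually seeded entry like this is an assumption:

```shell
# Create a cache directory and seed it with a file we "downloaded" earlier.
export MSC_CACHE_DIR="$PWD/msc-cache"
mkdir -p "$MSC_CACHE_DIR/downloads"
printf 'hello\n' > example.dat
# The cache key is ${algo}-${hash}.oct, here with algo=sha256.
hash=$(sha256sum example.dat | cut -d' ' -f1)
cp example.dat "$MSC_CACHE_DIR/downloads/sha256-${hash}.oct"
```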
Download Trace: The download script also honors the environment variable `MSC_TRACE_DOWNLOADS`, which can be set to the absolute path of a file to which the script will append one line per (attempted) download. Each line in the file will be a JSON object holding the following keys.
- `url` (string) — URL for which a download was attempted
- `date` (string) — timestamp when the download was started (in RFC 5322 format)
- `time` (real) — elapsed time in seconds (only present if the download was successful)
- `size` (integer) — size of the downloaded file in bytes (only present if the download was successful)
- `error` (string) — informal error message (only present if the download failed)
- `digest-${algo}` (string) — computed `${algo}` checksum (e.g. `digest-sha256`) of the downloaded file as a hexadecimal string (only present if the download was successful)

Please note that the file as a whole is not a valid JSON document but can be transformed into one by adding a comma after each but the last line and wrapping its entire content between “`[`” and “`]`”.
This trace log is not used by the build system but might help you reason about the downloads that were performed.
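The transformation into a valid JSON document can be done with standard tools; the two trace lines below are made-up stand-ins for a real trace file:

```shell
# Fake two trace lines standing in for a real MSC_TRACE_DOWNLOADS file.
printf '%s\n' '{"url": "http://example.com/a", "size": 123}' \
              '{"url": "http://example.com/b", "error": "timeout"}' > trace.log
# Join the lines with commas and wrap the result in brackets.
{ printf '['; paste -sd, trace.log; printf ']\n'; } > trace.json
python3 -m json.tool trace.json   # verify that the result parses as JSON
```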
The default CMake target will only build the C++ tools. In order to run any experiments, you have to build the respective targets explicitly. The following targets might be useful.
- `deploy` — populates the database and runs the main experiment
- `httpd` — starts an HTTP server listening on port 8000 providing a web front-end to the database
- `eval` — runs cross-validation and other tests (also note the targets `eval-cross-valid`, `eval-puncture`, `eval-puncture-excl`, `eval-puncture-incl` and `eval-clean`)
- `integrity` — checks the database for inconsistencies
- `integrity-fix` — checks the database for inconsistencies and tries to fix them (you shouldn't run into this unless you go messing with the database by manually deleting / adding files or executing SQL statements)
- `apidoc` — builds Doxygen API reference documentation
- `benchmark` — runs some benchmarks (there are not many of them)
- `report` — typesets the written thesis in `./report/graphstudy.pdf` (also note the target `report-clean`)
- `slides-kit` — typesets the slides for the presentation given at the KIT on April 24, 2018 in `./slides_2018-04-24_kit/graphstudy.pdf` (also note the target `slides-kit-clean`)
- `slides-gd18` — typesets the slides for the presentation given at GD'18 on September 26, 2018 in `./slides_2018-09-26_gd18/graphstudy.pdf` (also note the target `slides-gd18-clean`)
- `paper` — typesets a preliminary version of the paper submitted to GD'18 in `./paper/graphstudy-gd18.pdf` as well as an extended version (as submitted to arXiv) in `./paper/graphstudy-arxiv.pdf` (also note the targets `paper-cache`, `paper-pubar` and `paper-clean`)
- `test` — exercises all tests
- `maintainer-everything` — builds all of the above targets except `httpd`

The `deploy` target will take a very long time to build (probably several days). The `eval` target will take a long time, too (probably several hours). Building the `eval` target will run the experiments again each time. If the default setting makes you nervous about whether the job is actually making any progress at all, you might want to increase the verbosity by setting the environment variable `MSC_LOG_LEVEL` to `INFO` (the default is `NOTICE`).
Before you go ahead building the `deploy` and `eval` targets, you might want to read the section about driver configuration first.
The `httpd` target starts a local web server listening on port 8000 that serves visualizations and other useful insights into the current database. (Check it out!) It won't show any results of the `eval` runs, though. These are only stored in various JSON files in the `./eval/` directory. The server process automatically forks to the background. In order to shut it down again, send it `SIGINT`. Its process ID will be written to a file `${bindir}/.httpd.pid` and also printed at startup. If you find a server still running and cannot figure out which process it is, visit `http://localhost:${port}/about/` and look for a line that says “Process-ID”. The default `${port}` is 8000. Once you know the process ID `${pid}`, shut the server down like so:
$ kill -s INT ${pid}
The `eval` target runs all available evaluation experiments. The `eval-cross-valid` target only runs the normal (full) cross-validation, while the `eval-puncture` target only runs the (reduced) cross-validation experiments for “punctured” feature vectors. The `eval-puncture-excl` and `eval-puncture-incl` targets are both a sub-set thereof, running said experiments only for the case of sole exclusion or inclusion of a single property, respectively. The `eval-clean` target deletes the results of any previous experiments. Beware that `eval` as well as all of the other mentioned `eval-*` targets will start by deleting all previous results as if `eval-clean` were built beforehand. Please note that in the case of the `eval-*` targets which only build a sub-set of the complete `eval` target, this means that previous results will be deleted but not recreated. The takeaway is that you should think carefully before building `eval` or any of the `eval-*` targets – even more so given that they will take a long time to complete.
If the environment variable `MSC_EVAL_PROGRESS_REPORT` is set (to an absolute file name), the current progress and estimated time remaining for the evaluation will be appended to it in a format suitable as input for Gnuplot.
The CMake targets described in the previous section provide access to the most high-level actions you might want to perform. For more explicit tasks, direct interaction with the software will be required. The following sections describe how to use the most important components in this project.
A Fair Word of Warning: This code is a research project and not industrial software! While we try to hold its quality up to good standards, the primary reason for this software to be written was so it can be studied. Please do not expect this project to be ready-to-use for any use other than experimenting. Given this, it is hard to draw a line between “user interface” and “implementation details” for this software. You can repeat the experiments for which results were published (and even recreate the respective documents) at the push of a button as described above but be prepared to get confronted with lots of details and internals as soon as you try digging any deeper. That said, you are explicitly invited to do so. This project was made public so others can study our work and ideally benefit from it.
This project uses the following terminology.
- graph — set of nodes and edges (graph theory)
- layout — mapping of vertices to two-dimensional coordinates for a given graph
- property — multi-set of real numbers computed for a given layout
- metric — single scalar number computed for a given layout
- discriminator — function that takes two layouts and outputs a number indicating an aesthetic preference
- fingerprint — fixed-length deterministic value computed for a graph or layout assumed to be practically unique
The source code frequently refers to discriminators as “tests” for historical reasons. This should be cleaned up over time.
This project consists of four major components:
- A collection of command line tools in the `./src/` directory. These tools are written in C++.
- A driver which can be used to populate a database with a collection of graphs and layouts, compute properties for them, and build and train the discriminator model as well as competing metrics. The driver can also run as a local HTTP server to provide insights into the current data. The driver is found in the `./driver/` directory and is written in Python with some XSLT, CSS and JavaScript technology for the web interface.
- The directory `./eval/` contains a set of experiments designed for evaluating the model. Those are implemented using a mix of Python and CMake scripting.
- The directories `./report/`, `./paper/` and `./slides_*` contain sources for various publications and presentations about the project. The build system utilizes the components mentioned above in order to obtain numbers and figures for those publications.
Conceptually, this project uses the following directories.
- The source directory is the root of the source code tree. We'll refer to this directory as `${srcdir}` or “`.`” in the following. The driver always assumes `${srcdir}` is the current working directory. In other words, you must always invoke the driver script from within the top-level source directory.
- The build directory is the root of the build tree. We'll refer to it as `${bindir}`. For an out-of-tree build, you may choose this directory freely. CMake will know about this choice, so there is no need to mention it when building the high-level targets described earlier. However, when invoking the driver script directly, you need to provide the location of this directory. (It accepts the `--bindir=${bindir}` option for that purpose.)
- The configuration directory is the directory where the driver ought to look for configuration files. We shall refer to it as `${configdir}`. The default for `${configdir}` is `${srcdir}/config/` but this may be overruled by setting the CMake variable `MSC_CONFIG_DIR`. (You either do this by passing the `-DMSC_CONFIG_DIR=${configdir}` option when you invoke `cmake` or by editing the configuration interactively using the `ccmake` tool.) The driver can be told about this directory via the `--configdir=${configdir}` option.
- The data directory is the root of the tree where the database files are stored. We'll refer to it as `${datadir}`. The default `${datadir}` is `${bindir}/data/` but this can be changed by setting the CMake variable `MSC_DATA_DIR` and communicated to the driver using the `--datadir=${datadir}` option.
- The cache directory is a directory where the driver stores non-essential data. It is the only directory that may be undefined. It has no default and its value is automatically picked up by the driver from the environment variable `MSC_CACHE_DIR`. You may delete that directory safely if it grows out of hand.
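Putting the options together, a direct driver invocation might look like the following sketch. The paths are hypothetical, the driver script name `./driver/main.py` is a guess (only the option names come from this document), so the command is merely echoed rather than executed:

```shell
# Hypothetical paths; only the --bindir/--configdir/--datadir option
# names are taken from this README.
bindir="$PWD/build"
configdir="$PWD/config"
datadir="$bindir/data"
# The driver must be invoked from ${srcdir}; the script name is assumed.
echo ./driver/main.py --bindir="$bindir" --configdir="$configdir" --datadir="$datadir"
```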
After building the project, the `src` directory will (in sub-directories) contain a number of command line tools that are subsequently invoked by the driver script but may also have merits on their own. The tools are organized by purpose in the following sub-directories.
- `generators` — These tools take “nothing” and output a graph. The directory contains probabilistic graph generators and a program to “import” graphs from a variety of formats.
- `layouts` — Layout algorithms; these tools take a graph and output a layout for it.
- `unitrans` — Unary layout transformations; these tools take a layout and output another layout (or multiple, if multiple rates are specified).
- `bitrans` — Binary layout transformations; these tools take two layouts and output another layout (or multiple, if multiple rates are specified).
- `properties` — These tools take a layout, compute some property of it and output some data.
- `metrics` — These tools take a layout, compute some metric for it and output some number.
- `visualizations` — These tools take a layout and output a drawing (an image file of some sort).
- `utility` — These tools do various things.
The sub-directory `common` contains no programs but C++ code that is shared by all command line tools. The static library is called `libcommon.a` when built. It contains a grab-bag of features needed for this project and is not intended for use by third-party code, although you might find it useful to take individual components out of this library and use them elsewhere (obeying the requirements imposed by the software license, of course).
All command line tools accept input files as positional arguments, reading from standard input if none is provided. If they produce any output, they write it to standard output or to the file specified via the `--output` option. All tools also have the ability to output “meta” information to a file specified via the `--meta` option (the default is to not output such information). Meta information will always be in JSON format and, unfortunately, you have to run the program and see what it outputs, as the structure is not documented. Despite its name, this information is often not at all “meta” but the most essential thing the tool produces. In essence, everything that will be processed by another tool or program is considered “output” and everything that is of interest to the driver script is considered “meta”. The driver script treats output files as opaque, managing but not interpreting them.
To give you an idea how the tools might be used, consider the following example
$ mosaic --symmetric --nodes=1000 --output=sample.xml.bz2
$ picture --output=sample.svg sample.xml.bz2
which creates a random symmetric “mosaic” graph with approximately 1k nodes and saves it as GraphML file `sample.xml.bz2` with `bzip2` (Burrows-Wheeler) compression applied. The file is then read again on the second line and a graphical rendition of the layout is saved as file `sample.svg`.
The shell pipeline
$ wget -q -O - 'ftp://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcspwr/bcspwr01.mtx.gz' \
| import --format=matrix-market --simplify --meta=2 STDIO:gzip \
| force --algorithm=fmmm \
| edge-length --kernel=boxed --output=histogram.txt
{ "nodes": 42, "edges": 49, "graph": "f549a2236f459c8c6ea7bb28a7884f31", "native": false, ... }
downloads (using the standard `wget` command line utility) a graph from NIST's “Matrix Market” as a `gzip` (Lempel-Ziv) compressed file, “simplifies” the graph (this operation makes edges undirected, deletes loops and fuses multiple edges into a single one), converts it to the preferred GraphML format, then computes a force-directed layout for the graph and finally analyzes the distribution of edge lengths in that layout, saving a histogram as text file `histogram.txt`. The command given the `--meta=2` option will print additional information to standard error output (selected by the POSIX file descriptor 2) in JSON format, which is partially shown in the above snippet after the command prompt. (It cannot be printed to standard output, which is already used for the pipeline, or it would be invisible and would clobber the graph data.) The histogram file could be plotted using a tool like `gnuplot`. The last program could (and probably should) also be instructed to output additional information like the mean or entropy in JSON format using the `--meta` option again, which was omitted in the example to avoid confusion. The value for the `graph` key in the shown JSON output is the fingerprint computed for the imported graph.
Please note that both of the above examples omit the directory part of the invoked programs for brevity.
A complete list of all tools is omitted here; please go look at the directories yourself. All of these tools accept a consistent set of command line options. Please run any tool with the `--help` option to see what options and arguments it accepts / requires.
Whenever a tool accepts a file name, it will also accept a decimal file descriptor. This is especially useful if you invoke a tool from another process and need more than one pipe to communicate all the data back and forth. Furthermore, the strings `NULL` and `STDIO` receive special treatment. The former will be understood as a request to perform no I/O at all (somewhat like using `/dev/null`) and the latter causes standard input or output to be used. The empty string has the same effect as `NULL` and the string “`-`” has the same effect as using the string `STDIO`. The reason the more verbose alternatives are provided as well (and are, in fact, recommended) is that the Boost Program Options library that is used for parsing the command line arguments will behave erratically in some cases when confronted with the empty string or the string “`-`”. It follows that if you should ever want to refer to a regular file by such a name, you will have to use a construct like `./NULL` or `./42` to circumvent this special treatment.
Apart from the remarks above, file names also support a special syntax to enable transparent compression. If the file name contains a colon, the portion after the last colon will be interpreted as a compression algorithm. For example, the “file name” `file.dat:gzip` refers to the file `file.dat`, which should be accessed using `gzip` compression. Beware that some operating systems use colons in regular file names. In that case, you must always append another colon at the end. For example, `A:graphs\mansion.dat` will cause confusion while `A:graphs\mansion.dat:` will work fine. Using the empty string otherwise has the same effect as not specifying any compression at all and will cause the compression to be inferred from the file name: if it ends in `.gz`, the file is assumed to be `gzip`-compressed, and if it ends in `.bz2`, `bzip2` compression is assumed. If the compression is specified explicitly, the strings `gzip`, `bzip2` and `none` are accepted and have the obvious meaning. If you like being verbose, the string `automatic` may also be used instead of the empty string to the same effect.
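The file-name convention above can be sketched in a few lines of Python (a hypothetical helper for illustration, not the project's actual parser):

```python
def split_compression(name):
    """Split a 'file.dat:gzip' style name into (path, algorithm).

    Sketch of the convention described above: the portion after the last
    colon names the compression; an empty or 'automatic' algorithm means
    it is inferred from the file extension.
    """
    if ":" in name:
        path, _, algo = name.rpartition(":")
    else:
        path, algo = name, ""
    if algo in ("", "automatic"):
        if path.endswith(".gz"):
            algo = "gzip"
        elif path.endswith(".bz2"):
            algo = "bzip2"
        else:
            algo = "none"
    return path, algo
```

Note how appending a trailing colon protects a path that legitimately contains colons.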
All command-line tools that may use non-determinism honor an environment variable `MSC_RANDOM_SEED` which, if set, acts as a deterministic seed for the pseudo-random number generator. It may be any sequence of bytes. If this variable is not set, the program will behave non-deterministically.

Warning: The `MSC_RANDOM_SEED` variable is currently only honored by the command-line tools but not by the driver. As a consequence, if you set it globally (such as by putting `MSC_RANDOM_SEED="f~rR9>Zh-1t'MxzVa<nb"` in your `~/.profile` file) and run the driver script, its graph generation process will livelock, invoking the same generator tool over and over again and rejecting its output because the graph is already in the database. Making the driver support a global random seed to make the whole process deterministic is an important but non-trivial open task. (It is not as simple as seeding a pseudo-random generator in the driver and using it to generate deterministic but different seeds for each tool invoked, because the driver may be interrupted, in which case the generator would have to pick up where it left off.)
Some environment variables are accepted by all command-line tools and control their resource limits. They all have the form `MSC_LIMIT_${RES}` where `${RES}` is one of the constants defined by the POSIX rlimit interface (spelled in all upper-case); consult the manual page for the `getrlimit` and `setrlimit` system calls for the complete list. For example, running a command-line tool with `MSC_LIMIT_STACK=33554432` will cause the process to use a maximum stack size of 32 MiB, unless a lower hard limit is already imposed, in which case that limit takes effect. The special value `NONE` is interpreted as a request to clear any soft limit that might be in effect for the resource (using resources up to the active hard limit, if any). This feature is only available on POSIX platforms. On other systems, those environment variables cannot be honored and setting them will cause an error.
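The behavior can be sketched in Python with the standard `resource` module (an illustrative approximation, not the tools' actual implementation):

```python
import resource

def apply_limits(environ):
    """Apply MSC_LIMIT_* variables via setrlimit (sketch of the
    behavior described above, not the tools' actual code)."""
    for key, value in environ.items():
        if not key.startswith("MSC_LIMIT_"):
            continue
        res = getattr(resource, "RLIMIT_" + key[len("MSC_LIMIT_"):], None)
        if res is None:
            continue  # not a resource known on this platform
        soft, hard = resource.getrlimit(res)
        if value == "NONE":
            soft = hard  # clear the soft limit up to the active hard limit
        elif hard == resource.RLIM_INFINITY:
            soft = int(value)
        else:
            soft = min(int(value), hard)  # a lower hard limit wins
        resource.setrlimit(res, (soft, hard))
```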
The `phantom` tool is also sensitive to the environment variable `MSC_DUMP_PHANTOM` which, when set, is interpreted as the name of a file into which to dump the “phantom” graph.
The `princomp` tool uses the value of the environment variable `MSC_PRINCOMP_ORTHO_TOL` – which should be a small positive floating-point number ε > 0 – to decide whether its results should be discarded because p1 ⋅ p2 > ε, where p1 and p2 are the determined first and second principal axes respectively. If the environment variable is not set, the default value ε = 2⁻¹⁰ is used.
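As a sketch, the check amounts to the following (assuming p1 and p2 are unit vectors; illustrative code, not the `princomp` tool's implementation):

```python
EPSILON = 2.0 ** -10  # default for MSC_PRINCOMP_ORTHO_TOL

def axes_acceptable(p1, p2, eps=EPSILON):
    """Accept the result only if the (unit) principal axes are
    sufficiently orthogonal, i.e. |p1 . p2| <= eps."""
    dot = abs(sum(a * b for a, b in zip(p1, p2)))
    return dot <= eps
```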
Finally, all command-line tools honor the `COLUMNS` environment variable in order to determine the width of the `--help` output in case a syscall to determine the terminal width is not available or does not succeed. This variable should be set to a positive integer (your shell might do this automatically).
So far, we have been talking about “the driver script”. This was a lie: the driver is not a single script but a Python package with plenty of modules, some of which are executable. In order to invoke a certain driver module `${module}`, you would execute the command

    $ python3 -m driver.${module} ...

from within `${srcdir}`, passing any options and arguments the module expects instead of the ellipsis shown above.
The following driver modules are available:

- `deploy` — populates the database
- `httpd` — runs an HTTP server for the web-interface
- `compare` — queries the discriminator model about pairs of layouts
- `doctests` — runs Python doctests for the driver
- `archidx` — prints statistics about graph archives
- `integrity` — checks the integrity of the database and can try to fix problems
- `model` — allows access to internals useful for preparing documents
Warning: The `integrity` module is not very well tested and might have fatal bugs. Be sure to have a backup of your (already corrupted) data before you screw it up completely.

As with everything in this project, the driver modules all support a `--help` option which will cause them to print a short help text and then exit immediately. Please make use of it to get detailed information about the arguments and options a driver module expects.
The following options are supported by all driver modules:

- `-B`, `--bindir=${bindir}` — root of the build directory where executables are found (default: `.`)
- `-C`, `--configdir=${configdir}` — search for configuration files in `${configdir}` (default: `config`)
- `-D`, `--datadir=${datadir}` — root of the data directory (can be created, default: `data`)
- `-v`, `--verbose` — increase the logging verbosity by one level (may be repeated and combined)
- `-q`, `--quiet` — decrease the logging verbosity by one level (may be repeated and combined)
- `--log-level=${level}` — set the logging verbosity to one of the well-known syslog levels (by default, the value of the environment variable `MSC_LOG_LEVEL` is used, which in turn defaults to `NOTICE`)
- `--help` — show usage information and exit
- `--version` — show version information and exit

If an argument is shown for the long form of an option, the short form of the option accepts that same argument, too.
This section is about setting up the experiment, not about build-system configuration.

The configuration read by the driver script is found in the `./config/` directory. The Git repository contains two configurations, `./config-light/` and `./config-heavy/`, with the former intended for a quick check and the latter for a thorough experiment. `./config` is a symbolic link to `./config-light/`. The files in `./config-heavy/` and `./config-light/` might serve as a good starting point for writing your own configuration.
All `*.cfg` files have in common that blank lines are ignored, as is everything after a “`#`” character.

The file `graphs.cfg` controls the graphs that will be generated. Its format is a table where the first column lists the graph generator and the subsequent columns the desired number of graphs per size class. All columns except the first (which must not have a title) must have a title that specifies the size class. For example, the following configuration

                SMALL MEDIUM
    LINDENMAYER    10      5
    ROME           20      *

specifies that 10 small and 5 medium-sized `LINDENMAYER` graphs are desired, as well as 20 small graphs from the `ROME` collection and all medium-sized graphs therein. Using an asterisk only makes sense for generators that import from a finite collection. If you build the `deploy` target over and over again, you might get tired of the driver scanning the graph archives each time. Setting the `MSC_QUICK_ARCHIVE_IMPORT` environment variable to a positive integer will cause the driver to take a short-cut and assume that the graphs that are currently in the database are all that can be found in the archive, and not scan it again.
Warning: The driver parses the “table” in this file by interpreting each line as a list of tokens (one per column). The offset inside the file does not matter. Therefore, you cannot leave table cells empty. It is recommended that you format the file with aligned columns as a table to improve human readability but doing so is not required as far as the driver is concerned.
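A sketch of this parsing rule in Python (a hypothetical helper, not the driver's actual code):

```python
def parse_graphs_cfg(text):
    """Parse a graphs.cfg-style table per the rules above: '#' starts a
    comment, blank lines are ignored, and each remaining line is a list
    of whitespace-separated tokens (the offset does not matter)."""
    rows = []
    for line in text.splitlines():
        tokens = line.split("#", 1)[0].split()
        if tokens:
            rows.append(tokens)
    header, body = rows[0], rows[1:]
    return {row[0]: dict(zip(header, row[1:])) for row in body}
```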
The following graph sizes are defined. A graph with n vertices belongs to a given size class if and only if n_min ≤ n < n_max.

| Size Class | n_min | n_max |
|---|---|---|
| `TINY` | 0 | 10 |
| `SMALL` | 10 | 100 |
| `MEDIUM` | 100 | 1,000 |
| `LARGE` | 1,000 | 100,000 |
| `HUGE` | 100,000 | ∞ |
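The classification rule can be sketched as follows (the boundaries mirror the table above; the authoritative definition is the `GraphSizes` enumerator in `./driver/constants.py`):

```python
SIZE_CLASSES = [
    # (name, n_min, n_max) with membership rule n_min <= n < n_max
    ("TINY",        0,        10),
    ("SMALL",      10,       100),
    ("MEDIUM",    100,      1000),
    ("LARGE",    1000,    100000),
    ("HUGE",   100000, float("inf")),
]

def size_class(n):
    """Return the size class of a graph with n vertices."""
    for name, nmin, nmax in SIZE_CLASSES:
        if nmin <= n < nmax:
            return name
```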
The size classes are specified by the enumerator `GraphSizes` which is defined in the file `./driver/constants.py`. In case of doubt, please refer to that definition and consider the information provided by the table above potentially outdated.
The following graph generators are defined:

- `SMTAPE` — imports graphs from the `SMTAPE` set of the Harwell-Boeing collection in NIST's Matrix Market
- `PSADMIT` — imports graphs from the `PSADMIT` set of the Harwell-Boeing collection in NIST's Matrix Market
- `GRENOBLE` — imports graphs from the `GRENOBLE` set of the Harwell-Boeing collection in NIST's Matrix Market
- `BCSPWR` — imports graphs from the `BCSPWR` set of the Harwell-Boeing collection in NIST's Matrix Market
- `RANDDAG` — imports graphs from the `RANDDAG` collection hosted on `graphdrawing.org`
- `NORTH` — imports graphs from the `NORTH` collection hosted on `graphdrawing.org`
- `ROME` — imports graphs from the `ROME` collection hosted on `graphdrawing.org`
- `IMPORT` — imports graphs (and optionally native layouts) from arbitrary user-defined sources specified in the user-provided `${configdir}/imports.json` configuration file
- `LINDENMAYER` — probabilistic algorithm creating graphs with native layouts utilizing a stochastic L-system
- `QUASI3D` — probabilistic algorithm creating graphs with native layouts from a random projection of a regular 3-dimensional lattice onto a 2-dimensional plane
- `QUASI4D` — like `QUASI3D` but using a 4-dimensional lattice
- `QUASI5D` — like `QUASI3D` but using a 5-dimensional lattice
- `QUASI6D` — like `QUASI3D` but using a 6-dimensional lattice
- `GRID` — probabilistic algorithm creating graphs with native layouts as regular n × m grids
- `TORUS1` — probabilistic algorithm creating graphs as regular n × m grids connected to form a 1-torus (a cylinder)
- `TORUS2` — probabilistic algorithm creating graphs as regular n × m grids connected to form a 2-torus (a doughnut)
- `MOSAIC1` — probabilistic algorithm creating graphs with native layouts by recursively splitting the facets of an initial regular polygon
- `MOSAIC2` — like `MOSAIC1` but with the amount of randomness reduced to produce more symmetric results
- `BOTTLE` — probabilistic algorithm creating graphs with native layouts as axonometric projections of 3D meshes of random bodies of revolution
- `TREE` — probabilistic algorithm creating random trees
- `RANDGEO` — generates a random geometric graph using a procedure similar to the one presented by Markus Chimani at GD'18

These constants are specified by the enumerator `Generators` which is defined in the file `./driver/constants.py`.
Download Cache: Downloaded files can be cached locally in a directory specified by the environment variable `MSC_CACHE_DIR`, or in the system's default temporary directory (`/tmp/` on POSIX) if said variable is not set. If you set the environment variable to the (absolute) path of a directory that (unlike `/tmp/`) won't be wiped routinely, the driver will download archives only once. Repeated downloads not only slow down the graph generation process, they also put unnecessary load on the servers of your fellow researchers who are kindly providing the archives to the public at no charge. In the worst case, a server operator might consider your repeated download attempts abusive and blacklist your IP address. Finally, using a cache also enables you to work offline without any internet connection (once the cache is populated, that is, of course).
The file `layouts.cfg` specifies the desired layouts to compute. Its format is a list of layout algorithms, each followed by a list of graph sizes for which the algorithm should be applied. For example, the configuration

    NATIVE ...
    FMMM   ... LARGE
    STRESS TINY SMALL

specifies that native layouts should be “computed” for graphs of all sizes (the ellipsis), while `FMMM` layouts should be computed for graphs up to and including `LARGE` size and `STRESS` layouts should be computed for `TINY` and `SMALL` graphs only. The ellipsis can be used in three ways. If it is used alone, it selects all size classes. If it is used as the first or last word in a row, it selects all size classes up to (and including) the following class, or all size classes from (and including) the preceding class, respectively. If an ellipsis is used between two other size classes, it selects all size classes in between.
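The three uses of the ellipsis can be sketched as follows (a hypothetical helper mirroring the rules just described, not the driver's actual code):

```python
SIZES = ["TINY", "SMALL", "MEDIUM", "LARGE", "HUGE"]

def expand(tokens):
    """Expand '...' in a row of size-class tokens."""
    if tokens == ["..."]:
        return list(SIZES)  # a lone ellipsis selects everything
    result = []
    for i, tok in enumerate(tokens):
        if tok != "...":
            result.append(tok)
        elif i == 0:
            # leading ellipsis: all classes below the following one
            result.extend(SIZES[: SIZES.index(tokens[1])])
        elif i == len(tokens) - 1:
            # trailing ellipsis: all classes above the preceding one
            result.extend(SIZES[SIZES.index(tokens[-2]) + 1 :])
        else:
            # ellipsis in between: all classes between its neighbours
            lo = SIZES.index(tokens[i - 1]) + 1
            hi = SIZES.index(tokens[i + 1])
            result.extend(SIZES[lo:hi])
    return result
```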
The following layout algorithms are defined:

- `NATIVE` — not a layout “algorithm” but merely a request to use the “native” layout (if any) provided by the graph generator
- `FMMM` — Fast Multipole Multilevel layout algorithm
- `STRESS` — energy-based layout using stress minimization
- `DAVIDSON_HAREL` — Davidson-Harel layout algorithm
- `SPRING_EMBEDDER_KK` — spring-embedder layout algorithm by Kamada and Kawai
- `PIVOT_MDS` — pivot MDS (multi-dimensional scaling) layout algorithm
- `SUGIYAMA` — Sugiyama's layout algorithm
- `RANDOM_UNIFORM` — garbage layout algorithm assigning independent random coordinates (drawn from a uniform distribution) to each vertex
- `RANDOM_NORMAL` — garbage layout algorithm assigning independent random coordinates (drawn from a normal distribution) to each vertex
- `PHANTOM` — garbage layout algorithm using the coordinates of a force-directed layout computed for a random “phantom” graph (which has the same number of nodes and edges)

These constants are specified by the enumerator `Layouts` which is defined in the file `./driver/constants.py`.
The files `interpolation.cfg` and `worsening.cfg` control the computation of “interpolated” and “worsened” layouts respectively. Their format is the same: a list of interpolation or worsening algorithms, each followed by a list of numbers (between 0 and 1) that select the rates at which interpolations or worsenings should be computed using that algorithm. For example, the configuration

    PERTURB 0.15 0.25
    MOVLSQ  0.10

specifies that worsened layouts using the `PERTURB` algorithm shall be computed at rates of 15 % and 25 % while worsened layouts using the `MOVLSQ` algorithm shall be computed at a rate of 10 % only.
The following layout interpolation algorithms are available:

- `LINEAR` — uses simple linear interpolation between vertex coordinates
- `XLINEAR` — like `LINEAR` but tries to reduce paradoxical effects by aligning the principal axes of the two parent layouts beforehand

These constants are specified by the enumerator `LayInter` which is defined in the file `./driver/constants.py`.
The following layout worsening algorithms are available:

- `FLIP_NODES` — flips the coordinates of randomly selected pairs of nodes
- `FLIP_EDGES` — flips the coordinates of randomly selected pairs of adjacent nodes
- `MOVLSQ` — deforms the entire drawing using affine transformations based on moving least squares as described by Schaefer et al.
- `PERTURB` — adds white noise to the vertex coordinates

These constants are specified by the enumerator `LayWorse` which is defined in the file `./driver/constants.py`.
The `properties-disc.cfg`, `properties-cont.cfg` and `metrics.cfg` files use the same format as the `layouts.cfg` file, except that they specify (discrete or continuous) properties and metrics rather than layouts to be computed. For example, the configuration file

    ANGULAR     SMALL ... HUGE
    EDGE_LENGTH MEDIUM

specifies that the `ANGULAR` property shall be computed for all layouts of graphs from small to huge size (both inclusive) and the `EDGE_LENGTH` property shall be computed for layouts of medium-sized graphs only.
The following properties are available:

- `RDF_GLOBAL` — pairwise distances between nodes
- `RDF_LOCAL` — pairwise distances between nodes separated in the graph by no more than a given threshold
- `ANGULAR` — angles between edges incident to the same node
- `EDGE_LENGTH` — edge lengths
- `PRINCOMP1ST` — node coordinates along the first (major) principal axis
- `PRINCOMP2ND` — node coordinates along the second (minor) principal axis
- `TENSION` — quotients of the Euclidean distance in the layout and the graph-theoretical distance between vertices

These constants are specified by the enumerator `Properties` which is defined in the file `./driver/constants.py`.
The following metrics are available:

- `STRESS_KK` — stress as defined by Kamada and Kawai with a desired edge length of 100
- `STRESS_FIT_NODESEP` — stress as defined by Kamada and Kawai with a desired edge length chosen to minimize the result value
- `STRESS_FIT_SCALE` — stress as defined by Kamada and Kawai with a desired edge length of 100, computed after scaling the layout homogeneously to minimize the result value
- `CROSS_COUNT` — number of edge crossings
- `CROSS_RESOLUTION` — minimal angle between any two intersecting edges
- `ANGULAR_RESOLUTION` — minimal angle between any two edges incident to the same node
- `EDGE_LENGTH_STDEV` — standard deviation of the edge lengths

These constants are specified by the enumerator `Metrics` which is defined in the file `./driver/constants.py`.

Note that the `HUANG` discriminator cannot work unless the `CROSS_COUNT`, `CROSS_RESOLUTION`, `ANGULAR_RESOLUTION` and `EDGE_LENGTH_STDEV` metrics are available.
The `puncture.cfg` file is simply a list (one item per line) of properties that should be deliberately set to zero when training and testing the model. For example, the configuration

    RDF_GLOBAL
    RDF_LOCAL

will cause the entries relating to the properties `RDF_GLOBAL` and `RDF_LOCAL` (continuous or discrete) to be “punctured” from the feature vectors. This is really only useful for experimenting, and the environment variable `MSC_PUNCTURE` may be used to avoid puncturing by accident. The variable is expected to be set to a non-negative integer specifying the number of punctured properties. In the above example, you would set `MSC_PUNCTURE=2` when running the driver. It will then check that there are indeed exactly two properties punctured and trigger an error if the active configuration disagrees. If the environment variable is unset, no check can be performed and a warning will be printed.
See above for a list of available properties.
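The safety check can be sketched as follows (a hypothetical helper, not the driver's actual code):

```python
import os

def check_puncture(punctured, environ=os.environ):
    """Verify MSC_PUNCTURE as described above; ``punctured`` is the
    list of property names read from puncture.cfg."""
    expected = environ.get("MSC_PUNCTURE")
    if expected is None:
        print("warning: number of punctured properties cannot be verified")
        return
    if int(expected) != len(punctured):
        raise RuntimeError("expected {} punctured properties but found {}"
                           .format(expected, len(punctured)))
```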
There is one last configuration file, `imports.json`, which, as its name suggests, is in JSON format. This file specifies where to find import sources for graphs (for the `IMPORT` generator). Its format is either a single JSON object or a JSON array of JSON objects; using a single object is just a convenience for an array with only one element. Each object specifies one archive, and the archives will be considered in turn. It is therefore most useful to specify multiple import sources only in conjunction with selecting to import all available graphs (using the `*` in the `graphs.cfg` file). An import source must always have a type. The remaining fields permissible (and some required) in the JSON object depend on the type of the archive. The following types are defined and accept or require the attributes mentioned as nested items. Items with no default value mentioned are mandatory; the lettered notes (a–e) are explained below the list.
- `DIR` — refers to a local directory
  - `type` (string) — must be the text `DIR`
  - `directory` (string, notes a and b) — specifies the directory (path) to scan for graph files
  - `format` (string, note d) — specifies the file format used by the archive
  - `compression` (string, note c; default: `NONE`) — specifies the compression (if any) applied to the files in the archive
  - `pattern` (string, default: `*`) — specifies a POSIX globbing expression by which to select files in the directory
  - `recursive` (boolean, default: `false`) — selects whether or not the directory shall be scanned recursively for graph files to import
  - `layout` (boolean, default: `null`; note e) — specifies whether or not the graph files have an associated native layout
  - `simplify` (boolean, default: `false`) — specifies whether or not to “simplify” the imported graphs by pruning multiple edges and loops and making the graph undirected
- `TAR` — refers to a tarball specified via a URL
  - `type` (string) — must be the text `TAR`
  - `url` (string) — specifies the URL of the tarball
  - `format` (string, note d) — specifies the file format used by the archive
  - `compression` (string, note c; default: `NONE`) — specifies the compression (if any) applied to the files in the archive
  - `cache` (string, note b; default: don't cache) — specifies a file name (not a full path) to use for caching the tarball locally, which is useful if the URL is not a `file://` URL
  - `checksum` (string, default: don't verify checksum) — specifies the cryptographic hash algorithm and hex digest to expect for the tarball in the format `${algo}:${hash}` where `${algo}` is one of the strings understood by the `hashlib` module from the Python standard library and `${hash}` is a hexadecimal encoding of the expected checksum
  - `pattern` (string, default: `*`) — specifies a POSIX globbing expression by which to select files in the archive
  - `layout` (boolean, default: `null`; note e) — specifies whether or not the graph files have an associated native layout
  - `simplify` (boolean, default: `false`) — specifies whether or not to “simplify” the imported graphs by pruning multiple edges and loops and making the graph undirected
- `URL` — refers to a collection of URLs of individual graphs
  - `type` (string) — must be the text `URL`
  - `urls` (array of strings) — lists the URLs of the graph files to consider
  - `format` (string, note d) — specifies the file format used by the archive
  - `compression` (string, note c; default: `NONE`) — specifies the compression (if any) applied to the files
  - `layout` (boolean, default: `null`; note e) — specifies whether or not the graph files have an associated native layout
  - `simplify` (boolean, default: `false`) — specifies whether or not to “simplify” the imported graphs by pruning multiple edges and loops and making the graph undirected
  - `name` (string, note b; default: `www`) — specifies an informal name for the archive, which should be a valid identifier (`\w+`)
  - `cache` (boolean, default: `false`) — specifies whether or not to cache the downloaded files locally in a database
- `NULL` — a dummy archive that contains no graphs
  - `type` (string) — must be the text `NULL` (that's right: the string literal `"NULL"` as opposed to the special JSON value `null`)
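For illustration, a minimal configuration with a single `DIR`-type import source might look like this (the directory path is, of course, made up):

```json
[
    {
        "type" : "DIR",
        "directory" : "${HOME}/my-graphs",
        "format" : "GRAPHML",
        "recursive" : true,
        "simplify" : true
    }
]
```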
a) Environment variables can be expanded using shell syntax (e.g. `${HOME}/work/` might get expanded to `/home/5gon12eder/work/`; the curly braces may be omitted).

b) If the first character is a “`~`”, it will be expanded to a user's home directory (e.g. `~/work/` might get expanded to `/home/5gon12eder/work/` and `~foo/work/` to `/home/foo/work/`).

c) Acceptable values for the `compression` attribute are `GZIP`, `BZIP2` and `NONE` with the obvious meaning. These constants are case-insensitive.

d) Acceptable values for the `format` attribute may be found by invoking the `import` tool (`${bindir}/src/generators/import`) with the `--show-formats` option. As of this writing, the following formats were supported. The names are case-insensitive. Links to the official documentation of the formats are provided where available. Please also refer to the reference documentation of the `ogdf::GraphIO` API.
- `BENCH`
- `CHACO`
- `DL` — UCINET DL format
- `DMF` — DIMACS Max Flow Challenge
- `DOT`
- `GDF` — GUESS Database File
- `GD_CHALLENGE` — Graph Drawing Challenge: Area Minimization for Orthogonal Grid Layouts
- `GEXF` — Graph Exchange XML Format
- `GML` — Graph Modelling Language
- `GRAPH6` — the Graph6 format represents a (preferably dense or small) simple undirected graph as a string containing printable characters between 0x3F and 0x7E
- `GRAPHML` — Graph Markup Language
- `LEDA` — LEDA Native File Format for Graphs
- `MATRIX_MARKET`
- `PLA`
- `PM_DISS_GRAPH` — graph file format from Petra Mutzel's PhD thesis
- `ROME` — Rome-Lib format
- `RUDY`
- `STP` — SteinLib STP Data Format
- `TLP` — Tulip software graph format
- `YGRAPH`
e) If `layout` is set to `true`, then all graphs must have an associated layout which will be treated as the native layout for the respective graph. If the archive contains multiple versions of the same graph, all but one will be discarded. If `layout` is set to `false`, only the graph data will be imported even if layout data is available; duplicate graphs will be discarded. If `layout` is set to the special value `null` (which is the default), then graphs are assumed to have no native layout, but if a graph happens to have associated layout information, it will be imported and (later) treated as an unclassified layout. (Such layouts will not be used for training or testing and have no implicit quality assigned.) In this case, if the archive contains multiple versions of the same graph, all layouts (if any) will be imported.
Unlike vanilla JSON, the format used for the `imports.json` file allows simple comments: two consecutive forward slashes as the first non-white-space characters on a line cause the entire line to be ignored. You can, therefore, write

    {
        // "bar" : "Sorry, not today!",
        // foo is always good
        "foo" : 42
    }

which will be interpreted as if you had written `{ "foo" : 42 }` instead.
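The comment handling can be sketched as follows (a hypothetical helper, not the driver's actual code):

```python
import json

def load_imports(text):
    """Parse imports.json-style text, ignoring lines whose first
    non-white-space characters are '//' as described above."""
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith("//")]
    return json.loads("\n".join(kept))
```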
To get an idea, you might want to have a look at the file `./driver/resources/imports.json` which provides the definitions for the well-known graph archives that are supported out-of-the-box. The one important difference to be aware of is that this file specifies archives as the values in a JSON object (with the key being the name of the well-known import source) while the configuration file you may write is supposed to contain the archive specifications as the elements of a JSON array (with no keys) instead.
Here is a summary of all environment variables that are honored by the driver:

- `MSC_MODEL_DEBUGDIR` — if set to a directory, the driver will dump human-readable information about the built neural network into it
- `MSC_NN_TEST_SUMMARY` — if set to a file name, the driver will write to it a summary (in JSON format) of how the various discriminators performed on the test data set
- `MSC_PUNCTURE` — if set to a non-negative integer N, the driver will check that exactly N properties are punctured (see `puncture.cfg`)
- `MSC_LOG_LEVEL` — if set to one of the well-known syslog levels, defines the initial verbosity of the driver (may be altered by passing additional `--verbose` or `--quiet` options or overridden by the `--log-level` option)
- `MSC_CACHE_DIR` — if set to a directory, the driver will use it to cache downloaded files and rendered pictures served by the local HTTP server
- `MSC_QUICK_ARCHIVE_IMPORT` — if set to a positive integer, graph archives won't be scanned for more graphs if the multiplicity “`*`” (see `graphs.cfg`) was selected for the desired number of graphs; setting this variable to 0 has the same effect as not setting it at all and will cause the archive to be scanned
The following environment variables can be used to specify external programs the driver will use.

| First Choice | Second Choice | Default Value |
|---|---|---|
| `MSC_GNUPLOT` | `GNUPLOT` | `gnuplot` |
| `MSC_IMAGE_MAGICK` | `IMAGE_MAGICK` | `convert` |
| `MSC_ZCAT` | `ZCAT` | `gzip -dc` |
| `MSC_BZCAT` | `BZCAT` | `bzip2 -dc` |
As the driver starts up, it will first check the environment variable mentioned as “First Choice” or, if that is not set, check the environment variable mentioned as “Second Choice” next or, if that isn't set either, use the value listed as “Default Value”. At verbosity `INFO` or higher, the driver will report the result of these checks. In case of doubt, the information printed there will be more up-to-date than this `README` document.
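In Python terms, the lookup is simply (an illustrative sketch):

```python
def external_program(first, second, default, environ):
    """First-choice / second-choice lookup as described above."""
    return environ.get(first) or environ.get(second) or default
```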
Once a value is obtained by the procedure described above, it will be tokenized according to shell rules and used thenceforth. The first token must be an absolute path (that is, use `/usr/bin/zcat`, not `zcat`). For security reasons, the individual tokens must not require any escaping; otherwise, the driver will reject the command. (That is, you cannot have `"/path/with spaces/foo" \(lol\)` as a command. You must create a wrapper script if your system should happen to be this wicked.)
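Both rules can be sketched with the standard `shlex` module (illustrative, not the driver's actual code):

```python
import shlex

def parse_command(value):
    """Tokenize an external-program setting per the rules above:
    shell-style splitting, no token may require escaping, and the
    first token must be an absolute path."""
    tokens = shlex.split(value)
    for token in tokens:
        if shlex.quote(token) != token:
            raise ValueError("token would require escaping: " + token)
    if not tokens or not tokens[0].startswith("/"):
        raise ValueError("first token must be an absolute path")
    return tokens
```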
This repository also includes the TeX source code for the written thesis, the presentation slides and the paper submitted to GD'18. What is more, it includes a wealth of auxiliary scripts that automatically extract experimental results and format them for inclusion in the TeX files. Therefore, if you re-run the experiments (build the `deploy` and `eval` targets) and then typeset any of the documents, they will automatically contain the numbers corresponding to the results of your experiments. Please note that while doing this is kind of fun and might help you convince yourself that the published numbers are sound, you must not publish the documents you've typeset. Typesetting your own version with different inputs constitutes the creation of a derived work which the CC BY-NC-ND license does not allow you to redistribute. This restriction, unfortunately, is necessary not only for legal reasons but also in order to avoid multiple documents with slightly different numbers circulating.
Typesetting any of the documents works similarly. On the CMake level, you build the `report` or `paper` target or the `slides-kit` and `slides-gd18` targets. Each of these targets `${target}` is accompanied by a `${target}-clean` target which removes the auxiliary files created by TeX for the respective target. If you edit any TeX file and are unfortunate enough, removing those files might be required in order for TeX to be able to typeset the document successfully again, as any TeX'nician will have experienced from time to time.
The actual typesetting is accomplished by the `./utils/typeset.py` script. This Python script creates a symlink farm to enable out-of-tree TeX builds and takes care of invoking the various TeX tools in the appropriate order and the required number of times. By default, all documents draw plots and graph layouts directly in TeX using the `tikz` package. This is very cool but – unfortunately – takes a lot of time and a lot of memory. Since neither TeX, pdfTeX nor XeTeX (at least not from the TeXLive 2018 distribution) is capable of dynamically allocating the required amount of memory, the only TeX engine that will work out-of-the-box is LuaTeX. Therefore, the build system will typeset all documents using LuaLaTeX, which is pretty darn slow. Expect build times of several minutes or more.
On-Demand Downloads: The documents include some example graphs taken from public graph collections which will be downloaded on-demand when the documents are typeset. Unless the files are already available from a local cache, access to the internet will be required. See the section “Automatic File Downloads” for more information. Some documents may require additional downloads (other than graphs) that will be mentioned below.
TeX'nical Troubleshooting: If the environment variable `MSC_TEX_REPORT_HTML` is set (to an absolute file name), the `./utils/typeset.py` script will write a report (in HTML format) to that file which contains a nicely formatted combination of all relevant log files. In case of an error, the HTML anchor `#error-1` will land you on the first error (and so forth for any subsequent errors).
Typesetting Documents Without Experimental Results: Since building the `eval` target always runs the full experiments from scratch (which takes an enormous amount of time), it is not an explicit dependency of the `report`, `paper` and `slides-*` targets, although you have to build it once before those targets can be built. As a cheataround, you may set the environment variable `MSC_LAZY_EVAL_OKAY` to a positive integer, which will produce versions of the publications with dummy values substituted for the actual evaluation results.
Official Logos: In order to avoid copyright and trademark issues, no official logos are included in the repository. Instead, a transparent picture of the same size will be used. You can provide the absolute paths of those logos you have handy by setting the CMake variables `MSC_LOGO_KIT`, `MSC_LOGO_ALGO` and `MSC_LOGO_IOSB` accordingly. The logos must be in PDF format. (The `./maintainer/configure` script will recognize an environment variable with the same name and pass its value on to CMake.)
Checking Assertions via TeX: Some (admittedly very few) textual claims can be verified by automatic assertions added to the LaTeX code via `\directlua` magic. In order to enable these checks, set the environment variable `MSC_TEX_ASSERT` to a positive integer. You should not combine this with setting `MSC_LAZY_EVAL_OKAY`, as you cannot expect correct results when cheating in the first place.
Building the `report` target will download the Roboto font from Google Fonts, which is licensed under the Apache License, Version 2.0. Furthermore, a photo will be downloaded from the Wikimedia Commons for inclusion in the typeset document. This file has been released into the public domain by its author.
Building the `paper` target will download the two files `llncs.cls` and `splncs04.bst` which are copyrighted by Springer Verlag for its “Lecture Notes in Computer Science” series. The latter file states that it is available under the LaTeX Project Public License distributed from CTAN archives in directory `macros/latex/base/lppl.txt`, either version 1 of the License, or any later version. The other file lacks a copyright notice but it is assumed that the same conditions apply.
It is possible to build a “cache” of those pictures that are resource-hungry to render. The cache consists of a
directory that contains pre-rendered PDF or EPS versions of the pictures, which are thenceforth included as-is
instead of being re-drawn from first principles each time the main TeX document is typeset. This feature is
implemented using the externalize library of tikz. The cache can be built via the paper-cache target and removed
again via the paper-clean target. If a cache exists, the pre-rendered pictures will be utilized automatically.
Warning: The cache will not be re-built as needed. If any TeX code is changed that might affect the rendering of
a picture, it is necessary to build the paper-cache target again in order to update the cached pictures. Otherwise,
building the paper target will insert outdated pictures into the document. Unfortunately, there is no reliable way
to track dependencies through TeX code.
There is also a paper-pubar target that creates an archive ${bindir}/paper/graphstudy.zip with the relevant files
for the paper. Please note that this archive contains multiple files at the top level rather than a single
directory in which all actual files are found.
You will notice that some enumerators are spelled differently in the paper than in the other publications, the
source code or this README document. These renamings were motivated by the desire to use constant names that are
shorter and, in some cases, politically more appropriate. You can find the mappings of “source code” to
“paper” names in files called ./paper/rename-*.txt, all of which have the format of one renaming per line, with
the “source code” name on the left and the “paper” name on the right. Blank lines are ignored, as is everything
following the first “#” character on a line.
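This format is easy to process with standard tools. The following helper function is a sketch (parse_renames is a made-up name, not part of the repository) that prints the renamings in such a file as “source -> paper” pairs:

```shell
# Hypothetical helper: list the renamings in a ./paper/rename-*.txt style
# file.  Everything after the first '#' on a line is stripped; lines
# without two remaining fields (e.g. blank lines) produce no output.
parse_renames() {
    sed -e 's/#.*//' "$1" | awk 'NF >= 2 { printf "%s -> %s\n", $1, $2 }'
}
```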
The presentation slides built by the slides-kit and slides-gd18 targets use the KIT's beamer theme. KIT members
can obtain it from the KIT's intranet. Unfortunately, this resource is not available to the general public, even
though the files state that they are subject to the terms of the LaTeX Project Public License, either version 1.2
of this license or (at your option) any later version. Additional trademark restrictions might apply. The files
for this theme are neither included in the repository nor downloaded during the build process. Rather, the theme
is assumed to be installed on the host system. If doing so is inconvenient for you, you may instead prepare a ZIP
archive with the following files (not all of which might be strictly necessary).
KIT16.clo
KIT18.clo
KIT20.clo
KIT22.clo
KIT24.clo
KITcolors.sty
KITdefs.sty
KITmcfloat.sty
beamercolorthemeKIT.sty
beamerfontthemeKIT.sty
beamerinnerthemeKIT.sty
beamerouterthemeKIT.sty
beamerthemeKIT.sty
kit_logo_de_1c_schwarz.pdf
kit_logo_de_4c_positiv-rgb.pdf
kit_logo_de_4c_positiv.pdf
kit_logo_en_1c_schwarz.pdf
kit_logo_en_4c_positiv-rgb.pdf
kit_logo_en_4c_positiv.pdf
The archive must contain the files directly at the top level – not within a sub-directory. Tell CMake (at
configuration time) the location of this ZIP archive by setting the variable MSC_KIT_BEAMER_ZIP to the absolute
path where the archive can be found, and it will be extracted into the build directory as needed without
clobbering the global TeX installation on your system. (The ./maintainer/configure script will recognize an
environment variable with the same name and pass its value on to CMake.)
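For example (a sketch; the archive path is a hypothetical placeholder):

```shell
# Sketch only: point the build at a self-made archive of the theme files.
export MSC_KIT_BEAMER_ZIP=/path/to/kit-beamer-theme.zip
./maintainer/configure
cmake --build "${bindir}" --target slides-kit
```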
There are four types of tests:
- Unit Tests test individual functions from the C++ support library libcommon.a. For each component ${foo} in that
  library (consisting of a header file ./src/common/${foo}.hxx and a source file ./src/common/${foo}.cxx and,
  optionally, a file ./src/common/${foo}.txx with inline C++ code), there exists a unit test file
  ./test/unit/${foo}.cxx which will be compiled into an executable ${bindir}/test/unit/test-${foo}. Invoking that
  executable with no arguments will run all unit tests for that component. A short message will be printed for each
  exercised test as well as a summary of the overall test harness. If a test fails, a more detailed error will be
  printed.
- Doctests are the way unit testing is done in the driver script. They can be exercised by running the doctest
  driver module (i.e. python3 -m driver.doctests). It accepts a --help option that informs you about different ways
  to run the tests.
- CLI Tests test the command line interface of a program. They are defined in the same CMakeLists.txt file that
  defines the executable. These tests are usually fairly trivial but make sure that the program can be invoked at
  all.
- System Tests test the overall functioning of the driver script.
Either of the following two commands will exercise all of the above tests.
$ cmake -E chdir ${bindir} ctest -T Test # using CTest
$ cmake --build ${bindir} --target test # using CMake
The following environment variables are honored by the unit test driver:
- MSC_RANDOM_SEED — This environment variable may be set to an arbitrary byte sequence in order to make unit tests
  deterministic. Not all tests honor this variable at the moment, though.
- MSC_TEST_ANSI_TERMINAL — Enables colorized output (using ANSI escape sequences) if set to a positive integer and
  disables it if set to 0. The unit test driver is currently not smart about figuring out whether the output
  terminal might support ANSI escape sequences or even is a terminal to begin with. If this variable is not set,
  no colorized output will ever be produced; if it is set, its value will be definitive.
- MSC_TEST_ANSI_COLOR_SKIPPED — If colorized test output is enabled, the value of this environment variable will
  be interpreted as a decimal digit specifying an ANSI terminal color to use for skipped unit tests. The default
  value is 3 (yellow). If colorized test output is not enabled, this environment variable has no effect.
- MSC_TEST_ANSI_COLOR_FAILED — If colorized test output is enabled, the value of this environment variable will be
  interpreted as a decimal digit specifying an ANSI terminal color to use for failed unit tests. The default value
  is 1 (red). If colorized test output is not enabled, this environment variable has no effect.
- MSC_TEST_ANSI_COLOR_ERROR — If colorized test output is enabled, the value of this environment variable will be
  interpreted as a decimal digit specifying an ANSI terminal color to use for unit tests that encountered hard
  errors. The default value is 5 (purple). If colorized test output is not enabled, this environment variable has
  no effect.
- CTEST_OUTPUT_ON_FAILURE (standard CTest setting) — If set to 1, the individual output of failed tests will be
  shown. Setting it to 0 turns on the default behavior of not showing the output of individual tests.
- CTEST_PARALLEL_LEVEL (standard CTest setting) — If set to a positive integer N, run up to N tests in parallel.
  (Tip: You may consider export'ing CTEST_PARALLEL_LEVEL=$(nproc) in your ~/.profile to utilize all of your CPUs
  whenever you're running CTest.)
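Putting several of these together, a parallel, colorized test run that shows the output of failing tests could look like the following sketch (none of these settings are required):

```shell
# Sketch only: run all tests in parallel with colorized output and show
# the output of any failing test.
export MSC_TEST_ANSI_TERMINAL=1
export CTEST_OUTPUT_ON_FAILURE=1
export CTEST_PARALLEL_LEVEL="$(nproc)"
cmake -E chdir "${bindir}" ctest -T Test
```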
Several unit tests are sensitive to specific environment variables that usually cause them to output additional debugging information. You will recognize these variables when reading the unit test code (which you will likely do after a test has failed), and in that case you will understand what effect each variable has and why you might want to use it.
Copyright (C) 2018 Moritz Klammler
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
- Explain how to find out what constants an enumerator has (driver)
- Give an overview of how the C++ code is organized
- Give an overview of how the driver code is organized
- Write something about benchmarking (or delete the benchmarks sub-directory?)
- Give pointers to potential contributors
- Explain how to “bring your own” graph generator, layout / worsening / interpolation algorithm, property or metric
- Can I also “bring my own” discriminator?