Aesthetic Discrimination of Graph Layouts

This package contains all the code for Moritz Klammler's master's thesis Aesthetic Value of Graph Layouts: Investigation of Statistical Syndromes for Automatic Quantification and the subsequent paper Aesthetic Discrimination of Graph Layouts by Moritz Klammler, Tamara Mchedlidze and Alexey Pak. It is self-contained so that all experiments can be reproduced from scratch and even the written thesis, presentation slides and conference paper can be typeset automatically, incorporating the most recent results.

Publications and Citation

You may cite this work (the source code repository) like so:

Klammler, M. et al.: Source code for aesthetic discrimination of graph layouts, https://github.com/5gon12eder/msc-graphstudy/

The following BibTeX entry may come in handy:

@Misc{GitHubRepo,
  author = "Klammler, Moritz and others",
  title  = "Source Code for Aesthetic Discrimination of Graph Layouts",
  url    = "https://github.com/5gon12eder/msc-graphstudy/",
}

Here is a list of publications about this work:

  1. Klammler, M.: Aesthetic value of graph layouts: Investigation of statistical syndromes for automatic quantification. Master's thesis, Karlsruhe Institute of Technology (2018), http://klammler.eu/msc/

  2. Klammler, M., Mchedlidze, T., Pak, A.: Aesthetic Discrimination of Graph Layouts. 2018; http://arxiv.org/abs/1809.01017

The second item is about to appear in the Proceedings of the 26th International Symposium on Graph Drawing, Barcelona, Spain, 2018.

How to Use this Document

Quick Help: All executables in this project (including, in particular, all utility scripts mentioned in this README document) support a --help option that will provide you with a quick usage summary. This document does not necessarily repeat all the information from that help message so please check it out.

Convention for Paths: When this document refers to a file in the repository, paths starting with a “.” are understood to be relative to the root of the version-controlled directory tree. Paths ending with a slash refer to directories. Paths starting with a “/” are absolute paths on the host system. For paths that are just a simple file name, the directory (if any) they are relative to will be clear from context. Globbing expressions (e.g. example-*.c) will be used to refer to a set of (zero or more) files.

Convention for Placeholders: This document uses “shell expansion” for placeholders. For example, we might say “The object identified by ${id} will be accessible through the URL http://localhost:${port}/objects/${id}/” where ${id} was just introduced and you're assumed to understand from the context that ${port} refers to whatever port the web-server is listening at.

Convention for Interactive Shell Sessions: This document shows some illustrative examples of shell interaction. In those, lines starting with a “$” sign introduce commands a user (i.e. you) would enter. If the line is too long to fit, it might be continued on the next line with the previous line ending with a backslash character. Comments may be inserted using the usual syntax. Lines which are neither comments nor commands are output expected to be generated by the commands.

Here is an example showing a command, a comment and some example output:

$ date -R  # RFC 5322 format
Wed, 16 Jul 1969 13:32:00 +0000

Here is another example showing line continuation and shortened command output.

$ wget --no-verbose -O - 'https://raw.githubusercontent.com/5gon12eder/msc-graphstudy/master/README.md'   \
      | grep -Eo '(http[s]?|ftp)://[-a-zA-Z0-9@:%_+.~#?&//=]+'                                            \
      | sort -u
http://arxiv.org/abs/1809.01017
http://klammler.eu/msc/
https://github.com/5gon12eder/msc-graphstudy/
https://raw.githubusercontent.com/5gon12eder/msc-graphstudy/master/README.md
...

If an example shows no output, this does not necessarily imply that the shown command is not expected to produce any. It might also just be omitted because it is not relevant to the example. Output may also be shown in a shortened form.

Copyright

The primary author of this work is Moritz Klammler, who wrote the code and some prose during the preparation of his master's thesis and subsequent employment. He owns the copyright for a large fraction of the work in this repository. Some utility programs included in the repository were written by Moritz Klammler in the past for different purposes. For the part of the work that (1) was written upon the request of his later employer and (2) is software, the Karlsruhe Institute of Technology, namely the Algorithmics Working Group 1 at the Institute of Theoretical Informatics (Postfach 6980, 76128 Karlsruhe, https://i11www.iti.kit.edu/), is the copyright holder. Since most files were written partially under employment and partially in free time, they have mixed copyright ownership. The copyright notice in the comment at the top of each file aims to describe the situation of the individual file as faithfully as possible. Finally, the paper submitted to GD'18 and included in this repository was co-authored by Tamara Mchedlidze and Alexey Pak together with Moritz Klammler, who collectively own the copyright of the prose files.

Unless mentioned otherwise, all files in this repository are provided under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the license, or (at your option) any later version. You can find the full text of the license in a file called ./COPYING_GPL3.txt or ./LICENSE as well as online.

Some source files were added to this repository for convenience but not originally written for this project. They are usually provided under a more flexible free software license known as the “MIT” or “X11” license. If so, the comment at the top of the source file will mention that, including a full reproduction of the (very short) license text. The text of this license can also be found in the file ./COPYING_MIT.txt as well as online.

A number of small auxiliary files are provided under an even less restrictive “all-permissive” license. The files to which this applies bear a comment at the top which says: “Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without any warranty.” This is exactly what it is.

The prose files (but none of the functional code) in the ./report/, ./paper/ and ./slides_* directories are licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License the text for which can be found in a file called ./COPYING_CC4-BY-NC-ND.txt as well as online. The files to which this restrictive license applies all mention this in a comment at the top of the file.

Finally, this README document is published under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License” and can also be found in a file called ./COPYING_GFDL.txt as well as online.

Tip: Running the script ./maintainer/copyright with no arguments from the top-level source directory (of a Git clone) will print the license used for each file in the repository.

Requirements

Operating Systems and Portability

This project aims to be portable but was, so far, only tested on Arch Linux and Parabola GNU/Linux systems. We will try to address portability issues to other GNU/Linux systems once we are made aware of them. Portability to Microsoft Windows, Apple Macintosh or other proprietary and non-POSIX systems is desirable but not a goal this project will pursue at all costs. Where possible, the code will degrade gracefully with the core features remaining available.

In any case, this project relies on an up-to-date software environment. If your toolchain is outdated, the code will not work and this won't be “fixed” except by you updating your computer. We might code around some issues that affect the people working directly on this project but won't go to great lengths hacking around compiler bugs, missing library features or deficits in tools if we know that they are already fixed upstream. Recent versions of all relied-upon tools are readily available as Free Software.

Any accidental dependency on non-free software is considered a bug and will be fixed. Please report any such issues.

System Resources

Building this project and running all experiments will consume a significant amount of computation and storage resources. You should set aside a few dozens of gigabytes of disk storage and a couple of days of computing time when using the configuration found in the ./config-heavy/ directory (see below).

Since the experiments create hundreds of thousands of fairly small files, be sure to have a file system that can cope with that. (The files will be stored in a directory tree of adequate depth so you need not worry about file systems that cannot handle a huge number of files in a single directory.) It might also be a good idea to use a fast hard drive or SSD for this purpose.

Depending on the size of the graphs you wish to process, you might also need a significant amount of RAM. This should not be an issue for the provided configurations which deliberately restrict the database to only medium-sized graphs. To give you an idea: The most demanding computations are processing n × n matrices of real numbers for a graph with n vertices. This means that for a graph with 50k nodes, the program will consume RAM in excess of 18 GiB. Since the algorithm has a complexity of O(n³), you probably also don't want to wait for it.
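
As a rough sanity check of that figure (a sketch assuming one 8-byte double per matrix entry and ignoring any overhead):

$ python3 -c 'n = 50 * 1000; print(n * n * 8 / 2**30, "GiB")'
18.62645149230957 GiB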

The driver will start a separate process for each graph / layout. (This is slow, but this project is research and not industry.) To prevent the graph-processing tools from consuming an inordinate amount of system resources while still allowing the driver which orchestrates them to run for extended periods of time, some environment variables are honored by all command line tools in the ./src/ directory to control the resource limits. They all have the form MSC_LIMIT_${RES} and will be explained later in this document.

Dependencies

The project uses a CMake build system. You need CMake 3.9.1 or newer in order to build anything. There are no install targets; everything is built locally because the artifacts are not really meant to be used stand-alone.

The heart of the graph algorithms is implemented in C++ using the OGDF library, which will be automatically downloaded and (after verifying the integrity of the downloaded file) built as part of building this project. It is therefore not considered an external dependency.

The project uses (at the time of this writing) fairly cutting-edge C++17 technology so a modern C++ compiler is required. GCC 8.1.0 was tested and proven to work.

Besides, the Boost C++ libraries are required in version 1.66 or newer. Apart from the header-only components, the following compiled Boost libraries are required: Filesystem, Iostreams, Program Options and System. If you don't have a recent enough Boost version installed, you might find the script ./maintainer/get-boost handy; it will download, build and install the required Boost libraries in a number of configurations for you. The integrity of the downloads is checked. The libraries will be installed locally for the current user so no root privileges are required and existing system libraries are not messed with.

The framework is glued together with a non-trivial amount of Python code, referred to as the driver. It requires Python 3.6.2 or newer.

The following Python packages are required:

The following Python packages are optional:

The keras package will only function if you have TensorFlow installed which has to be done via the usual procedures applicable to your operating system.

The web front-end and other presentation targets will only work if Gnuplot is installed.

Further optional dependencies are Doxygen to generate API documentation for the C++ code, and a TeX toolchain (in particular, lualatex, biber or bibtex and makeindex) for typesetting the written thesis and presentation slides. The TeXLive 2018 distribution was tested and is known to work.

In order to typeset the slides, the KIT's beamer theme must be available. KIT members can obtain it from the KIT's intranet. Unfortunately, this service is not available to the general public. Please consult the section titled “Typesetting Documents” for a discussion of some workarounds.

Downloading, Configuring, Building and Testing the Software

The easiest way to obtain the software is to clone the GitHub repository.

$ git clone 'https://github.com/5gon12eder/msc-graphstudy.git'  # clone into directory 'msc-graphstudy'

Alternatively (for example, if you don't have Git), you may use the option provided by GitHub to download the current master branch as a single ZIP archive. This won't give you any version control information and is therefore a smaller download.

$ wget 'https://github.com/5gon12eder/msc-graphstudy/archive/master.zip'
$ unzip master.zip  # extract into directory 'msc-graphstudy-master'

Once all dependencies are available and the source code has been downloaded and (if necessary) unpacked, the project can be built using the usual CMake commands. In the most simple case, running

$ cmake .
$ cmake --build .

in the top-level directory should be sufficient. It might be a better idea (and is highly recommended for any serious hacking) to use out-of-tree builds, though. Once the project is configured and built, the tests can be run via

$ ctest -T Test

with additional CTest flags added as you see fit.
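
For example, a minimal out-of-tree build might look like this (a sketch assuming the repository was cloned into a directory named msc-graphstudy as shown above; the name of the build directory is arbitrary):

$ mkdir msc-graphstudy-build && cd msc-graphstudy-build  # sibling of the source tree
$ cmake ../msc-graphstudy
$ cmake --build .
$ ctest -T Test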

If you want to hack on the project, you might find it convenient to build different configurations in which case the ./maintainer/configure and ./maintainer/build scripts could be useful (please consult their --help output).

Automatic File Downloads

During the build process, a number of files will be downloaded from the internet. This means that your computer will have to be connected to the internet. There is currently no support for specifying any proxy settings.

The following resources will be downloaded:

  • The OGDF, which will be compiled and linked into the built executables.
  • The LaTeX file llncs.cls and the BibTeX file splncs04.bst for Springer's “Lecture Notes in Computer Science” which will be used for typesetting the GD'18 paper (only if the paper target is built).
  • Various non-executable data such as example graphs or pictures.

For each downloaded file, the SHA256 checksum is verified and the file is only used if the checksums match. Apart from the OGDF (which is downloaded by CMake via its ExternalProject_Add feature), all downloads are performed by the script ./utils/download.py (which you might find useful, too).

Download Cache: If the environment variable MSC_CACHE_DIR is set (to the absolute path of an existing directory) then the download script will use it as a download cache. If asked to download a file with expected ${algo} hex-encoded checksum ${hash} it will first check for a file ${MSC_CACHE_DIR}/downloads/${algo}-${hash}.oct. If that file exists and has the correct checksum, it will be used and no download will be attempted. Using a download cache is recommended in particular if you build multiple configurations next to each other as it will reduce the number of downloads and allow for offline builds once the cache is populated.
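
For example, a cache directory could be set up like this (the path below is merely a suggestion; any existing directory works):

$ mkdir -p "${HOME}/.cache/msc-graphstudy"
$ export MSC_CACHE_DIR="${HOME}/.cache/msc-graphstudy"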

Download Trace: The download script also honors the environment variable MSC_TRACE_DOWNLOADS which can be set to the absolute path of a file to which the script will append one line per (attempted) download. Each line in the file will be a JSON object holding the following keys.

  • url (string) — URL for which a download was attempted
  • date (string) — timestamp when the download was started (in RFC-5322 format)
  • time (real) — elapsed time in seconds (only present if the download was successful)
  • size (integer) — size of the downloaded file in bytes (only present if the download was successful)
  • error (string) — informal error message (only present if the download failed)
  • digest-${algo} (string) — computed ${algo} checksum (e.g. digest-sha256) of the downloaded file as a hexadecimal string (only present if the download was successful)

Please note that the file as a whole is not a valid JSON document but can be transformed into one by adding a comma after each line but the last and wrapping its entire content between “[” and “]”.
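
A minimal sketch that performs this transformation (assuming the trace was written to the file named by MSC_TRACE_DOWNLOADS and that Python 3 is available):

$ python3 -c 'import json, sys; print(json.dumps([json.loads(l) for l in sys.stdin if l.strip()], indent=2))'  \
      < "${MSC_TRACE_DOWNLOADS}"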

This trace log is not used by the build system but might help you reason about the downloads that were performed.

Additional Targets

The default CMake target will only build the C++ tools. In order to run any experiments, you have to build the respective targets explicitly. The following targets might be useful.

  • deploy — populates the database and runs the main experiment
  • httpd — starts an HTTP server listening at port 8000 providing a web front-end to the database
  • eval — runs cross validation and other tests (also note the targets eval-cross-valid, eval-puncture, eval-puncture-excl, eval-puncture-incl and eval-clean)
  • integrity — checks the database for inconsistencies
  • integrity-fix — checks the database for inconsistencies and tries to fix them (you shouldn't run into this unless you go messing with the database by manually deleting / adding files or executing SQL statements)
  • apidoc — builds Doxygen API reference documentation
  • benchmark — runs some benchmarks (there are not many of them)
  • report — typesets the written thesis in ./report/graphstudy.pdf (also note the target report-clean)
  • slides-kit — typesets the slides for the presentation given at the KIT on April 24 2018 in ./slides_2018-04-24_kit/graphstudy.pdf (also note the target slides-kit-clean)
  • slides-gd18 — typesets the slides for the presentation given at GD'18 on September 26 2018 in ./slides_2018-09-26_gd18/graphstudy.pdf (also note the target slides-gd18-clean)
  • paper — typesets a preliminary version of the paper submitted to GD'18 in ./paper/graphstudy-gd18.pdf as well as an extended version (as submitted to Arxiv) in ./paper/graphstudy-arxiv.pdf (also note the targets paper-cache, paper-pubar and paper-clean)
  • test — exercises all tests
  • maintainer-everything — builds all of the above targets except httpd

The deploy target will take a very long time to build (probably several days). The eval target will take a long time, too (probably several hours). Building the eval target will run the experiments again each time. If the default verbosity leaves you wondering whether the job is actually making any progress at all, you might want to increase it by setting the environment variable MSC_LOG_LEVEL to INFO (the default is NOTICE).
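
For example, the deploy target could be built with increased verbosity like this (a sketch assuming an in-tree build in the current directory):

$ MSC_LOG_LEVEL=INFO cmake --build . --target deploy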

Before you go ahead building the deploy and eval targets, you might want to read the section about driver configuration first.

The httpd target starts a local web server listening on port 8000 that serves visualizations and other useful insights into the current database. (Check it out!) It won't show any results of the eval runs, though. These are only stored in various JSON files in the ./eval/ directory. The server process automatically forks to the background. In order to shut it down again, send it SIGINT. Its process ID will be written to a file ${bindir}/.httpd.pid and also printed at startup. If you find a server still running and cannot figure out which process it is, visit http://localhost:${port}/about/ and look for a line that says “Process-ID”. The default ${port} is 8000. Once you know the process ID ${pid}, shut the server down like so:

$ kill -s INT ${pid}

The eval target runs all available evaluation experiments. The eval-cross-valid target only runs the normal (full) cross-validation, while the eval-puncture target only runs the (reduced) cross-validation experiments for “punctured” feature vectors. The eval-puncture-excl and eval-puncture-incl targets are both subsets thereof, running said experiments only for the case of sole exclusion or inclusion of a single property respectively. The eval-clean target deletes the results of any previous experiments. Beware that eval as well as all of the other mentioned eval-* targets will start by deleting all previous results as if eval-clean were built beforehand. Please note that for the eval-* targets which only build a subset of the complete eval target, this means that previous results will be deleted but not recreated. The takeaway is that you should think carefully before building eval or any of the eval-* targets – even more so given that they will take a long time to complete.

If the environment variable MSC_EVAL_PROGRESS_REPORT is set (to an absolute file name) the current progress and estimated time remaining for the evaluation will be appended to it in a format suitable as input for Gnuplot.

Overview

The CMake targets described in the previous section provide access to the most high-level actions you might want to perform. For more explicit tasks, direct interaction with the software will be required. The following sections describe how to use the most important components in this project.

A Fair Word of Warning: This code is a research project and not industrial software! While we try to hold its quality up to good standards, the primary reason for this software to be written was so it can be studied. Please do not expect this project to be ready-to-use for any use other than experimenting. Given this, it is hard to draw a line between “user interface” and “implementation details” for this software. You can repeat the experiments for which results were published (and even recreate the respective documents) at the push of a button as described above but be prepared to get confronted with lots of details and internals as soon as you try digging any deeper. That said, you are explicitly invited to do so. This project was made public so others can study our work and ideally benefit from it.

Concepts

This project uses the following terminology.

  • graph — set of nodes and edges (graph theory)
  • layout — mapping of vertices to two-dimensional coordinates for a given graph
  • property — multi-set of real numbers computed for a given layout
  • metric — single scalar number computed for a given layout
  • discriminator — function that takes two layouts and outputs a number indicating an aesthetic preference
  • fingerprint — fixed-length deterministic value computed for a graph or layout assumed to be practically unique

The source code frequently refers to discriminators as “tests” for historical reasons. This should be cleaned up over time.

Project Layout

This project consists of four major components:

  • A collection of command line tools in the ./src/ directory. These tools are written in C++.
  • A driver which can be used to populate a database with a collection of graphs and layouts, compute properties for them and build and train the discriminator model as well as competing metrics. The driver can also run as a local HTTP server to provide insights into the current data. The driver is found in the ./driver/ directory. It is written in Python with some XSLT, CSS and JavaScript technology for the web-interface.
  • The directory ./eval/ contains a set of experiments designed for evaluating the model. Those are implemented using a mix of Python and CMake scripting.
  • The directories ./report/, ./paper/ and ./slides_* contain sources for various publications and presentations about the project. The build system utilizes the components mentioned above in order to obtain numbers and figures for those publications.

Directories

Conceptually, this project uses the following directories.

  • The source directory is the root of the source code tree. We'll refer to this directory as ${srcdir} or “.” in the following. The driver always assumes ${srcdir} is the current working directory. In other words, you must always invoke the driver script from within the top-level source directory.
  • The build directory is the root of the build tree. We'll refer to it as ${bindir}. For an out-of-tree build, you may choose this directory freely. CMake will know about this choice so there is no need to mention it when building the high-level targets described earlier. However, when invoking the driver script directly, you need to provide the location of this directory. (It accepts the --bindir=${bindir} option for that purpose.)
  • The configuration directory is the directory where the driver ought to look for configuration files. We shall refer to it as ${configdir}. The default for ${configdir} is ${srcdir}/config/ but this may be overruled by setting the CMake variable MSC_CONFIG_DIR. (You either do this by passing the -DMSC_CONFIG_DIR=${configdir} option when you invoke cmake or via editing the configuration interactively using the ccmake tool.) The driver can be told about this directory via the --configdir=${configdir} option.
  • The data directory is the root of the tree where the database files are stored. We'll refer to it as ${datadir}. The default ${datadir} is ${bindir}/data/ but this can be changed via setting the CMake variable MSC_DATA_DIR and communicated to the driver using the --datadir=${datadir} option.
  • The cache directory is a directory where the driver stores non-essential data. It is the only directory that may be undefined. It has no default and its value is automatically picked up by the driver from the environment variable MSC_CACHE_DIR. You may delete that directory safely if it grows out of hand.

Command Line Tools

After building the project, the src directory will (in sub directories) contain a number of command line tools that are subsequently invoked by the driver script but may also have merits on their own. The tools are structured by their purpose in the following sub-directories.

  • generators — These tools take “nothing” and output a graph. The directory contains probabilistic graph generators and a program to “import” graphs from a variety of formats.
  • layouts — Layout algorithms; these tools take a graph and output a layout for it.
  • unitrans — Unary layout transformations; these tools take a layout and output another layout (or multiple if multiple rates are specified).
  • bitrans — Binary layout transformations; these tools take two layouts and output another layout (or multiple if multiple rates are specified).
  • properties — These tools take a layout, compute some property of it and output some data.
  • metrics — These tools take a layout, compute some metric for it and output some number.
  • visualizations — These tools take a layout and output a drawing (an image file of some sort).
  • utility — These tools do various things.

The sub-directory common contains no programs but C++ code that is shared by all command line tools. The static library is called libcommon.a when built. It contains a grab-bag of features needed for this project and is not intended for use by third-party code, although you might find it useful to take individual components out of this library and use them elsewhere (obeying the requirements imposed by the software license, of course).

All command line tools accept input files as positional arguments, reading from standard input if none is provided. If they produce any output, they write it to standard output or to the file specified via the --output option. All tools also have the ability to output “meta” information to a file specified via the --meta option (the default is to not output such information). Meta information will always be in JSON format and, unfortunately, you have to run the program and see what it outputs as the structure is not documented. Despite its name, this information is often not at all “meta” but the most essential thing the tool produces. In essence, everything that will be processed by another tool or program will be considered “output” and everything that is of interest to the driver script will be considered “meta”. The driver script treats output files as opaque, managing but not interpreting them.

To give you an idea how the tools might be used, consider the following example

$ mosaic --symmetric --nodes=1000 --output=sample.xml.bz2
$ picture --output=sample.svg sample.xml.bz2

which creates a random symmetric “mosaic” graph with approximately 1k nodes and saves it as the GraphML file sample.xml.bz2 with bzip2 (Burrows-Wheeler) compression applied. The file is then read again by the second command and a graphical rendition of the layout is saved as the file sample.svg.

The shell pipeline

$ wget -q -O - 'ftp://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcspwr/bcspwr01.mtx.gz'   \
      | import --format=matrix-market --simplify --meta=2 STDIO:gzip                           \
      | force --algorithm=fmmm                                                                 \
      | edge-length --kernel=boxed --output=histogram.txt
{ "nodes": 42, "edges": 49, "graph": "f549a2236f459c8c6ea7bb28a7884f31", "native": false, ... }

downloads (using the standard wget command line utility) a graph from NIST's “Matrix Market” as a gzip (Lempel-Ziv) compressed file, “simplifies” the graph (making edges undirected, deleting loops and fusing multiple edges into a single one) and converts it to the preferred GraphML format, then computes a force-directed layout for the graph and finally analyzes the distribution of edge lengths in that layout, saving a histogram as the text file histogram.txt. The command given the --meta=2 option will print additional information to standard error output (selected by the POSIX file descriptor 2) in JSON format, which is partially shown in the above snippet after the command prompt. (It cannot be printed to standard output, which is already used for the pipeline; doing so would clobber the graph data.) The histogram file could be plotted using a tool like gnuplot. The last program could (and probably should) also be instructed to output additional information like the mean or entropy in JSON format using the --meta option again, which was omitted in the example to avoid confusion. The value for the graph key in the shown JSON output is the fingerprint computed for the imported graph.

Please note that both of the above examples omit the directory part of the invoked programs for brevity.

A complete list of all tools is omitted here, please go look at the directories yourself. All of these tools accept a consistent set of command line options. Please run any tool with the --help option to see what options and arguments it accepts / requires.

File Names

Whenever a tool accepts a file name, it will also accept a decimal file descriptor. This is especially useful if you invoke a tool from another process and need more than one pipe to communicate all the data back and forth. Furthermore, the strings NULL and STDIO receive special treatment. The former will be understood as a request to perform no I/O at all (somewhat like using /dev/null) and the latter causes standard input or output to be used. The empty string has the same effect as NULL and the string “-” has the same effect as using the string STDIO. The reason the more verbose alternatives are provided as well (and are, in fact, recommended) is that the Boost Program Options library that is used for parsing the command line arguments will behave erratically in some cases when confronted with the empty string or the string “-”. It follows that if you should ever want to refer to a regular file by such a name, you will have to use a construct like ./NULL or ./42 to circumvent this special treatment.
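
For example, the mosaic / picture example from above could also be written as a single pipeline with the standard streams selected explicitly (a sketch):

$ mosaic --symmetric --nodes=1000 --output=STDIO | picture --output=sample.svg STDIO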

Apart from the remarks above, file names also support a special syntax to enable transparent compression. If the file name contains a colon, the portion after the last colon will be interpreted as a compression algorithm. For example, the “file name” file.dat:gzip refers to the file name file.dat which should be accessed using gzip compression. Beware that some operating systems use colons in regular file names. In that case, you must always append another colon at the end. For example, A:graphs\mansion.dat will cause confusion while A:graphs\mansion.dat: will work fine. Using the empty string otherwise has the same effect as not specifying any compression at all and will cause the compression to be inferred from the file name. If it ends in .gz it will be assumed that the file is gzip compressed and if it ends in .bz2 then bzip2 compression will be assumed. If the compression is specified explicitly, the strings gzip, bzip2 and none are accepted and have the obvious meaning. If you like being verbose, the string automatic may also be used instead of the empty string to the same effect.
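
For example, the following sketch (with a hypothetical layout file layout.xml) writes a gzip-compressed histogram even though the output file name does not end in .gz:

$ edge-length --kernel=boxed --output=histogram.txt:gzip layout.xml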

Environment Variables

All command line tools that may use non-determinism honor an environment variable MSC_RANDOM_SEED which, if set, will act as a deterministic seed for the pseudo random number generator. It may be any sequence of bytes. If this variable is not set, the program will behave non-deterministically.

Warning: The MSC_RANDOM_SEED variable is currently only honored by the command line tools but not by the driver. This has the consequence that if you set it globally (such as by putting MSC_RANDOM_SEED="f~rR9>Zh-1t'MxzVa<nb" in your ~/.profile file) and run the driver script, its graph generation process will livelock, invoking the same generator tool over and over again, rejecting its output because the graph is already in the database. Making the driver support a global random seed to make the whole process deterministic is an important but non-trivial open task. (It is not as trivial as seeding a pseudo random generator in the driver and using it to generate deterministic but different seeds for each tool invoked because the driver may be interrupted and in this case, the generator would have to pick up where it left off.)

Some environment variables are accepted by all command line tools and control their resource limits. They all have the form MSC_LIMIT_${RES} where ${RES} is one of the constants defined by the POSIX rlimit interface (spelled in all upper-case). Consult the manual page for the getrlimit or setrlimit system calls for that purpose. For example, running a command line tool with MSC_LIMIT_STACK=33554432 will cause the process to use a maximum stack size of 32 MiB, unless a lower hard-limit is already imposed in which case that limit will take effect. The special value of NONE is interpreted as a request to clear any soft-limit that might be in effect for the resource (using resources up to the active hard-limit, if any). This feature is only available on POSIX platforms. On other systems, those environment variables cannot be honored and setting them will cause an error.
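
For example, a tool could be run with a CPU time limit of one hour and with any soft stack limit cleared like this (a sketch with a hypothetical input file graph.xml):

$ MSC_LIMIT_CPU=3600 MSC_LIMIT_STACK=NONE force --algorithm=fmmm --output=layout.xml graph.xml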

The phantom tool is also sensitive to the environment variable MSC_DUMP_PHANTOM which, when set, will be interpreted as a file name into which to dump the “phantom” graph.

The princomp tool uses the value of the environment variable MSC_PRINCOMP_ORTHO_TOL – which should be a small positive floating-point number ε > 0 – to decide whether its results should be discarded because p1 ⋅ p2 > ε, where p1 and p2 are the determined first and second principal axes respectively. The default value is ε = 2⁻¹⁰ if the environment variable is not set.

Finally, all command line tools honor the COLUMNS environment variable in order to determine the width of the --help output in case a syscall to determine the terminal width is not available or does not succeed. This variable should be set to a positive integer (your shell might do this automatically).

The Driver

So far, we have been talking about “the driver script”. This was a lie. The driver is not a single script but a Python package with plenty of modules, some of which are executable. So in order to invoke a certain driver module ${module} you would execute the command

$ python3 -m driver.${module} ...

from within ${srcdir} passing any options and arguments the module expects instead of the ellipsis shown above.

The following driver modules are available:

  • deploy — populates the database
  • httpd — runs an HTTP server for the web-interface
  • compare — queries the discriminator model about pairs of layouts
  • doctests — runs Python doctests for the driver
  • archidx — prints statistics about graph archives
  • integrity — checks the integrity of the database and can try to fix inconsistencies
  • model — allows access to internals useful for preparing documents

Warning: The integrity module is not very well tested and might have fatal bugs. Be sure to have a backup of your (already corrupted) data before you screw it up completely.

Invocation

Like everything else in this project, the driver modules all support a --help option which will cause them to print a short help text and then exit immediately. Please make use of it to get detailed information about the arguments and options a driver module expects.

The following options are supported by all driver modules.

  • -B, --bindir=${bindir} — root of the build directory where to find executables (default: .)
  • -C, --configdir=${configdir} — search for configuration files in ${configdir} (default: config)
  • -D, --datadir=${datadir} — root of the data directory (can be created, default: data)
  • -v, --verbose — increase the logging verbosity by one level (may be repeated and combined)
  • -q, --quiet — decrease the logging verbosity by one level (may be repeated and combined)
  • --log-level=${level} — set the logging verbosity to one of the well-known syslog levels (by default, the value of the environment variable MSC_LOG_LEVEL is used, which in turn defaults to NOTICE)
  • --help — show usage information and exit
  • --version — show version information and exit

If an argument is shown for the long form of an option, the short form of the option accepts that same argument, too.
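
For example, the database could be populated from an out-of-tree build with slightly increased verbosity like this (a sketch using the placeholder conventions from above):

$ cd ${srcdir}
$ python3 -m driver.deploy --bindir=${bindir} --datadir=${bindir}/data -v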

Configuration

This section is about setting up the experiment, not build system configuration.

The configuration read by the driver script is found in the ./config/ directory. The Git repository contains two configurations ./config-light/ and ./config-heavy/ with the former intended for a quick check and the latter for a thorough experiment. ./config is a symbolic link to ./config-light/. The files in ./config-heavy/ and ./config-light/ might serve as a good starting point for writing your own configuration.

All *.cfg files have in common that blank lines are ignored as well as everything after a “#” character.

Graphs (graphs.cfg)

The file graphs.cfg controls the graphs that will be generated. Its format is a table where the first column lists the graph generator and the subsequent columns the desired number of graphs per size class. All columns except the first (which must not have a title) must have a title that specifies the size class. For example, the following configuration

                    SMALL     MEDIUM
LINDENMAYER         10        5
ROME                20        *

specifies that 10 small and 5 medium-sized LINDENMAYER graphs are desired as well as 20 small graphs from the ROME collection and all medium-sized graphs in there. Using an asterisk only makes sense for generators that import from a finite collection. If you build the deploy target over and over again, you might get tired of the driver scanning the graph archives each time. Setting the MSC_QUICK_ARCHIVE_IMPORT environment variable to a positive integer will cause the driver to take a short-cut and assume that the graphs that are currently in the database are all that can be found in the archive and not scan it again.

Warning: The driver parses the “table” in this file by interpreting each line as a list of tokens (one per column). The offset inside the file does not matter. Therefore, you cannot leave table cells empty. It is recommended that you format the file with aligned columns as a table to improve human readability but doing so is not required as far as the driver is concerned.

The following graph sizes are defined. A graph with n vertices belongs to a given size class if and only if nmin ≤ n < nmax holds.

Size Class      nmin       nmax
TINY               0         10
SMALL             10        100
MEDIUM           100      1,000
LARGE          1,000    100,000
HUGE         100,000          ∞

The size classes are specified by the enumerator GraphSizes which is defined in the file ./driver/constants.py. In case of doubt, please refer to this definition and consider the information provided by the table above as potentially outdated.

The following graph generators are defined:

  • SMTAPE — imports graphs from the SMTAPE set of the Harwell-Boeing collection in NIST's Matrix Market
  • PSADMIT — imports graphs from the PSADMIT set of the Harwell-Boeing collection in NIST's Matrix Market
  • GRENOBLE — imports graphs from the GRENOBLE set of the Harwell-Boeing collection in NIST's Matrix Market
  • BCSPWR — imports graphs from the BCSPWR set of the Harwell-Boeing collection in NIST's Matrix Market
  • RANDDAG — imports graphs from the RANDDAG collection hosted on graphdrawing.org
  • NORTH — imports graphs from the NORTH collection hosted on graphdrawing.org
  • ROME — imports graphs from the ROME collection hosted on graphdrawing.org
  • IMPORT — imports graphs (and optionally native layouts) from arbitrary user-defined sources specified in the user-provided ${configdir}/imports.json configuration file
  • LINDENMAYER — probabilistic algorithm creating graphs with native layouts utilizing a stochastic L-system
  • QUASI3D — probabilistic algorithm creating graphs with native layouts from a random projection of a regular 3-dimensional lattice onto a 2-dimensional plane
  • QUASI4D — probabilistic algorithm creating graphs with native layouts from a random projection of a regular 4-dimensional lattice onto a 2-dimensional plane
  • QUASI5D — probabilistic algorithm creating graphs with native layouts from a random projection of a regular 5-dimensional lattice onto a 2-dimensional plane
  • QUASI6D — probabilistic algorithm creating graphs with native layouts from a random projection of a regular 6-dimensional lattice onto a 2-dimensional plane
  • GRID — probabilistic algorithm creating graphs with native layouts as regular n × m grids
  • TORUS1 — probabilistic algorithm creating graphs as regular n × m grids connected to form a 1-torus (a cylinder)
  • TORUS2 — probabilistic algorithm creating graphs as regular n × m grids connected to form a 2-torus (a doughnut)
  • MOSAIC1 — probabilistic algorithm creating graphs with native layouts by recursively splitting the facets of an initial regular polygon.
  • MOSAIC2 — like MOSAIC1 but the amount of randomness is reduced to produce more symmetric results.
  • BOTTLE — probabilistic algorithm creating graphs with native layouts as axonometric projections of 3D-meshes of random bodies of revolution
  • TREE — probabilistic algorithm creating random trees
  • RANDGEO — generates a random geometric graph using a procedure similar to the one presented by Markus Chimani at GD'18

These constants are specified by the enumerator Generators which is defined in the file ./driver/constants.py.

Download Cache: Downloaded files can be cached locally in a directory specified by the environment variable MSC_CACHE_DIR or the system's default temporary directory (/tmp/ on POSIX) if said variable is not set. If you set the environment variable to the (absolute) path of a directory that (unlike /tmp/) won't be wiped routinely, the driver will download archives only once. Repeated downloads not only slow down the graph generation process, they also put unnecessary load on the servers of your fellow researchers who are kindly providing the archives to the public at no charge. In the worst case, a server operator might consider your repeated download attempts abusive and blacklist your IP address. Finally, using a cache also enables you to work offline without any internet connection (once the cache is populated, that is, of course).

Layouts (layouts.cfg)

The file layouts.cfg specifies the desired layouts to compute. Its format is a list of layout algorithms followed by a list of graph sizes for which the algorithm should be applied. For example, the configuration

NATIVE  ...
FMMM    ... LARGE
STRESS  TINY SMALL

specifies that native layouts should be “computed” for graphs of all sizes (the ellipsis) while FMMM layouts should be computed for graphs up to and including LARGE size and STRESS layouts should be computed for TINY and SMALL graphs only. The ellipsis can be used in three ways. If it is used alone, it selects all size classes. If it is used as the first or last word in a row, it refers to all size classes up to (and including) or all size classes from and above the following or preceding class respectively. If an ellipsis is used between two other size classes, it selects all size classes in between.

The following layout algorithms are defined:

  • NATIVE — this is not a layout “algorithm” but merely a request to use the “native” layout (if any) provided by the graph generator
  • FMMM — Fast Multipole Multilevel layout algorithm
  • STRESS — energy-based layout using stress minimization
  • DAVIDSON_HAREL — Davidson-Harel layout algorithm
  • SPRING_EMBEDDER_KK — spring-embedder layout algorithm by Kamada and Kawai
  • PIVOT_MDS — pivot MDS (multi-dimensional scaling) layout algorithm
  • SUGIYAMA — Sugiyama's layout algorithm
  • RANDOM_UNIFORM — garbage layout algorithm assigning independent random coordinates (drawn from a uniform distribution) to each vertex
  • RANDOM_NORMAL — garbage layout algorithm assigning independent random coordinates (drawn from a normal distribution) to each vertex
  • PHANTOM — garbage layout algorithm using the coordinates of a force-directed layout computed for a random “phantom” graph (which has the same number of nodes and edges)

These constants are specified by the enumerator Layouts which is defined in the file ./driver/constants.py.

Interpolated (interpolation.cfg) and Worsened (worsening.cfg) Layouts

The files interpolation.cfg and worsening.cfg control the computation of “interpolated” and “worsened” layouts respectively. Their format is the same: a list of interpolation or worsening algorithms followed by a list of numbers (between 0 and 1) that select the rates at which interpolations or worsenings should be computed using that algorithm. For example, the configuration

PERTURB  0.15  0.25
MOVLSQ   0.10

specifies that worsened layouts using the PERTURB algorithm shall be computed at rates of 15 % and 25 % while worsened layouts using the MOVLSQ algorithm should be computed only for a rate of 10 %.

The following layout interpolation algorithms are available:

  • LINEAR — uses simple linear interpolation between vertex coordinates
  • XLINEAR — like LINEAR but tries to reduce paradox effects by aligning the principal axes of the two parent layouts beforehand

These constants are specified by the enumerator LayInter which is defined in the file ./driver/constants.py.

The following layout worsening algorithms are available:

  • FLIP_NODES — flips the coordinates of randomly selected pairs of nodes
  • FLIP_EDGES — flips the coordinates of randomly selected pairs of adjacent nodes
  • MOVLSQ — deforms the entire drawing using affine transformations based on moving least squares as described by Schaefer et al.
  • PERTURB — adds white noise to vertex coordinates

These constants are specified by the enumerator LayWorse which is defined in the file ./driver/constants.py.

Properties (properties-disc.cfg and properties-cont.cfg) and Metrics (metrics.cfg)

The properties-disc.cfg, properties-cont.cfg and metrics.cfg files use the same format as the layouts.cfg file except that they specify (discrete or continuous) properties and metrics rather than layouts to be computed. For example, the configuration file

ANGULAR      SMALL ... HUGE
EDGE_LENGTH  MEDIUM

specifies that the ANGULAR property shall be computed for all layouts of graphs from small to huge size (both inclusive) and the EDGE_LENGTH property shall be computed for layouts of medium size graphs only.

The following properties are available:

  • RDF_GLOBAL — pairwise distances between nodes
  • RDF_LOCAL — pairwise distances between nodes separated in the graph no further than a given threshold
  • ANGULAR — angles between edges incident to the same node
  • EDGE_LENGTH — edge lengths
  • PRINCOMP1ST — node coordinates along the first (major) principal axis
  • PRINCOMP2ND — node coordinates along the second (minor) principal axis
  • TENSION — quotients of Euclidean distance in the layout and graph-theoretical distance between vertices

These constants are specified by the enumerator Properties which is defined in the file ./driver/constants.py.

The following metrics are available:

  • STRESS_KK — stress as defined by Kamada and Kawai with a desired edge length of 100
  • STRESS_FIT_NODESEP — stress as defined by Kamada and Kawai with a desired edge length chosen to minimize the result value
  • STRESS_FIT_SCALE — stress as defined by Kamada and Kawai with a desired edge length of 100 computed after scaling the layout homogeneously to minimize the result value
  • CROSS_COUNT — number of edge crossings
  • CROSS_RESOLUTION — minimal angle between any two intersecting edges
  • ANGULAR_RESOLUTION — minimal angle between any two edges incident to the same node
  • EDGE_LENGTH_STDEV — standard deviation of edge lengths

These constants are specified by the enumerator Metrics which is defined in the file ./driver/constants.py.

Note that the HUANG discriminator cannot work unless the CROSS_COUNT, CROSS_RESOLUTION, ANGULAR_RESOLUTION and EDGE_LENGTH_STDEV metrics are available.

Punctures (puncture.cfg)

The puncture.cfg file is simply a list (one item per line) of properties that should be deliberately set to zero when training and testing the model. For example, the configuration

RDF_GLOBAL
RDF_LOCAL

will cause the entries relating to the properties RDF_GLOBAL and RDF_LOCAL (continuous or discrete) to be “punctured” from the feature vectors. This is really only useful for experimenting and the environment variable MSC_PUNCTURE may be used to avoid puncturing by accident. The variable is expected to be set to a non-negative integer specifying the number of punctured properties. In the above example, you would set MSC_PUNCTURE=2 when running the driver. It will then check that there are indeed exactly two properties punctured and trigger an error if the active configuration disagrees. If the environment variable is unset, no check can be performed and a warning will be printed.

See above for a list of available properties.

Import Sources (imports.json)

There is one last configuration file, imports.json, which, as its name suggests, is in JSON format. This file specifies where to find import sources for graphs (for the IMPORT generator). Its format is either a single JSON object or a JSON array of JSON objects. Using a single object is just a convenience for using an array with only one element. Each object specifies one archive which will be considered in turn. It is therefore most useful to specify multiple import sources only in conjunction with selecting to import all available graphs (using the * in the graphs.cfg file). An import source must always have a type. The remaining fields permissible (and some required) in the JSON object depend on the type of the archive. The following types are defined and accept or require the attributes mentioned as nested items. Items with no default value mentioned are mandatory.

  • DIR — refers to a local directory

    • type (string) — must be the text DIR
    • directory (string; see notes a, b) — specifies the directory (path) to scan for graph files
    • format (string; see note d) — specifies the file format used by the archive
    • compression (string; see note c, default: NONE) — specifies the compression (if any) applied to the files in the archive
    • pattern (string, default: *) — specifies a POSIX globbing expression by which to select files in the directory
    • recursive (boolean, default: false) — selects whether or not the directory shall be scanned recursively for graph files to import
    • layout (boolean, default: null; see note e) — specifies whether or not the graph files have an associated native layout
    • simplify (boolean, default: false) — specifies whether or not to “simplify” the imported graphs by pruning multiple edges, loops and making the graph undirected
  • TAR — refers to a tarball specified via a URL

    • type (string) — must be the text TAR
    • url (string) — specifies the URL of the tarball
    • format (string; see note d) — specifies the file format used by the archive
    • compression (string; see note c, default: NONE) — specifies the compression (if any) applied to the files in the archive
    • cache (string; see note b, default: don't cache) — specifies a file name (not a full path) to use for caching the tarball locally which is useful if the URL is not a file://
    • checksum (string, default: don't verify checksum) — specifies the cryptographic hash algorithm and hex digest to expect for the tarball in the format ${algo}:${hash} where ${algo} is one of the strings understood by the hashlib module from the Python standard library and ${hash} is a hexadecimal encoding of the expected checksum
    • pattern (string, default: *) — specifies a POSIX globbing expression by which to select files in the directory
    • layout (boolean, default: null; see note e) — specifies whether or not the graph files have an associated native layout
    • simplify (boolean, default: false) — specifies whether or not to “simplify” the imported graphs by pruning multiple edges, loops and making the graph undirected
  • URL — refers to a collection of URLs of individual graphs

    • type (string) — must be the text URL
    • urls (array of strings) — lists the URLs of the graph files to consider
    • format (string; see note d) — specifies the file format used by the archive
    • compression (string; see note c, default: NONE) — specifies the compression (if any) applied to the files in the archive
    • layout (boolean, default: null; see note e) — specifies whether or not the graph files have an associated native layout
    • simplify (boolean, default: false) — specifies whether or not to “simplify” the imported graphs by pruning multiple edges, loops and making the graph undirected
    • name (string; see note b, default: www) — specifies an informal name for the archive and should be a valid identifier (\w+)
    • cache (boolean, default: false) — specifies whether or not to cache the downloaded files locally in a database
  • NULL — is a dummy archive that contains no graphs

    • type (string) — must be the text NULL (That's right: the string literal "NULL" as opposed to the special JSON value null.)

a) Environment variables can be expanded using shell syntax (e.g. ${HOME}/work/ might get expanded to /home/5gon12eder/work/; the curly braces may be omitted).

b) If the first character is a “~” it will be expanded to a user's home directory (e.g. ~/work/ might get expanded to /home/5gon12eder/work/ and ~foo/work/ to /home/foo/work/).

c) Acceptable values for the compression attribute are GZIP, BZIP2 and NONE with the obvious meaning. These constants are case-insensitive.

d) Acceptable values for the format attribute may be found by invoking the import tool (${bindir}/src/generators/import) with the --show-formats option. As of this writing, the following formats were supported. The names are case-insensitive. Links to the official documentation of the format are provided where available. Please also refer to the reference documentation of the ogdf::GraphIO API.

  • BENCH
  • CHACO
  • DL — UCINET DL format
  • DMF — DIMACS Max Flow Challenge
  • DOT
  • GDF — GUESS Database File
  • GD_CHALLENGE — Graph Drawing Challenge: Area Minimization for Orthogonal Grid Layouts
  • GEXF — Graph Exchange XML Format
  • GML — Graph Modelling Language
  • GRAPH6 — the Graph6 format represents a (preferably dense or small) simple undirected graph as a string containing printable characters between 0x3F and 0x7E
  • GRAPHML — Graph Markup Language
  • LEDA — LEDA Native File Format for Graphs
  • MATRIX_MARKET
  • PLA
  • PM_DISS_GRAPH — graph file format from Petra Mutzel's PhD thesis
  • ROME — Rome-Lib format
  • RUDY
  • STP — SteinLib STP Data Format
  • TLP — Tulip software graph format
  • YGRAPH

e) If layout is set to true then all graphs must have an associated layout, which will be treated as the native layout for the respective graph. If the archive contains multiple versions of the same graph, all but one will be discarded. If layout is set to false, only the graph data will be imported, even if layout data is available. Duplicate graphs will be discarded. If layout is set to the special value null (which is the default), then graphs are assumed to have no native layout, but if a graph happens to have associated layout information, it will be imported and (later) treated as an unclassified layout. (Such layouts will not be used for training or testing and have no implicit quality assigned.) In this case, if the archive contains multiple versions of the same graph, all layouts (if any) will be imported.

Unlike vanilla JSON, the format used for the imports.json file allows simple comments: if the first non-white-space characters on a line are two consecutive forward slashes, the entire line is ignored. You can, therefore, write

{
    // "bar" : "Sorry, not today!",
    // foo is always good
    "foo" : 42
}

which will be interpreted as if you had written { "foo" : 42 } instead.

To get an idea, you might want to have a look at the file ./driver/resources/imports.json which provides the definitions for the well-known graph archives that are supported out-of-the-box. One important difference to be aware of: that file specifies archives as the values in a JSON object (with the key being the name of the well-known import source), whereas the configuration file you may write is supposed to contain the archive specifications as the elements of a JSON array (with no keys).
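
For illustration only, here is a minimal configuration file defining a single archive of the URL type described above; the URLs and the name are placeholders, and attributes not listed (such as layout) fall back to their defaults:

[
    {
        "type" : "URL",
        // the URLs below are placeholders; substitute your own graph files
        "urls" : [
            "https://example.com/graphs/first.graphml",
            "https://example.com/graphs/second.graphml"
        ],
        "format" : "GRAPHML",
        "compression" : "NONE",
        "simplify" : true,
        "name" : "example",
        "cache" : true
    }
]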

Environment Variables

Here is a summary of all environment variables that are honored by the driver.

  • MSC_MODEL_DEBUGDIR — If set to a directory, the driver will dump human-readable information about the built neural network into it.
  • MSC_NN_TEST_SUMMARY — If set to a file name, the driver will write to it a summary of how the various discriminators performed on the test data set in JSON format.
  • MSC_PUNCTURE — If set to a non-negative integer N, the driver will check that exactly N properties are punctured (see puncture.cfg).
  • MSC_LOG_LEVEL — If set to one of the well-known syslog levels, this defines the initial verbosity of the driver (may be altered by passing additional --verbose or --quiet options or overridden by the --log-level option).
  • MSC_CACHE_DIR — If set to a directory, the driver will use it to cache downloaded files and rendered pictures served by the local HTTP server.
  • MSC_QUICK_ARCHIVE_IMPORT — If set to a positive integer, graph archives won't be scanned for more graphs if the multiplicity “*” (see graphs.cfg) was selected for the desired number of graphs. Setting this variable to 0 has the same effect as not setting it at all and will cause the archive to be scanned.

The following environment variables can be used to specify external programs the driver will use.

  First Choice        Second Choice   Default Value
  MSC_GNUPLOT         GNUPLOT         gnuplot
  MSC_IMAGE_MAGICK    IMAGE_MAGICK    convert
  MSC_ZCAT            ZCAT            gzip -dc
  MSC_BZCAT           BZCAT           bzip2 -dc

As the driver starts up, it will first check the environment variable listed as “First Choice”; if that is not set, it will check the variable listed as “Second Choice”; and if that isn't set either, it will use the value listed as “Default Value”. At verbosity INFO or higher, the driver will report the result of these checks. In case of doubt, the information printed there will be more up-to-date than this README document.

Once a value is obtained by the procedure described above, it will be tokenized according to the Shell rules and used thenceforth. The first token must be an absolute path (that is, use /usr/bin/zcat not zcat). For security reasons, the individual tokens must not require any escaping, otherwise, the driver will reject the command. (That is, you cannot have "/path/with spaces/foo" \(lol\) as a command. You must create a wrapper script if your system should happen to be this wicked.)
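
For example (the paths shown are typical Linux locations and merely illustrative), you could select the external programs like this before starting the driver:

$ export MSC_ZCAT="/usr/bin/gzip -dc"    # first choice; tokenized like a shell command, first token is an absolute path
$ export IMAGE_MAGICK=/usr/bin/convert   # second choice; only consulted if MSC_IMAGE_MAGICK is unset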

Typesetting Documents

This repository also includes the TeX source code for the written thesis, the presentation slides and the paper submitted to GD'18. What is more, it includes a wealth of auxiliary scripts that automatically extract experimental results and format them for inclusion in the TeX files. Therefore, if you re-run the experiments (build the deploy and eval targets) and then typeset any of the documents, they will automatically contain the numbers corresponding to the results of your experiments. Please note that while doing this is kind of fun and might help you convince yourself that the published numbers are sound, you must not publish the documents you've typeset. Typesetting your own version with different inputs constitutes the creation of a derived work which the CC BY-NC-ND license does not allow you to redistribute. This restriction, unfortunately, is necessary not only for legal reasons but also to prevent multiple documents with slightly different numbers from circulating.

Typesetting any of the documents works in much the same way. On the CMake level, you build the report or paper target or the slides-kit and slides-gd18 targets. Each of these targets ${target} is accompanied by a ${target}-clean target which removes the auxiliary files created by TeX for the respective target. If you edit any TeX file and are unfortunate enough, removing those files might be required in order for TeX to be able to typeset the document successfully again, as any TeX'nician will have experienced from time to time.
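
For example, assuming ${bindir} refers to your build directory as elsewhere in this document, the paper could be typeset and its auxiliary files removed again like so:

$ cmake --build ${bindir} --target paper          # typeset the GD'18 paper
$ cmake --build ${bindir} --target paper-clean    # remove the auxiliary TeX files afterwards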

The actual typesetting is accomplished by the ./utils/typeset.py script. This Python script creates a symlink farm to enable out-of-tree TeX builds and takes care of invoking the various TeX tools in the appropriate order and the required number of times. By default, all documents draw plots and graph layouts directly in TeX using the tikz package. This is very cool but – unfortunately – takes a lot of time and a lot of memory. Since neither TeX, pdfTeX nor XeTeX (at least not from the TeXLive 2018 distribution) is capable of dynamically allocating the required amount of memory, the only TeX engine that will work out-of-the-box is LuaTeX. Therefore, the build system will typeset all documents using LuaLaTeX, which is pretty darn slow. Expect build times of several minutes or more.

On-Demand Downloads: The documents include some example graphs taken from public graph collections which will be downloaded on-demand when the documents are typeset. Unless the files are already available from a local cache, access to the internet will be required. See the section “Automatic File Downloads” for more information. Some documents may require additional downloads (other than graphs) that will be mentioned below.

TeX'nical Troubleshooting: If the environment variable MSC_TEX_REPORT_HTML is set (to an absolute file name), the ./utils/typeset.py script will write a report (in HTML format) to that file which contains a nicely formatted combination of all relevant log files. In case of an error, the HTML anchor #error-1 will land you on the first error (and so forth for any subsequent errors).

Typesetting Documents Without Experimental Results: Since building the eval target always runs the full experiments from scratch (which takes an enormous amount of time), it is not an explicit dependency of the report, paper and slides-* targets, although you have to build it once before those targets can be built. As a cheataround, you may set the environment variable MSC_LAZY_EVAL_OKAY to a positive integer, which will produce versions of the publications with dummy values substituted for the actual evaluation results.
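
For example, the following (illustrative) commands would typeset the thesis with such dummy values:

$ export MSC_LAZY_EVAL_OKAY=1
$ cmake --build ${bindir} --target report    # dummy values are substituted for the evaluation results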

Official Logos: In order to avoid copyright and trademark issues, no official logos are included in the repository. Instead, a transparent picture of the same size will be used. You can provide the absolute paths of those logos you have handy by setting the CMake variables MSC_LOGO_KIT, MSC_LOGO_ALGO and MSC_LOGO_IOSB accordingly. The logos must be in PDF format. (The ./maintainer/configure script will recognize an environment variable with the same name and pass its value on to CMake.)

Checking Assertions via TeX: Some (admittedly very few) textual claims can be verified by automatic assertions added to the LaTeX code via \directlua magic. In order to enable these checks, set the environment variable MSC_TEX_ASSERT to a positive integer. You should not combine this with setting MSC_LAZY_EVAL_OKAY as you cannot expect correct results when cheating in the first place.

Typesetting the Written Thesis

Building the report target will download the Roboto font from Google Fonts which is licensed under the Apache License, Version 2.0.

Furthermore, a photo will be downloaded from the Wikimedia Commons for inclusion in the typeset document. This file has been released into the public domain by its author.

Typesetting the GD'18 Paper

Building the paper target will download the two files llncs.cls and splncs04.bst which are copyrighted by Springer Verlag for its “Lecture Notes in Computer Science” series. The latter file states that it is available under the LaTeX Project Public License distributed from CTAN archives in directory macros/latex/base/lppl.txt; either version 1 of the License, or any later version. The other file lacks a copyright notice but it is assumed that the same conditions apply.

It is possible to build a “cache” of those pictures that are resource-hungry to render. The cache consists of a directory that contains pre-rendered PDF or EPS versions of the pictures, which are thenceforth included as-is instead of being re-drawn from first principles each time the main TeX document is typeset. This feature is implemented using the externalize library of tikz. The cache can be built via the paper-cache target and removed again via the paper-clean target. If a cache exists, the pre-rendered pictures will be utilized automatically.
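
For example (again assuming ${bindir} is your build directory):

$ cmake --build ${bindir} --target paper-cache    # pre-render the resource-hungry pictures
$ cmake --build ${bindir} --target paper          # now picks up the cached pictures automatically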

Warning: The cache will not be re-built as needed. If any TeX code is changed that might affect the rendering of a picture, it is necessary to build the paper-cache target again in order to update the cached pictures. Otherwise building the paper target will insert outdated pictures into the document. Unfortunately, there is no reliable way to track dependencies through TeX code.

There is also the paper-pubar target, which creates an archive ${bindir}/paper/graphstudy.zip with the relevant files for the paper. Please note that this archive contains multiple files at the top-level as opposed to a single directory in which all actual files are found.

You will notice that some enumerators are spelled differently in the paper than in the other publications, the source code or this README document. Those renamings were driven by the desire to use constant names that are shorter and, in some cases, politically more correct. You can find the mappings of “source code” to “paper” names in files called ./paper/rename-*.txt, all of which have the format of one renaming per line with the “source code” name on the left and the “paper” name on the right. Blank lines are ignored, as is everything following the first “#” character on a line.

Typesetting the Presentation Slides

The presentation slides built by the slides-kit and slides-gd18 targets use the KIT's beamer theme. KIT members can obtain it from the KIT's intranet. Unfortunately, this resource is not available to the general public even though the files state that they are subject to the terms of the LaTeX Project Public License, either version 1.2 of this license or (at your option) any later version. Additional trademark restrictions might apply. The files for this theme are neither included in the repository nor downloaded during the build process. Rather, the theme is assumed to be installed on the host system. If doing so is inconvenient to you, you may instead prepare a ZIP archive with the following files (not all of which might be strictly necessary).

KIT16.clo
KIT18.clo
KIT20.clo
KIT22.clo
KIT24.clo
KITcolors.sty
KITdefs.sty
KITmcfloat.sty
beamercolorthemeKIT.sty
beamerfontthemeKIT.sty
beamerinnerthemeKIT.sty
beamerouterthemeKIT.sty
beamerthemeKIT.sty
kit_logo_de_1c_schwarz.pdf
kit_logo_de_4c_positiv-rgb.pdf
kit_logo_de_4c_positiv.pdf
kit_logo_en_1c_schwarz.pdf
kit_logo_en_4c_positiv-rgb.pdf
kit_logo_en_4c_positiv.pdf

The archive must contain the files directly at the top-level – not within a sub-directory. Tell CMake (at configuration time) the location of this ZIP archive by setting the variable MSC_KIT_BEAMER_ZIP to the absolute path where the archive can be found; it will then be extracted into the build directory as needed without clobbering the global TeX installation on your system. (The ./maintainer/configure script will recognize an environment variable with the same name and pass its value on to CMake.)
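
For example, one way to pass the location is the environment variable route just mentioned (the path is merely an illustration):

$ export MSC_KIT_BEAMER_ZIP=${HOME}/archives/kit-beamer-theme.zip
$ ./maintainer/configure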

Tests

There are four types of tests:

  • Unit Tests test individual functions from the C++ support library libcommon.a. For each component ${foo} in that library (consisting of a header file ./src/common/${foo}.hxx and a source file ./src/common/${foo}.cxx and, optionally, a file ./src/common/${foo}.txx with inline C++ code), there exists a unit test file ./test/unit/${foo}.cxx which will be compiled into an executable ${bindir}/test/unit/test-${foo}. Invoking that executable with no arguments will run all unit tests for that component. A short message will be printed for each exercised test as well as a summary for the overall test harness. If a test fails, a more detailed error message will be printed.

  • Doctests are the way unit testing is done in the driver script. They can be exercised by running the doctest driver module (i.e. python3 -m driver.doctests). It accepts a --help option that informs you about different ways to run the tests.

  • CLI Tests test the command line interface of a program. They are defined in the same CMakeLists.txt file that defines the executable. These tests are usually fairly trivial but make sure that the program can be invoked at all.

  • System Tests test the overall functioning of the driver script.

Either of the following two commands will exercise all of the above tests.

$ cmake -E chdir ${bindir} ctest -T Test   # using CTest
$ cmake --build ${bindir} --target test    # using CMake

Environment Variables

The following environment variables are honored by the unit test driver:

  • MSC_RANDOM_SEED — This environment variable may be set to an arbitrary byte sequence in order to make unit tests deterministic. Not all tests honor this variable at the moment, though.

  • MSC_TEST_ANSI_TERMINAL — Enables colorized output (using ANSI escape sequences) if set to a positive integer and disables it if set to 0. The unit test driver is currently not smart about figuring out whether the output terminal might support ANSI escape sequences or even is a terminal to begin with. If this variable is not set, no colorized output will ever be produced and if it is set, its value will be definitive.

  • MSC_TEST_ANSI_COLOR_SKIPPED — If colorized test output is enabled, the value of this environment variable will be interpreted as a decimal digit specifying an ANSI terminal color to use for skipped unit tests. The default value is 3 (yellow). If colorized test output is not enabled, this environment variable has no effect.

  • MSC_TEST_ANSI_COLOR_FAILED — If colorized test output is enabled, the value of this environment variable will be interpreted as a decimal digit specifying an ANSI terminal color to use for failed unit tests. The default value is 1 (red). If colorized test output is not enabled, this environment variable has no effect.

  • MSC_TEST_ANSI_COLOR_ERROR — If colorized test output is enabled, the value of this environment variable will be interpreted as a decimal digit specifying an ANSI terminal color to use for unit tests that encountered hard errors. The default value is 5 (purple). If colorized test output is not enabled, this environment variable has no effect.

  • CTEST_OUTPUT_ON_FAILURE (standard CTest setting) — If set to 1, the individual output of failed tests will be shown. Setting it to 0 turns on the default behavior of not showing the output of individual tests.

  • CTEST_PARALLEL_LEVEL (standard CTest setting) — If set to a positive integer N, run up to N tests in parallel. (Tip: You may consider export'ing CTEST_PARALLEL_LEVEL=$(nproc) in your ~/.profile to utilize all of your CPUs whenever you're running CTest.)

Several unit tests are sensitive to specific environment variables that will usually cause them to output some debugging information. You will recognize these environment variables when reading the unit test code (which you will likely do after a test has failed); in that case, you'll understand what effect the variable has and why you might want to utilize it.
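
For example, to run the whole test suite with failure output enabled and one test per CPU core:

$ export CTEST_OUTPUT_ON_FAILURE=1
$ export CTEST_PARALLEL_LEVEL=$(nproc)
$ cmake -E chdir ${bindir} ctest -T Test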

GNU Free Documentation License

Copyright (C) 2018 Moritz Klammler

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

TODOs for this README Document

  • Explain how to find out what constants an enumerator has (driver)
  • Give an overview how the C++ code is organized
  • Give an overview how the driver code is organized
  • Write something about benchmarking (or delete the benchmarks sub-directory?)
  • Give pointers to potential contributors
  • Explain how to “bring your own” graph generator, layout / worsening / interpolation algorithm, property or metric
  • Can I also “bring my own” discriminator?
