ClusterGAN fork for climate data experiments.
This repository is largely based on the repository ClusterGAN: A PyTorch Implementation, which in turn is based on the original TensorFlow implementation of ClusterGAN.
This codebase was developed within the following environment:

- python 3.6.9
- pytorch 1.5.0
- torchvision 0.6.0
- numpy 1.18.1
- matplotlib 3.1.3
- seaborn 0.10.1
- tqdm 4.47.0
In `umap-project.py`, we also use the GPU-enabled implementation of UMAP available in RAPIDS, a suite of open-source GPU-accelerated libraries and APIs. The line responsible for that is the following:

```python
from cuml import UMAP
```

One can switch to the standard CPU implementation simply by replacing this line with:

```python
from umap import UMAP
```

in which case one needs to ensure the `umap-learn` package is installed.
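If you want the script to work in both setups without editing it, the backend choice can be wrapped in an import guard. This is a sketch of such a guard, not code that exists in the repository:

```python
# Sketch of an optional import guard (not part of the repository):
# prefer the GPU-accelerated UMAP from RAPIDS cuml, fall back to the
# CPU implementation from umap-learn if cuml is unavailable.
try:
    from cuml import UMAP        # GPU implementation (RAPIDS cuml)
except ImportError:
    try:
        from umap import UMAP    # CPU implementation (umap-learn)
    except ImportError:
        UMAP = None              # neither backend is installed
```

With this guard, the rest of the script can use `UMAP` unchanged regardless of which backend is available.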
So far, we have narrowed the scope of the applications to MNIST only. To run ClusterGAN on the MNIST dataset, one may use the following command:

```shell
CUDA_VISIBLE_DEVICES=gpu_id python train.py --run-name=test_run --batch-size=64 --epochs=500 --num-workers=16 --snapshots-period=15 --latent-dim=32 --cat-num=10
```
where `gpu_id` is the id of the GPU to use (for example, `0`). As a result, a directory `./logs/run_name` will be created containing the generated output (models, generated examples, training figures) for the training run.
Options for the script `train.py`:

option | description |
---|---|
`--run-name` | run name (str, default: `test_run`) |
`--batch-size` | batch size (int, default: 64) |
`--epochs` | number of training epochs (int, default: 200) |
`--wass-metric` | flag for using the Wasserstein metric (True if the option is set) |
`--num-workers` | number of dataset workers (int, default: 1) |
`--snapshots-period` | period, in epochs, for saving model snapshots (int, default: -1, meaning only the final models are saved) |
`--latent-dim` | dimensionality of the real-valued part of the embeddings (int, default: 32) |
`--cat-num` | dimensionality of the categorical part of the embeddings, i.e. the number of categories (int, default: 10) |
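The `--latent-dim` and `--cat-num` options correspond to the two parts of the ClusterGAN latent code: a continuous Gaussian part and a one-hot categorical part, concatenated into a single vector. A minimal NumPy sketch of such a sampler (the function name and the `sigma` scale are illustrative, not taken from the repository):

```python
import numpy as np

def sample_latent(batch_size, latent_dim=32, cat_num=10, sigma=0.1):
    """Sketch of ClusterGAN-style latent sampling: a continuous
    Gaussian part zn and a one-hot categorical part zc, concatenated.
    Names and sigma are illustrative, not from the repository."""
    zn = sigma * np.random.randn(batch_size, latent_dim)
    idx = np.random.randint(0, cat_num, size=batch_size)
    zc = np.zeros((batch_size, cat_num))
    zc[np.arange(batch_size), idx] = 1.0       # one-hot category
    return np.concatenate([zn, zc], axis=1)

z = sample_latent(8)  # shape (8, latent_dim + cat_num) = (8, 42)
```

The one-hot part is what lets the encoder recover a discrete cluster assignment from a generated example.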
Next, one may want to apply the encoder network to data. The following script may be useful in this case:

```shell
CUDA_VISIBLE_DEVICES=gpu_id python encode.py --run-name=test_run --snapshot-final --num-examples=60000 --batch-size=512 --dataset-train --dataset-workers=16
```

or, alternatively:

```shell
CUDA_VISIBLE_DEVICES=gpu_id python encode.py --run-name=test_run --snapshot-stage=100 --batch-size=512 --dataset-workers=16
```
As a result, an encoded data file `*.npz` will be saved in the directory of the run, e.g., `./logs/test_run/mnist_train_stage-final_encoded.npz` or `./logs/test_run/mnist_train_stage-ep0135_encoded.npz`. The file contains three named arrays: `zn`, `zc_logits`, and `labels`, holding the real-valued part of the embeddings, the categorical logits, and the true labels of the examples, respectively.
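The encodings file can then be consumed with plain NumPy; for example, a hard cluster assignment is the argmax over the categorical logits. The snippet below builds a synthetic file just to illustrate the layout (the array names follow the description above; the data itself is fake):

```python
import numpy as np

# Build a synthetic encodings file with the same layout as encode.py's
# output: named arrays zn, zc_logits, and labels (the data is fake).
n, latent_dim, cat_num = 100, 32, 10
np.savez("encoded_demo.npz",
         zn=np.random.randn(n, latent_dim).astype(np.float32),
         zc_logits=np.random.randn(n, cat_num).astype(np.float32),
         labels=np.random.randint(0, cat_num, size=n))

data = np.load("encoded_demo.npz")
# A hard cluster assignment is the argmax over the categorical logits.
clusters = data["zc_logits"].argmax(axis=1)
```

Comparing `clusters` against `data["labels"]` (up to a permutation of cluster ids) is the usual way to judge the clustering quality on MNIST.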
Options for the script `encode.py`:

option | description |
---|---|
`--run-name` | run name (str, default: `test_run`) |
`--snapshot-final` | flag for using the final snapshot of the models (default behaviour); mutually exclusive with `--snapshot-stage` and `--snapshot-all` |
`--snapshot-stage` | training stage (epoch, int) whose model snapshot to load (the closest available snapshot is used if there is none for that exact epoch); -1 means the last snapshot before the final one; mutually exclusive with `--snapshot-final` and `--snapshot-all` |
`--snapshot-all` | encode the dataset using every snapshot made during training; mutually exclusive with `--snapshot-final` and `--snapshot-stage` |
`--num-examples` | number of examples (int; unset by default, in which case the whole dataset is encoded) |
`--batch-size` | batch size for the inference |
`--dataset-train` | flag: encode the training subset of the data if set, the test subset otherwise (default behaviour) |
`--dataset-workers` | number of workers preprocessing the dataset |
Next, one may want to apply UMAP to the encoded data. The following script may be useful in this case:

```shell
python umap-project.py --run-name=test_run --file-path=./logs/test_run/mnist_train_stage-ep0135_encoded.npz
```

Options for the script `umap-project.py`:
option | description |
---|---|
`--run-name` | run name (str, default: `test_run`) |
`--file-path` | path to the encodings file; mutually exclusive with `--files-all` |
`--files-all` | process the encodings of all available training stages (each stage separately); mutually exclusive with `--file-path` |
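The output plot filename mirrors the input encodings filename. A small stdlib sketch of this naming convention, inferred from the example paths in this README (the helper function is illustrative, not part of the repository):

```python
from pathlib import Path

def umap_plot_path(encodings_path):
    """Prefix the encodings filename with 'umap-' and switch the
    extension to .png (convention inferred from the README examples;
    this helper is illustrative, not part of the repository)."""
    p = Path(encodings_path)
    return p.with_name(f"umap-{p.stem}.png")

print(umap_plot_path("./logs/test_run/mnist_train_stage-ep0135_encoded.npz"))
```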
As a result, UMAP dimensionality reduction will be applied to the encodings, and the resulting plot will be saved in the run directory, e.g., `./logs/test_run/umap-mnist_train_stage-ep0135_encoded.png`. Here is an example of this plot: