MKrinitskiy/ClusterGAN4Climate

ClusterGAN fork for climate data experiments.

This repository is largely based on ClusterGAN: A PyTorch Implementation, which in turn is based on the TensorFlow implementation of ClusterGAN.

Requirements

This codebase was developed in the following environment:

python 3.6.9
pytorch 1.5.0
matplotlib 3.1.3
tqdm 4.47.0
numpy 1.18.1
seaborn 0.10.1
torchvision 0.6.0
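The pinned versions above could be captured in a requirements file. This is a hypothetical `requirements.txt` (the file name and exact pins are assumptions; the repository may manage its environment differently — note that the `pytorch` package is installed as `torch`):

```
matplotlib==3.1.3
numpy==1.18.1
seaborn==0.10.1
torch==1.5.0
torchvision==0.6.0
tqdm==4.47.0
```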

In umap-project.py, we also use the GPU-enabled implementation of UMAP from the RAPIDS suite of open-source libraries. The line responsible for that is:

from cuml import UMAP

One can switch to the standard CPU implementation by replacing that line with:

from umap import UMAP

in which case the umap-learn package must be installed.
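Instead of editing the line by hand, the GPU and CPU implementations can be selected at runtime with a try-first import helper. This is a sketch of the pattern, not code from the repository:

```python
import importlib

def import_first(*candidates):
    """Return the first attribute importable from the given
    (module, attribute) pairs; raise ImportError if none is available."""
    for module, attr in candidates:
        try:
            return getattr(importlib.import_module(module), attr)
        except ImportError:
            continue
    raise ImportError(f"none of the candidate modules could be imported: {candidates}")

# Example (assumes at least one of cuml or umap-learn is installed):
# UMAP = import_first(("cuml", "UMAP"), ("umap", "UMAP"))
```

With this helper, the GPU version is preferred automatically and the CPU version is used as a fallback.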

Run ClusterGAN on MNIST

We have narrowed the scope of the applications to MNIST only (so far). To run ClusterGAN on the MNIST dataset, use the following command:

CUDA_VISIBLE_DEVICES=gpu_id python train.py --run-name=test_run --batch-size=64 --epochs=500 --num-workers=16 --snapshots-period=15 --latent-dim=32 --cat-num=10

where gpu_id is the id of the GPU to use (for example, 0).

As a result, a directory ./logs/run_name will be created containing the generated output (models, generated examples, training figures) for the training run.

Options for the script train.py:

| option | description |
| --- | --- |
| --run-name | run name (str, default='test_run') |
| --batch-size | batch size (int, default=64) |
| --epochs | number of training epochs (int, default=200) |
| --wass-metric | flag for the Wasserstein metric (True if the option is set) |
| --num-workers | number of dataset workers (int, default=1) |
| --snapshots-period | period (in epochs) for saving model snapshots (int, default=-1, meaning final models only) |
| --latent-dim | real embeddings dimensionality (int, default=32) |
| --cat-num | categorical embeddings dimensionality, i.e. the number of categories (int, default=10) |
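For reference, the options above could be declared with argparse roughly as follows. This is a sketch mirroring the documented names and defaults; the actual parser in train.py may differ:

```python
import argparse

def build_train_parser():
    # Mirrors the option table above; defaults follow the documented values.
    p = argparse.ArgumentParser(description="Train ClusterGAN on MNIST")
    p.add_argument("--run-name", type=str, default="test_run", help="run name")
    p.add_argument("--batch-size", type=int, default=64, help="batch size")
    p.add_argument("--epochs", type=int, default=200, help="number of training epochs")
    p.add_argument("--wass-metric", action="store_true", help="use the Wasserstein metric")
    p.add_argument("--num-workers", type=int, default=1, help="number of dataset workers")
    p.add_argument("--snapshots-period", type=int, default=-1,
                   help="snapshot period in epochs; -1 saves final models only")
    p.add_argument("--latent-dim", type=int, default=32, help="real embeddings dimensionality")
    p.add_argument("--cat-num", type=int, default=10, help="number of categories")
    return p
```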

Encoding examples

Next, one may want to apply the encoder network to data. The following command may be useful in this case:

CUDA_VISIBLE_DEVICES=gpu_id python encode.py --run-name=test_run --snapshot-final --num-examples=60000 --batch-size=512 --dataset-train --dataset-workers=16

or the following:

CUDA_VISIBLE_DEVICES=gpu_id python encode.py --run-name=test_run --snapshot-stage=100 --batch-size=512 --dataset-workers=16

As a result, the encoded data file (*.npz) will be saved in the run directory, e.g., ./logs/test_run/mnist_train_stage-final_encoded.npz or ./logs/test_run/mnist_train_stage-ep0135_encoded.npz. The file contains three named arrays: zn, zc_logits, and labels, holding the real-valued part of the embeddings, the categorical part, and the true labels of the examples, respectively.
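The saved .npz file can be inspected with NumPy. A minimal sketch, assuming only the three array names described above:

```python
import numpy as np

def load_encodings(path):
    """Load the arrays written by encode.py: real-valued embeddings,
    categorical logits, and true labels."""
    with np.load(path) as data:
        return data["zn"], data["zc_logits"], data["labels"]
```

For the full MNIST training set with --latent-dim=32 and --cat-num=10, zn would have shape (60000, 32), zc_logits (60000, 10), and labels (60000,).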

Options for the script encode.py:

| option | description |
| --- | --- |
| --run-name | run name (str, default='test_run') |
| --snapshot-final | flag for using the final snapshot of the models (default behaviour); mutually exclusive with --snapshot-stage and --snapshot-all |
| --snapshot-stage | training stage (epoch, int) of the model snapshot to load (the closest snapshot is used if none exists for that exact epoch); -1 means the last snapshot except the final one; mutually exclusive with --snapshot-final and --snapshot-all |
| --snapshot-all | encode the dataset using all the snapshots made during training; mutually exclusive with --snapshot-final and --snapshot-stage |
| --num-examples | number of examples (int; if not set, the whole dataset is encoded) |
| --batch-size | batch size for inference |
| --dataset-train | flag: encode the training subset if set, otherwise the test subset (default behaviour) |
| --dataset-workers | number of workers preprocessing the dataset |
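The three snapshot options form a mutually exclusive group. With argparse this constraint could be expressed as follows (a sketch of the pattern, not necessarily how encode.py implements it; only the documented defaults are filled in):

```python
import argparse

def build_encode_parser():
    p = argparse.ArgumentParser(description="Encode a dataset with trained models")
    p.add_argument("--run-name", type=str, default="test_run")
    # Only one snapshot-selection option may be given at a time.
    group = p.add_mutually_exclusive_group()
    group.add_argument("--snapshot-final", action="store_true",
                       help="use the final snapshot (default behaviour)")
    group.add_argument("--snapshot-stage", type=int,
                       help="epoch of the snapshot to load; -1 for the last non-final one")
    group.add_argument("--snapshot-all", action="store_true",
                       help="encode with every snapshot made during training")
    p.add_argument("--num-examples", type=int)
    p.add_argument("--batch-size", type=int)
    p.add_argument("--dataset-train", action="store_true")
    p.add_argument("--dataset-workers", type=int)
    return p
```

Passing two snapshot options together (e.g., --snapshot-final --snapshot-all) makes the parser exit with a usage error.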

UMAP dimensionality reduction

Next, one may want to apply UMAP to the encoded data:

python umap-project.py --run-name=test_run --file-path=./logs/test_run/mnist_train_stage-ep0135_encoded.npz

Options for the script umap-project.py:

| option | description |
| --- | --- |
| --run-name | run name (str, default='test_run') |
| --file-path | encodings file path; mutually exclusive with --files-all |
| --files-all | cluster using the encodings of all available training stages (each stage separately); mutually exclusive with --file-path |

As a result, UMAP dimensionality reduction will be applied to the encodings, and the resulting plot will be saved to a file, e.g., ./logs/test_run/umap-mnist_train_stage-ep0135_encoded.png.
