ClusterGAN fork for climate data experiments.
This repository is largely based on the repository ClusterGAN: A PyTorch Implementation, which in turn is based on the original TensorFlow implementation of ClusterGAN.
This codebase was developed within the following environment:

- python 3.6.9
- pytorch 1.5.0
- torchvision 0.6.0
- numpy 1.18.1
- matplotlib 3.1.3
- seaborn 0.10.1
- tqdm 4.47.0
In `umap-project.py`, we also use the GPU-enabled implementation of UMAP available in RAPIDS, a suite of open-source GPU-accelerated libraries and APIs. The line responsible for that is the following:

```python
from cuml import UMAP
```

One can switch to the standard CPU implementation simply by replacing this line with:

```python
from umap import UMAP
```

in which case one needs to ensure the `umap-learn` package is installed.
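If you want the script to work in both setups without editing it, the backend choice can be wrapped in an import guard. This is a sketch of such a guard, not code that exists in the repository:

```python
# Sketch of an optional import guard (not part of the repository):
# prefer the GPU-accelerated UMAP from RAPIDS cuml, fall back to the
# CPU implementation from umap-learn if cuml is unavailable.
try:
    from cuml import UMAP        # GPU implementation (RAPIDS cuml)
except ImportError:
    try:
        from umap import UMAP    # CPU implementation (umap-learn)
    except ImportError:
        UMAP = None              # neither backend is installed
```

With this guard, the rest of the script can use `UMAP` unchanged regardless of which backend is available.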
So far, we have narrowed the scope of the applications to MNIST only. To run ClusterGAN on the MNIST dataset, one may use the following command:

```shell
CUDA_VISIBLE_DEVICES=gpu_id python train.py --run-name=test_run --batch-size=64 --epochs=500 --num-workers=16 --snapshots-period=15 --latent-dim=32 --cat-num=10
```
where `gpu_id` is the id of the GPU to use (for example, `0`). As a result, a directory `./logs/run_name` will be created containing the generated output (models, generated examples, training figures) for the training run.
Options for the script `train.py`:

option | description |
---|---|
`--run-name` | run name (str, default: `test_run`) |
`--batch-size` | batch size (int, default: 64) |
`--epochs` | number of training epochs (int, default: 200) |
`--wass-metric` | flag for using the Wasserstein metric (True if the option is set) |
`--num-workers` | number of dataset workers (int, default: 1) |
`--snapshots-period` | period, in epochs, for saving model snapshots (int, default: -1, meaning only the final models are saved) |
`--latent-dim` | dimensionality of the real-valued part of the embeddings (int, default: 32) |
`--cat-num` | dimensionality of the categorical part of the embeddings, i.e. the number of categories (int, default: 10) |
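The `--latent-dim` and `--cat-num` options correspond to the two parts of the ClusterGAN latent code: a continuous Gaussian part and a one-hot categorical part, concatenated into a single vector. A minimal NumPy sketch of such a sampler (the function name and the `sigma` scale are illustrative, not taken from the repository):

```python
import numpy as np

def sample_latent(batch_size, latent_dim=32, cat_num=10, sigma=0.1):
    """Sketch of ClusterGAN-style latent sampling: a continuous
    Gaussian part zn and a one-hot categorical part zc, concatenated.
    Names and sigma are illustrative, not from the repository."""
    zn = sigma * np.random.randn(batch_size, latent_dim)
    idx = np.random.randint(0, cat_num, size=batch_size)
    zc = np.zeros((batch_size, cat_num))
    zc[np.arange(batch_size), idx] = 1.0       # one-hot category
    return np.concatenate([zn, zc], axis=1)

z = sample_latent(8)  # shape (8, latent_dim + cat_num) = (8, 42)
```

The one-hot part is what lets the encoder recover a discrete cluster assignment from a generated example.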
Next, one may want to apply the encoder network to data. The following script may be useful in this case:

```shell
CUDA_VISIBLE_DEVICES=gpu_id python encode.py --run-name=test_run --snapshot-final --num-examples=60000 --batch-size=512 --dataset-train --dataset-workers=16
```

or, alternatively:

```shell
CUDA_VISIBLE_DEVICES=gpu_id python encode.py --run-name=test_run --snapshot-stage=100 --batch-size=512 --dataset-workers=16
```
As a result, an encoded data file `*.npz` will be saved in the directory of the run, e.g., `./logs/test_run/mnist_train_stage-final_encoded.npz` or `./logs/test_run/mnist_train_stage-ep0135_encoded.npz`. The file contains three named arrays: `zn`, `zc_logits`, and `labels`, holding the real-valued part of the embeddings, the categorical logits, and the true labels of the examples, respectively.
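The encodings file can then be consumed with plain NumPy; for example, a hard cluster assignment is the argmax over the categorical logits. The snippet below builds a synthetic file just to illustrate the layout (the array names follow the description above; the data itself is fake):

```python
import numpy as np

# Build a synthetic encodings file with the same layout as encode.py's
# output: named arrays zn, zc_logits, and labels (the data is fake).
n, latent_dim, cat_num = 100, 32, 10
np.savez("encoded_demo.npz",
         zn=np.random.randn(n, latent_dim).astype(np.float32),
         zc_logits=np.random.randn(n, cat_num).astype(np.float32),
         labels=np.random.randint(0, cat_num, size=n))

data = np.load("encoded_demo.npz")
# A hard cluster assignment is the argmax over the categorical logits.
clusters = data["zc_logits"].argmax(axis=1)
```

Comparing `clusters` against `data["labels"]` (up to a permutation of cluster ids) is the usual way to judge the clustering quality on MNIST.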
Options for the script `encode.py`:

option | description |
---|---|
`--run-name` | run name (str, default: `test_run`) |
`--snapshot-final` | flag for using the final snapshot of the models (default behaviour); mutually exclusive with `--snapshot-stage` and `--snapshot-all` |
`--snapshot-stage` | training stage (epoch, int) whose model snapshot to load (the closest available snapshot is used if there is none for that exact epoch); -1 means the last snapshot before the final one; mutually exclusive with `--snapshot-final` and `--snapshot-all` |
`--snapshot-all` | encode the dataset using every snapshot made during training; mutually exclusive with `--snapshot-final` and `--snapshot-stage` |
`--num-examples` | number of examples (int; unset by default, in which case the whole dataset is encoded) |
`--batch-size` | batch size for the inference |
`--dataset-train` | flag: encode the training subset of the data if set, the test subset otherwise (default behaviour) |
`--dataset-workers` | number of workers preprocessing the dataset |
Next, one may want to apply UMAP to the encoded data. The following script may be useful in this case:

```shell
python umap-project.py --run-name=test_run --file-path=./logs/test_run/mnist_train_stage-ep0135_encoded.npz
```

Options for the script `umap-project.py`:
option | description |
---|---|
`--run-name` | run name (str, default: `test_run`) |
`--file-path` | path to the encodings file; mutually exclusive with `--files-all` |
`--files-all` | process the encodings of all available training stages (each stage separately); mutually exclusive with `--file-path` |
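The output plot filename mirrors the input encodings filename. A small stdlib sketch of this naming convention, inferred from the example paths in this README (the helper function is illustrative, not part of the repository):

```python
from pathlib import Path

def umap_plot_path(encodings_path):
    """Prefix the encodings filename with 'umap-' and switch the
    extension to .png (convention inferred from the README examples;
    this helper is illustrative, not part of the repository)."""
    p = Path(encodings_path)
    return p.with_name(f"umap-{p.stem}.png")

print(umap_plot_path("./logs/test_run/mnist_train_stage-ep0135_encoded.npz"))
```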
As a result, UMAP dimensionality reduction will be applied to the encodings, and the resulting plot will be saved in the run directory, e.g., `./logs/test_run/umap-mnist_train_stage-ep0135_encoded.png`. Here is an example of this plot: