
aleju/cat-generator-deprecated


Note: This project is deprecated and was replaced by a new version, which uses a new architecture for G that gets rid of the Laplacian pyramid training steps. That simplifies the whole training process significantly. The new architecture also results in higher quality images.


About

This script generates new images of cats using the technique of generative adversarial networks (GAN), as described in the paper by Goodfellow et al. The images are enhanced with the Laplacian pyramid technique from the paper by Denton, Chintala et al. Most of the code is based on Facebook's eyescream project. It also uses code from other repositories for spatial transformers, weight initialization and LeakyReLUs.

Images

Full sized example images

64 generated 32x32 images, started as 16x16 and upscaled in two steps to 32x32. Cherry-picked among 20 such blocks of 64 images each. About 80% of all images look like arse after upscaling, 10% fairly bad and ~10% ok-ish to good. (Interestingly, they still look subjectively bad when completely surrounded by bad-looking images.) The upscaling steps amplify even the slightest errors and introduce new ones. Even ~50% of the training images look bad to horrible after upscaling.

1024 generated 16x16 images

1024 16x16 cat images, randomly generated by G2 trained via th train3.lua --D_iterations=3.

1024 generated 16x16 images

1024 16x16 cat images, randomly generated by G from another experiment trained via th train.lua --D_iterations=2.

Training progress by epoch

Training progress by epoch, from epoch 1 to 271, where each epoch used a random selection of 1000 images from the training set (16x16 generator, pretrained before the first epoch, trained via th train.lua --D_iterations=2).

Nearest neighbours of generated 16x16 images

16 generated images (left image of each pair) and their nearest neighbours from the training set (right image of each pair). Distance was measured by the 2-norm (torch.dist()). The 16 selected images were the "best" ones among 1024 images according to the rating by D, hence some similarity with the training set is expected.
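
For reference, a minimal sketch of such a nearest-neighbour lookup in Torch; the function and tensor names are placeholders, not code from this repository:

    require 'torch'

    -- Find the training image closest to a generated image, using the
    -- 2-norm (torch.dist()). `generated` is a flattened 1D image tensor,
    -- `trainingSet` is a 2D tensor of size (N, imageSize); both names
    -- are only placeholders for this sketch.
    local function nearestNeighbour(generated, trainingSet)
        local bestIdx, bestDist = 1, math.huge
        for i = 1, trainingSet:size(1) do
            local d = torch.dist(generated, trainingSet[i]) -- 2-norm of the difference
            if d < bestDist then
                bestDist, bestIdx = d, i
            end
        end
        return bestIdx, bestDist
    end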

Background Knowledge

The basic principle of GANs is to train two networks in a kind of forger-police-relationship. The forger is called G (generator) and the police D (discriminator). It is D's job to take a look at an image and estimate whether it is a fake or a real image (where "real" is synonymous with "from the training set"). Naturally it's G's job to generate images that trick D into believing that they are from the training set. With a large enough training set and some regularization strategies, D cannot just memorize the training set. As a result, D must learn the general rules that govern the look of images from the training set (i.e. a generalizing function). Similarly, G must learn how to "paint" new images that look like the ones from the training set, otherwise it would not be able to trick D.
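
A minimal sketch of that two-network training loop in Torch. The tiny fully connected G and D here are placeholders rather than the actual networks from this repository, and the real training scripts additionally split each batch between real and fake images:

    require 'nn'
    require 'optim'

    local noiseDim, imgSize = 100, 16*16   -- assumed sizes for this sketch

    -- Placeholder networks; the real G and D in this repository are more complex.
    local G = nn.Sequential()
        :add(nn.Linear(noiseDim, 2048)):add(nn.PReLU())
        :add(nn.Linear(2048, imgSize)):add(nn.Sigmoid())
    local D = nn.Sequential()
        :add(nn.Linear(imgSize, 512)):add(nn.PReLU())
        :add(nn.Linear(512, 1)):add(nn.Sigmoid())

    local criterion = nn.BCECriterion()
    local paramsD, gradD = D:getParameters()
    local paramsG, gradG = G:getParameters()
    local adamD, adamG = {learningRate=0.001}, {learningRate=0.001}

    -- One training step on a batch of real images (2D tensor of size batchSize x imgSize).
    local function trainStep(realBatch)
        local batchSize = realBatch:size(1)
        local noise = torch.rand(batchSize, noiseDim)
        local fakeBatch = G:forward(noise):clone()

        -- Train D: real images should get label 1, images from G label 0.
        optim.adam(function()
            gradD:zero()
            local predReal = D:forward(realBatch)
            local lossReal = criterion:forward(predReal, torch.ones(batchSize, 1))
            D:backward(realBatch, criterion:backward(predReal, torch.ones(batchSize, 1)))
            local predFake = D:forward(fakeBatch)
            local lossFake = criterion:forward(predFake, torch.zeros(batchSize, 1))
            D:backward(fakeBatch, criterion:backward(predFake, torch.zeros(batchSize, 1)))
            return lossReal + lossFake, gradD
        end, paramsD, adamD)

        -- Train G: its images should be rated as real (label 1) by D.
        optim.adam(function()
            gradG:zero()
            local generated = G:forward(noise)
            local pred = D:forward(generated)
            local loss = criterion:forward(pred, torch.ones(batchSize, 1))
            local gradImages = D:updateGradInput(generated, criterion:backward(pred, torch.ones(batchSize, 1)))
            G:backward(noise, gradImages)
            return loss, gradG
        end, paramsG, adamG)
    end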

The previously mentioned Laplacian pyramid technique for GANs is pretty straight-forward: Instead of training G and D on full-sized images (e.g. 64x64 pixels), you train them on small ones (e.g. 8x8 pixels). Afterwards you increase the size of the generated images in multiple steps to the final size, e.g. from 8x8 to 16x16 to 32x32 to 64x64. For each of these steps you train another pair of G and D, but in the case of these upscaling steps they are trained to learn good refinements of the upscaled (and hence blurry) images. That means that D gets fed refined/sharpened images and must tell whether these were real images (i.e. blurry images from the training set with optimal refinements) or fake images from G (i.e. blurry images from the training set, but the refinement was done by G). Again, G must learn to generate good refinements and D must learn what good refined images look like. The image below (taken from the paper) shows the process (they start with the full-sized images; the one on the far right could be generated by a GAN). Note that this training methodology is similar to how one would naturally paint images: You start with a rough sketch (low resolution image) and then progressively add more and more details (increases in resolution).

Laplacian pyramid
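
A minimal sketch of how such a blurry conditioning image and the corresponding "real" refinement can be built with Torch's image package; the function and variable names are placeholders:

    require 'image'

    -- Build the training pair for one coarse-to-fine step, e.g. 16px -> 22px.
    -- `fineImg` is an original grayscale training image (float tensor) that has
    -- already been scaled to fineSize x fineSize; the names are placeholders.
    local function makeRefinementPair(fineImg, coarseSize, fineSize)
        -- Downscale to the coarse size and upscale back: this is the blurry
        -- image that both G and D get as conditioning input.
        local coarse = image.scale(fineImg, coarseSize, coarseSize)
        local blurry = image.scale(coarse, fineSize, fineSize)
        -- The "real" refinement is the difference between the sharp original
        -- and its blurry version; G has to produce something comparable.
        local realRefinement = fineImg - blurry
        return blurry, realRefinement
    end

    -- Reconstructing a sharp image from a blurry one plus a refinement:
    -- local refined = blurry + refinementFromG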

Requirements

  • Torch with the following libraries (most of them are probably already installed by default):
    • nn (luarocks install nn)
    • paths (luarocks install paths)
    • image (luarocks install image)
    • optim (luarocks install optim)
    • cutorch (luarocks install cutorch)
    • cunn (luarocks install cunn)
    • dpnn (luarocks install dpnn)
    • stn (see here)
    • display
  • Python 2.7 (only tested with that version)
    • scipy
    • numpy
    • scikit-image
  • 10k cats dataset
  • CUDA capable GPU (~3GB memory or more) with cudnn3

Usage

Preparation steps:

  • Install all requirements as listed above.
  • Download and extract the 10k cats dataset into a directory, e.g. /foo/bar. That folder should then contain the subfolders CAT_00 to CAT_06.
  • Clone the repository.
  • Switch to the repository's subdirectory dataset via cd dataset and convert your downloaded cat images into a normalized and augmented set of ~100k cat faces with python generate_dataset.py --path="/foo/bar". This may take a good two hours or so to run through, as it performs lots of augmentations.

Before training:

  • Start display with ~/.display/run.js &
  • Open http://localhost:8000/ in your browser (plotting interface by display).
  • Train V for 50 epochs with th train_v.lua --grayscale. (Wait for a saving network to <path> message, then stop manually.) (~10min)
  • Pretrain G for 20 epochs with th pretrain_g.lua --grayscale. (Wait for a saving network to <path> message, then stop manually.) (This step can be skipped.) (~10min)

Training options:

  • Train a 16x16 base pair of G/G1 and D with th train.lua --grayscale --D_iterations=2. Takes about 150 epochs to get good. (~3h)
  • Train a 16x16 base tuple of G1, G2 and D with th train2.lua --grayscale --D_iterations=2. Takes about 150 epochs to get good. (~3h)
  • Train a 16x16 base tuple of G1, G2, G3 and D with th train3.lua --grayscale --D_iterations=3. Takes about 150 epochs to get good. (~3h)
  • Train a 16 to 22 refiner with th train_c2f.lua --grayscale --coarseSize=16 --fineSize=22 --D_iterations=2. Takes maybe 100 epochs or so to get good. More epochs seem to be better. Quality between epochs seems to vary quite a lot, so you have to take care to stop at a good save. (~3h+)
  • Train a 22 to 32 refiner with th train_c2f.lua --grayscale --coarseSize=22 --fineSize=32. Takes maybe 100 epochs or so to get good. More epochs are better. Performance between epochs seems to vary quite a lot, so you have to take care to stop at a good save. (~3h+)
  • Train a 16 to 22 refiner that includes images from the 16x16 network during training with th train_c2f_smallg.lua --grayscale --coarseSize=16 --fineSize=22 --G_L2=1e-6.
  • Train a 22 to 32 refiner that includes images from the 16x16 network and the 16 to 22 refiner during training with th train_c2f_smallg.lua --grayscale --coarseSize=22 --fineSize=32.

After training:

  • Sample images to directory samples/ with th sample.lua --grayscale. That script expects you to have trained 16x16, 16 to 22 and 22 to 32 networks. You can however easily comment out the code parts that upscale images.
  • If you want to sample from G2 or G3 (train2.lua / train3.lua) you will have to edit sample.lua and change the line local G = file.G (in method loadModels()) to local G = file.G2 (G2) or local G = file.G3 (G3). You will also have to set --G_base="adversarial2.net" and --D_base="adversarial2.net" (for train2.lua, use 3 instead of 2 for train3.lua).

Notes:

  • Results of experiments do not seem to be reproducible (when rerunning with the same hyperparameters). I don't know the reason for that, as I set all seeds via math.randomseed(), torch.manualSeed() and cutorch.manualSeed() (see the sketch after this list). The networks shown above in the images seemed to have been good local optima, so you may have to run experiments several times to get results of the same quality. You should, however, get images that clearly resemble cats on (basically) every run.
  • The script train_c2f_smallg.lua can only be run in 16 to 22 and 22 to 32 mode. Other settings will result in errors.
  • Noteworthy command line attributes to tinker with are:
    • N_epoch=integer: Number of images to randomly pick from the training set per epoch.
    • batchSize=integer: Size of each batch. Should be divisible by 2 and >= 4.
    • D_iterations=integer: How often to train D per batch. Every iteration will be filled with new randomly picked examples.
    • G_iterations=integer: How often to train G per batch.
    • G_L1=float: Strength of the L1 penalty (weight regularization) on G.
    • G_L2=float: Strength of the L2 penalty (weight regularization) on G.
    • D_L1=float: Strength of the L1 penalty (weight regularization) on D.
    • D_L2=float: Strength of the L2 penalty (weight regularization) on D.
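
A minimal sketch of the seed setting referenced in the reproducibility note above; the seed value is just a placeholder:

    require 'torch'
    require 'cutorch'   -- only needed on a CUDA setup, as listed in the requirements

    -- Set all involved seeds. Even with these, reruns with identical
    -- hyperparameters did not reproduce the earlier results exactly.
    local seed = 42              -- placeholder value
    math.randomseed(seed)        -- Lua's own RNG
    torch.manualSeed(seed)       -- CPU tensors
    cutorch.manualSeed(seed)     -- GPU tensors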

V

V (the Validator) is intended to be a half-decent replacement for validation scores, which you don't have in GANs. V's architecture is - similarly to D - a convolutional neural network. Just like D, V creates fake/real judgements for images, i.e. it rates how fake images look. V gets fed images generated by G and rates them. The mean of that rating can be used as the mentioned validation score replacement. V is trained once before the 16x16 run. During that training, V sees real images from the dataset as well as synthetically generated fake images. The methods to generate the synthetic images are roughly:

  • Random mixing of two images.
  • Random warping of an image (i.e. moving parts of the image around, causing distortions).
  • Random stamping of an image (i.e. replacing parts of the image by parts from somewhere else in the image).
  • Randomly throwing random pixel values together (with some Gaussian blurring technique, so that it's not just Gaussian noise).

These techniques are then sometimes combined with each other, e.g. one image is modified by warping, another by stamping, and then both are mixed into one final synthetic image.
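
A rough sketch of two of these corruption methods (mixing and stamping) in Torch; the exact implementations in this repository differ, and all names and parameter values here are made up for illustration:

    require 'torch'

    -- Mix two images into one synthetic fake by random alpha blending.
    local function randomMix(imgA, imgB)
        local alpha = 0.25 + 0.5 * math.random()   -- blend factor in [0.25, 0.75]
        return imgA * alpha + imgB * (1 - alpha)
    end

    -- "Stamp" an image: copy a random square patch over another random location.
    local function randomStamp(img, patchSize)
        local h, w = img:size(2), img:size(3)      -- assumes a 1xHxW grayscale tensor
        local sy = math.random(1, h - patchSize + 1)
        local sx = math.random(1, w - patchSize + 1)
        local ty = math.random(1, h - patchSize + 1)
        local tx = math.random(1, w - patchSize + 1)
        local result = img:clone()
        local patch = img[{{}, {sy, sy + patchSize - 1}, {sx, sx + patchSize - 1}}]
        result[{{}, {ty, ty + patchSize - 1}, {tx, tx + patchSize - 1}}]:copy(patch)
        return result
    end

    -- Example of a combination: stamp one image, then mix it with another.
    -- local fake = randomMix(randomStamp(img1, 6), img2)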

The rating by V was sometimes quite off and overall noticeably worse than a good validation set with an accuracy/loss value. However, more often than not it seemed to at least roughly resemble the real image quality.

Architecture

All networks are optimized for grayscale image generation. That was done mainly to simplify the problem and reduce the computational burden. Most of the activations were PReLUs, because they perform better than ReLUs in my experience. Networks with LeakyReLUs seemed to blow up more frequently, so I didn't use them.

G/G1, G2, G3 (base / 16x16 images)

The 16x16 G (aka G1) is a very small network with just one hidden layer (2048 nodes). I tried using multiple hidden layers several times, but that just resulted in frequent blowups or inferior results. Choosing other sizes for the hidden layer also seemed to worsen results.

Architecture of G
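
As a rough Torch sketch, that architecture corresponds to something like the following; the noise dimension and the reshaping details are assumptions, not necessarily the exact values used in train.lua:

    require 'nn'

    local noiseDim = 100               -- assumed noise vector size
    local height, width = 16, 16

    -- Minimal sketch of the 16x16 G/G1: one hidden layer with 2048 nodes.
    local G1 = nn.Sequential()
    G1:add(nn.Linear(noiseDim, 2048))
    G1:add(nn.PReLU())
    G1:add(nn.Linear(2048, height * width))
    G1:add(nn.Sigmoid())               -- outputs in [0, 1], matching the 16x16 image range
    G1:add(nn.View(1, height, width))  -- reshape the vector into a 1-channel image

    -- Usage: local img = G1:forward(torch.rand(noiseDim))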

G2 is a small upsampling network. It is similar to the coarse to fine Gs, but has less capacity. Notably, it outputs a full new image instead of just a difference (as the coarse to fine Gs do). Using G2 during training seemed to worsen the results of G1 by making them more noisy. G2 would always end up just denoising the images instead of adding new details. (As if they started to fill different niches: one creating more noisy versions of images, the other one creating less noisy versions.) Sometimes that looked good, but usually the loss in G1's quality outweighed the benefit of G2.

  • SpatialConvolutionUpsample, 32 times 3x3
  • PReLU
  • SpatialConvolutionUpsample, 64 times 5x5
  • PReLU
  • SpatialConvolutionUpsample, 64 times 5x5
  • Sigmoid

G3 is similar to G/G1. It takes in images generated by G2 as vectors and returns similarly sized images with one hidden layer of 1024 nodes in between, i.e. the architecture is:

  • Input layer, 1*16*16 = 256 nodes
  • Hidden layer, 1024 nodes
  • PReLU
  • Output layer, 1*16*16 = 256 nodes
  • Sigmoid
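
A sketch of G3 in Torch; the flattening and reshaping steps are assumptions about implementation details:

    require 'nn'

    -- Minimal sketch of G3: takes a 16x16 image from G2 (flattened to a vector)
    -- and returns a new 16x16 image, with a single 1024-node hidden layer.
    local G3 = nn.Sequential()
    G3:add(nn.View(1 * 16 * 16))       -- flatten the 1x16x16 input image to 256 values
    G3:add(nn.Linear(256, 1024))
    G3:add(nn.PReLU())
    G3:add(nn.Linear(1024, 256))
    G3:add(nn.Sigmoid())
    G3:add(nn.View(1, 16, 16))         -- back to a 1-channel 16x16 image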

G3 was very prone to overfitting, i.e. adding similar features to all images. When it didn't do that, the images often looked good but very uniform. It seemed a bit like adding Gs with hidden layers made the outputs become more similar to the maxima of the probability distribution as approximated by D.

D (base / 16x16 images)

The base D is a standard convolutional network with multiple branches. It uses a spatial transformer at the start to remove rotations. Three of the four branches also have spatial transformers (rotation, translation, scaling), so that they can learn to focus on specific areas of the image. (I don't know if they really did that.) The fourth branch is intended to analyze the whole image.

The transformers are not strictly necessary. Networks without them seemed to perform only marginally worse (which might have been subjective or down to luck). The effect of pooling layers was similarly marginal: adding them seemed to slightly worsen results. Adding Batch Normalization did not work - it led to D reaching basically 100% accuracy almost immediately, leaving G no time to learn anything (or maybe it somehow messed up the gradients for G). G would then just output garbage and never improve again.

Architecture of D

All convolutions were size-preserving. All localization networks of the spatial transformers used the same architecture. The last hidden layer was deliberately kept small to counteract the huge 320*16*16 concat (i.e. to reduce GPU RAM and disk requirements). I ran some tests with similar architectures and larger (and multiple) hidden layers, but that didn't seem to improve the results.

G and D (coarse to fine / laplacian pyramid)

Both coarse to fine steps use basically the same networks. They are mostly taken from the eyescream project. (I did not spend much time optimizing these steps; different networks would most likely have been better.)

G 22x22 and 32x32

Architecture:

  • Join noise layer and image layer to two-channel image
  • Spatial convolutional upsampling, 64 times 3x3
  • PReLU
  • Spatial convolutional upsampling, 512 times 7x7
  • PReLU
  • Spatial convolutional upsampling, 1 time 5x5 (i.e. from 512 channels down to a grayscale image)
  • No activation / linear

Using tanh instead of a linear activation at the end didn't seem to work, despite normalizing images to range -1.0 to +1.0.

D 22x22

Architecture:

  • Convolution, 64 times 3x3, no padding (decreases image size to 20x20)
  • PReLU
  • Convolution, 256 times 5x5
  • PReLU
  • Max Pooling 2x2
  • Convolution, 1024 times 3x3
  • PReLU
  • Max Pooling 2x2
  • Dropout
  • Linear/Fully connected layer from 1024*5*5 = 25,600 to 1
  • Sigmoid
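
A sketch of this 22x22 D in Torch; the padding values and the dropout rate are assumptions, chosen so that the listed layer sizes work out:

    require 'nn'

    -- Minimal sketch of the 22x22 D. Input: a 1x22x22 refined grayscale image.
    local D22 = nn.Sequential()
    D22:add(nn.SpatialConvolution(1, 64, 3, 3))                 -- no padding: 22x22 -> 20x20
    D22:add(nn.PReLU())
    D22:add(nn.SpatialConvolution(64, 256, 5, 5, 1, 1, 2, 2))   -- padded, size-preserving: 20x20
    D22:add(nn.PReLU())
    D22:add(nn.SpatialMaxPooling(2, 2))                         -- 20x20 -> 10x10
    D22:add(nn.SpatialConvolution(256, 1024, 3, 3, 1, 1, 1, 1)) -- padded: 10x10
    D22:add(nn.PReLU())
    D22:add(nn.SpatialMaxPooling(2, 2))                         -- 10x10 -> 5x5
    D22:add(nn.Dropout(0.5))                                    -- assumed dropout rate
    D22:add(nn.View(1024 * 5 * 5))                              -- flatten to 25,600 values
    D22:add(nn.Linear(1024 * 5 * 5, 1))
    D22:add(nn.Sigmoid())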

D 32x32

Architecture (only change compared to 22x22 D is the added padding in the first layer):

  • Convolution, 64 times 3x3
  • PReLU
  • Convolution, 256 times 5x5
  • PReLU
  • Max Pooling 2x2
  • Convolution, 1024 times 3x3
  • PReLU
  • Max Pooling 2x2
  • Dropout
  • Linear/Fully connected layer from 1024*8*8 = 65,536 to 1
  • Sigmoid

V

The validator is a standard convolutional network.

  • Convolution 128 times 3x3, LeakyReLU
  • Convolution 128 times 3x3, LeakyReLU
  • Spatial Batch Normalization (before relu)
  • Dropout
  • Convolution 256 times 3x3, LeakyReLU
  • Convolution 256 times 3x3, LeakyReLU
  • Spatial Batch Normalization (before relu)
  • Max Pooling (2x2)
  • Spatial Dropout
  • Linear/Fully Connected Layer from 256*8*8 = 16,384 to 1024
  • Batch Normalization
  • LeakyReLU
  • Dropout
  • Linear/Fully Connected Layer from 1024 to 1024
  • Batch Normalization
  • LeakyReLU
  • Dropout
  • Linear/Fully Connected Layer from 1024 to 2
  • Softmax

(A 1-neuron sigmoid output would have probably been more logical.)

Preprocessing, training and sampling procedure

As a preprocessing step, the faces must be extracted from the 10k cats dataset. The dataset contains facial keypoints for each image (ears, eyes, nose), so extracting the faces isn't too hard. Each of the faces gets rotated so that the eye line is parallel to the x axis (i.e. rotations are removed). That was necessary as many cat images tend to be heavily rotated, making the learning task significantly harder (though that might work now with the addition of Spatial Transformers in D). After that normalizing step, the images are augmented by introducing (now small) rotations, translations, scalings, brightness changes, horizontal flipping and minor Gaussian noise. That bloats up the total size of the dataset from 10k to roughly 100k images (however, these images are often only marginally different, so it's not 100k images' worth of information).
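
The actual preprocessing happens in Python (generate_dataset.py); purely to illustrate the eye-line alignment idea, here is a minimal Torch sketch with made-up keypoint handling:

    require 'image'

    -- Rotate a cat face so that the eye line becomes parallel to the x axis.
    -- leftEye/rightEye are assumed to be {x, y} keypoints from the dataset's
    -- annotations; the exact format and sign convention are assumptions.
    local function alignEyes(img, leftEye, rightEye)
        local dy = rightEye[2] - leftEye[2]
        local dx = rightEye[1] - leftEye[1]
        local angle = math.atan2(dy, dx)   -- tilt of the eye line in radians
        return image.rotate(img, -angle)   -- rotate back so the eye line is horizontal
    end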

Training starts with G and D in the standard GAN setting. G generates images, D rates them. In the case of train2.lua, a second G (G2) is also trained at the same time. G2 takes images from G and applies a few Spatial Upsampling layers to them. Then both G and G2 feed their images into D, while D is trained alternately on images from G and G2. Similarly, in the case of train3.lua, another G (G3), which only has one fully connected 1024-neuron hidden layer, is added at the end. G has a standard BCE (binary cross entropy) loss based on D's rating. G2 and G3 get a mixture of a BCE loss on D's rating as well as a BCE loss on the difference between the images generated by G2/G3 and the previous G (G1/G2). For G2 and G3, the gradient from the first loss (D's rating) is multiplied by a factor of 0.01. As a result, G2 and G3 are mostly trained to preserve G's images and - to a small extent - to generate images that D likes. (Any other multiplier (lower/higher) led to worse results, i.e. more overfitting. For G3 even 0.01 seemed to be too high.)
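
A rough sketch of how such a weighted combination of the two gradients can look in Torch; this is a simplification with placeholder names, not the exact code of train2.lua/train3.lua:

    require 'nn'

    local criterionD   = nn.BCECriterion()  -- loss on D's rating of G2's images
    local criterionRec = nn.BCECriterion()  -- loss on the difference to G1's images
    local dWeight = 0.01                    -- multiplier for the gradient from D's rating

    -- One G2 backward pass, given a batch of images produced by G1.
    -- G2, D and imagesG1 are placeholders for this sketch.
    local function accumulateG2Gradients(G2, D, imagesG1)
        G2:zeroGradParameters()
        local imagesG2 = G2:forward(imagesG1)

        -- Gradient from D's rating, scaled down by dWeight.
        local pred = D:forward(imagesG2)
        local target = torch.ones(pred:size())
        criterionD:forward(pred, target)
        local gradFromD = D:updateGradInput(imagesG2, criterionD:backward(pred, target))

        -- Gradient that pulls G2's output towards G1's images.
        criterionRec:forward(imagesG2, imagesG1)
        local gradFromRec = criterionRec:backward(imagesG2, imagesG1)

        -- Combine: mostly preserve G1's images, only slightly please D.
        G2:backward(imagesG1, gradFromD * dWeight + gradFromRec)
        -- (the actual parameter update, e.g. via optim.adam, is omitted here)
    end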

After finishing the base training, the coarse to fine networks are trained: one from 16 to 22px (height/width) and another one from 22 to 32px (so both roughly +50% size). They follow the mentioned Laplacian pyramid technique. G gets fed blurry images and has to create sensible refinements for them (to get rid of the blurriness). D gets fed blurry images with refinements that are either real (calculated from the original images) or fake (generated by G). The blurry images are created by downscaling an original image to the coarse size (e.g. 16x16px) and then upsampling it to the fine size (e.g. 22x22px). Both refinement pairs get images that are normalized to a range between -1.0 and +1.0 (while the base 16x16 pair of G and D gets images with values in the range of 0.0 to +1.0). Normalizing seemed necessary to get good results (might have been a quirk of the specific architectures).

As an addition to the approach from the paper (about the Laplacian pyramid method), images generated by the chain of lower Gs (e.g. 16x16 G, then 16 to 22 G) are also fed into the coarse to fine G, just like the real images from the training set. As a result, G has to learn to sell the garbage from the lower levels to D, i.e. to refine these images properly. That seemed to improve image quality after upscaling a bit, though not amazingly so.

When sampling images from the whole network, they are first generated by the 16x16 G (or G2/G3). Then D can be applied to filter out bad images. The result is upscaled to 22x22px, normalized (-1.0 to +1.0) and fed into the first coarse to fine G. G creates a couple of possible refinements and D selects the best-looking one. The procedure is then repeated for 32x32px images.
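
A condensed sketch of that sampling chain; model loading, the exact normalization and the selection heuristics are simplified, and all names are placeholders:

    require 'nn'
    require 'image'

    -- Refine an image with one coarse-to-fine pair: upscale, normalize to [-1, +1],
    -- let G propose several refinements and let D pick the best-looking result.
    local function refine(Gc2f, Dc2f, img, fineSize, tries)
        local blurry = image.scale(img, fineSize, fineSize) * 2 - 1
        local best, bestScore = nil, -math.huge
        for i = 1, tries do
            local noise = torch.rand(1, fineSize, fineSize)
            local refinement = Gc2f:forward(torch.cat(noise, blurry, 1)):clone()
            local candidate = blurry + refinement        -- add the refinement to the blurry image
            local score = Dc2f:forward(candidate)[1]     -- D's rating of the candidate
            if score > bestScore then
                best, bestScore = candidate:clone(), score
            end
        end
        return (best + 1) / 2                            -- back to the [0, 1] range
    end

    -- Full chain: 16x16 base G, then the 16 to 22 and 22 to 32 refiners.
    local function sampleImage(G16, G22, D22, G32, D32)
        local img16 = G16:forward(torch.rand(100))       -- assumed noise size
        -- (optionally reject img16 here if the base D rates it badly)
        local img22 = refine(G22, D22, img16, 22, 10)
        return refine(G32, D32, img22, 32, 10)
    end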

Adam was used as the optimizer during training. Batch size was usually 32, i.e. D would see 16 fake and 16 real images and G would get 32 tries to mess with D.

Training procedure overview

Training procedure when using only one 16x16 G as in train.lua.
