GAN Model
Dataset collection
There are open-source pre-trained StyleGAN2 models available, but after testing a few, we found that none satisfied our needs. To achieve the dreamy, colourful effect we wanted from our GAN model, we chose to collect the dataset ourselves and train the model from the ground up.
We also considered copyright. To avoid copyright issues, we gathered our training data from flickr.com, filtering for images licensed with "Modifications Allowed".
We pre-processed the data by removing every image with a resolution below 800x800 and resizing the remaining images to exactly 1024x1024, so the model trains at a good resolution. This left us with roughly 22,000 images.
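For anyone wanting to reproduce this step, here is a minimal sketch of the filtering-and-resizing pass. It assumes the raw downloads sit in a flat folder; the folder names and the LANCZOS resampling filter are our own choices for illustration, not part of the original pipeline.

```python
import os
from PIL import Image

SRC, DST = "raw_images", "dataset_1024"   # assumed folder names
os.makedirs(DST, exist_ok=True)

for name in os.listdir(SRC):
    try:
        img = Image.open(os.path.join(SRC, name)).convert("RGB")
    except OSError:
        continue                            # skip files PIL cannot read
    if img.width < 800 or img.height < 800:
        continue                            # drop anything below 800x800
    # Resize to the 1024x1024 training resolution (non-square images get stretched).
    img.resize((1024, 1024), Image.LANCZOS).save(os.path.join(DST, name))
```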
Model training
We trained our model using NVIDIA's official release of StyleGAN2-ADA. The training configuration is paper1024, the same one NVIDIA used for its 1024x1024 MetFaces training.
Training was done on a single RTX 2080 Ti GPU, which gets through roughly 4,000 images every 20 minutes. After about four days and 1,100,000 images, roughly 50 passes through our dataset, the training was finished.
Model Iteration
After the model was trained, we used it to generate a projection image with one of our own paintings as input. In the process, some paintings could not be projected satisfactorily. This is largely because our dataset consists of landscapes and contains few distinct objects.
To make the model more expressive, we added around 5,000 paintings of houses, boats, flowers, birds, and people to the dataset, then trained the model for another 24 hours on the expanded data.
Image Creation
Creating a Target Painting
For a project that expresses the ballads in image form, the ballads first have to be interpreted. At first, we tried having each line of the ballads interpreted by a Natural Language Processing (NLP) model, but it became clear that the variety and expressiveness of the ballads are far beyond what today's NLP technology can capture. In the end, we decided that interpreting the ballads is a task best suited to the human mind.
We created a painting for each line of the ballads we wanted to interpret. Depending on the content of the line, the painting can depict a particular object, a scene, or abstract colours and shapes. We first analyzed the ballads to categorize their keywords and emotional words, then painted from those keywords. We took seven ballads as examples and drew five illustrations per ballad, producing 35 illustrations in total. All the illustrations were painted in Procreate. (We posted all of our illustrations on last week's blog.)
Each drawing is then used as the target image for our GAN projection; this way, we essentially guide the AI with the content of the ballads. For the public exhibition on the Inspace Screens, we are going to use all seven screens.
Image Projection
A trained StyleGAN2 model is able to generate new artworks with a designated random seed. However, this kind of image generation is completely random and doesn't allow us to influence the resulting image in any way.
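For reference, seed-based generation looks roughly like this with the PyTorch port of StyleGAN2-ADA (stylegan2-ada-pytorch), which we use here as an assumed setup; the network pickle name is a placeholder.

```python
import numpy as np
import torch
import dnnlib, legacy              # utility modules shipped with the stylegan2-ada-pytorch repo

device = torch.device("cuda")
with dnnlib.util.open_url("network-snapshot.pkl") as f:     # placeholder pickle name
    G = legacy.load_network_pkl(f)["G_ema"].to(device)       # trained generator

seed = 42                                                    # the "designated random seed"
z = torch.from_numpy(np.random.RandomState(seed).randn(1, G.z_dim)).to(device)
img = G(z, None, truncation_psi=0.7)   # None = no class label; output is a [1, 3, 1024, 1024] tensor
```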
Luckily, the state-of-the-art StyleGAN2 generates images from a latent vector. In our case, every image the model generates can be represented by a 512-dimensional vector, and a small change in the latent vector corresponds to a slight change in the resulting image. This property lets the model search the latent space incrementally for a specific image, a process called projection: with each step of the search, the model generates an image closer to the target image.
With this functionality, we can use our own drawings to guide the model towards a similar image by projecting each illustration into the model's latent space. However, the model's projection ability has its limits: some details of a drawing get lost in translation. This means that sometimes we don't get quite what we want, while other times the projection result is a pleasant surprise.
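The StyleGAN2-ADA repo ships its own projection script (projector.py); the sketch below is a simplified stand-in that optimizes a latent towards our drawing, using the lpips package for the perceptual distance instead of the repo's own VGG16 feature loss. It assumes G and device were loaded as in the previous snippet; the file name and step count are illustrative.

```python
import numpy as np
import PIL.Image
import torch
import lpips                                    # pip install lpips; perceptual distance

percep = lpips.LPIPS(net="vgg").to(device)

# Load the target drawing, scale it to the generator's resolution, map to [-1, 1].
target = PIL.Image.open("drawing.png").convert("RGB").resize((1024, 1024))
target = torch.from_numpy(np.array(target)).permute(2, 0, 1)[None].to(device)
target = target.float() / 127.5 - 1.0

# Start from the average latent and optimise it towards the drawing.
with torch.no_grad():
    z = torch.randn(10000, G.z_dim, device=device)
    w_avg = G.mapping(z, None).mean(dim=0, keepdim=True)    # [1, num_ws, 512]
w = w_avg.clone().requires_grad_(True)
opt = torch.optim.Adam([w], lr=0.1)

for step in range(1000):
    img = G.synthesis(w)                        # current guess for this latent
    loss = percep(img, target).sum()            # perceptual distance to the drawing
    opt.zero_grad()
    loss.backward()
    opt.step()
# w now generates the projection result for this drawing.
```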
Neighbour Selection
As the "human mind" of the project, we want the generated image to be expressive, full of dramatic colours and interesting texture. As mentioned before, because the projection done by the AI is purely by mathematically minimizing the distance between target image and projection result, sometimes the image projection doesn't produce quite the results we want. Even if the projection results are great, there might be better outcomes of the resulting image near them. For this reason, we perform a neighbour selection process on each projected image.
Each projection comes with its 512-dimensional latent vector. By adding a 512-dimensional Gaussian random vector to that latent vector, we get a new vector that generates a neighbouring image in the latent space. This image looks similar to the original projection result, with slightly different shapes, texture, and colouring. We generate 100 random neighbours for each projection result and select the one most to our liking as the final image.
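A minimal sketch of the neighbour generation, assuming w is the projected latent from the snippet above; the noise scale sigma is an assumed value to be tuned by eye.

```python
import torch

w = w.detach()                                  # projected latent, shape [1, num_ws, 512]
sigma = 0.1                                     # assumed noise scale
neighbours = []
for i in range(100):
    noise = torch.randn_like(w[:, 0]) * sigma   # one 512-d Gaussian vector
    w_n = w + noise.unsqueeze(1)                # apply the same offset to every layer
    with torch.no_grad():
        neighbours.append((w_n, G.synthesis(w_n)))
# The 100 images are then reviewed by hand and the favourite is kept.
```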
Generating Video Sequence
As mentioned before, we produced 35 paintings, five for each of the seven selected ballads, to be exhibited on the Inspace Screens. After our StyleGAN model played its part, we had 35 images born of human and AI collaboration.
To enhance the effect of these images, we composed seven videos, each representing one ballad. Thanks to the latent space, we can create a flow of images using noise loops and interpolation. A noise loop lets the image "jitter" around the centre image, creating a rippling effect in the video, while an interpolation from one latent vector to the next creates a silky-smooth morph from one image to another. Combined, the two transition effects produce videos with what we consider a vibrant and hypnotizing feel.
Noise Loop
We implemented the noise loop with the same technique as the neighbour selection process: adding a random noise vector to the latent vector.
Each time we add a random noise vector to the latent vector, we create a "stop." For each image, we create seven such stops, each one a noise step on top of the latent vector before it.
We then linearly interpolate between each pair of consecutive stops, creating 30 frames per pair. In a 30 fps video, the seven stops after the original image therefore take 7 seconds to run through.
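Here is a sketch of that noise loop, reusing the latent w and noise scale sigma from the neighbour-selection snippet; the figures match the text (seven stops, 30 frames per pair, 210 frames = 7 seconds at 30 fps), everything else is our own framing.

```python
import torch

stops = [w]                                     # start from the chosen image's latent
for _ in range(7):                              # seven stops, each one noise step from the last
    stops.append(stops[-1] + torch.randn_like(w[:, 0]).unsqueeze(1) * sigma)

frames = []
for a, b in zip(stops[:-1], stops[1:]):
    for t in torch.linspace(0, 1, 30):          # 30 frames between each pair of stops
        w_t = (1 - t) * a + t * b               # linear interpolation in latent space
        with torch.no_grad():
            frames.append(G.synthesis(w_t))
# 7 intervals x 30 frames = 210 frames, i.e. 7 seconds at 30 fps.
```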
Interpolation
Between these images, we morph from one image to the next, again by linearly interpolating between latent vectors. Each interpolation has 150 frames and accounts for 5 seconds of running time at 30 fps.
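The between-image morph is the same linear interpolation, just between the final latents of two consecutive images; w_a and w_b below are placeholders for those latents.

```python
import torch

frames = []
for t in torch.linspace(0, 1, 150):             # 150 frames = 5 seconds at 30 fps
    w_t = (1 - t) * w_a + t * w_b               # blend the two images' latents
    with torch.no_grad():
        frames.append(G.synthesis(w_t))
```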