Vanishing Gradients


Comparing the activation functions

Description

This example explains the problem of vanishing gradients, which you may encounter when training a deep neural network, and shows how certain activation functions help prevent it. Vanishing gradients describes the situation where a deep multilayer feed-forward network or a recurrent neural network is unable to propagate useful gradient information from the output end of the model back to the layers near the input end.

Many fixes and workarounds have been proposed and investigated, such as alternate weight initialization schemes, unsupervised pre-training, layer-wise training, and variations on gradient descent. Perhaps the most common change is the use of the rectified linear activation function and its modifications.


How does the choice of activation function avoid vanishing gradients?


Activation functions like the sigmoid squash a large input space into a small output range between 0 and 1. A large change in the input of the sigmoid therefore causes only a small change in the output, so the derivative is small. A small gradient means that the weights and biases of the initial layers are barely updated in each training step. Since these initial layers are often crucial to recognizing the core elements of the input data, this can lead to inaccuracy of the whole network.
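
As a rough illustration (not the demo's own code), the NumPy sketch below shows that the sigmoid derivative never exceeds 0.25, so a gradient backpropagated through many sigmoid layers shrinks roughly geometrically.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Even at its best point, each sigmoid layer scales the backpropagated
# gradient by at most 0.25.
print(sigmoid_derivative(0.0))  # 0.25

# After passing back through 10 sigmoid layers, the gradient has shrunk
# by at most 0.25 ** 10, i.e. to roughly a millionth of its size.
print(0.25 ** 10)  # ~9.5e-07
```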

The simplest solution is to use other activation functions, such as ReLU or Leaky ReLU, whose derivatives do not shrink toward zero. The really nice thing about ReLU is that its gradient is either 0 or 1, so it does not saturate on the positive side and gradients are passed through the network without shrinking. However, the problem of dead ReLUs can still occur: the gradient may become exactly 0 for units whose input is negative. This is addressed by its modification, Leaky ReLU.
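
For contrast, here is a similar illustrative sketch of the ReLU and Leaky ReLU derivatives; the 0.01 slope used for Leaky ReLU is an assumed value, not one taken from the demo.

```python
import numpy as np

def relu_derivative(x):
    # 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

def leaky_relu_derivative(x, alpha=0.01):
    # 1 for positive inputs, a small slope alpha otherwise
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, 3.0])
print(relu_derivative(x))        # [0. 1.]   -> the negative unit is "dead"
print(leaky_relu_derivative(x))  # [0.01 1.] -> a small gradient still flows

# For active units the per-layer factor is exactly 1, so stacking 10 such
# layers leaves the gradient unchanged instead of shrinking it.
print(1.0 ** 10)  # 1.0
```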


You can visualize this by plotting these activation functions (and several more) using the options given below.

About the dataset and model

This example uses a fully connected neural network. The features used for each flower are the petal length and width as well as the sepal length and width. The data comes from the famous Iris flower data set.
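
The demo's own implementation is not reproduced here, but as a rough stand-in the sketch below loads the four Iris features with scikit-learn and builds a small fully connected network; the layer sizes and activation are placeholders for the options the demo exposes.

```python
import numpy as np
from sklearn.datasets import load_iris

# Four features per flower: sepal length/width and petal length/width.
X, y = load_iris(return_X_y=True)   # X: (150, 4), y: 3 classes

# A small fully connected network: 4 inputs -> hidden layers -> 3 outputs.
# The sizes stand in for the demo's num_layers / num_neurons_per_Layer options.
layer_sizes = [4, 8, 8, 3]

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x, activation=np.tanh):
    """Forward pass; the activation is configurable, as in the demo."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(h @ W + b)
    return h @ weights[-1] + biases[-1]   # raw class scores (logits)

print(forward(X[:5]).shape)  # (5, 3)
```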

Instructions

  1. Using the options below, you can set the activation function, num_layers, num_neurons_per_Layer, batch size, learning rate, and num_iterations.

  2. You can visualize the neural network of your choice and inspect the gradient with respect to each weight from the intensity of the links connecting the neurons. Positive gradients are represented by blue links and negative gradients by red links.
    Note that the gradient values from the final iteration are used when drawing the network architecture.

  3. In each iteration, a batch of the chosen size is randomly selected from 120 of the 150 examples provided by the Iris dataset, and the model parameters are then optimized using gradient descent; the remaining 30 examples are used for validation (a minimal sketch of this training procedure appears after these instructions). You can also see the value of the loss at each iteration in the console.

  4. A plot of loss vs. iteration will appear when you click the given button.

  5. It is strongly advised to keep the batch size greater than 1; otherwise you may encounter the exploding gradients problem, which shows up as black links throughout the architecture.

  6. Wait for some time after clicking the button.

  7. Change the parameters and press the button again to train the modified neural network.
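
To make the procedure in step 3 concrete, here is a minimal, illustrative training loop; the linear softmax classifier, random seed, and hyperparameter values are placeholders chosen for brevity, not the demo's actual settings.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
perm = rng.permutation(150)
train_idx, val_idx = perm[:120], perm[120:]   # 120 training / 30 validation examples

# A linear softmax classifier stands in for the configurable network here.
W = np.zeros((4, 3))
b = np.zeros(3)
batch_size, learning_rate, num_iterations = 16, 0.05, 200

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for it in range(num_iterations):
    # Randomly sample a batch of the chosen size from the 120 training examples.
    batch = rng.choice(train_idx, size=batch_size, replace=False)
    Xb, yb = X[batch], y[batch]

    probs = softmax(Xb @ W + b)
    loss = -np.log(probs[np.arange(batch_size), yb]).mean()

    # Cross-entropy gradient, then a plain gradient-descent step.
    probs[np.arange(batch_size), yb] -= 1.0
    W -= learning_rate * (Xb.T @ probs / batch_size)
    b -= learning_rate * probs.mean(axis=0)

    print(f"iteration {it}: loss = {loss:.4f}")   # loss logged each iteration, as in the demo

val_acc = (np.argmax(X[val_idx] @ W + b, axis=1) == y[val_idx]).mean()
print("validation accuracy:", val_acc)
```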

Controls

Train Model