Vanishing Gradients


Comparing the activation functions

Description

This example explains the problem of vanishing gradients, which you may encounter when training a deep neural network, and shows how certain activation functions help prevent it. Vanishing gradients describes the situation where a deep multilayer feed-forward network or a recurrent neural network is unable to propagate useful gradient information from the output end of the model back to the layers near the input end.

Many fixes and workarounds have been proposed and investigated, such as alternate weight initialization schemes, unsupervised pre-training, layer-wise training, and variations on gradient descent. Perhaps the most common change is the use of the rectified linear activation function and its modifications.


How does the choice of activation function avoid vanishing gradients?


Activation functions like the sigmoid squash a large input space into a small output range between 0 and 1. A large change in the input of the sigmoid therefore causes only a small change in the output, so the derivative is small. A small gradient means that the weights and biases of the initial layers are barely updated in each training step. Since these initial layers are often crucial to recognizing the core elements of the input data, this can lead to inaccuracy of the whole network.
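
As a rough illustration (not the demo's own code), the NumPy sketch below shows that the sigmoid derivative never exceeds 0.25, so a gradient backpropagated through many sigmoid layers shrinks roughly geometrically.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Even at its best point, each sigmoid layer scales the backpropagated
# gradient by at most 0.25.
print(sigmoid_derivative(0.0))  # 0.25

# After passing back through 10 sigmoid layers, the gradient has shrunk
# by at most 0.25 ** 10, i.e. to roughly a millionth of its size.
print(0.25 ** 10)  # ~9.5e-07
```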

The simplest solution is to use other activation functions, such as ReLU or Leaky ReLU, whose derivatives do not shrink toward zero. The really nice thing about ReLU is that its gradient is either 0 or 1, so it does not saturate on the positive side and gradients are passed through the network without shrinking. However, the problem of dead ReLUs can still occur: the gradient may become exactly 0 for units whose input is negative. This is addressed by its modification, Leaky ReLU.
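
For contrast, here is a similar illustrative sketch of the ReLU and Leaky ReLU derivatives; the 0.01 slope used for Leaky ReLU is an assumed value, not one taken from the demo.

```python
import numpy as np

def relu_derivative(x):
    # 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

def leaky_relu_derivative(x, alpha=0.01):
    # 1 for positive inputs, a small slope alpha otherwise
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, 3.0])
print(relu_derivative(x))        # [0. 1.]   -> the negative unit is "dead"
print(leaky_relu_derivative(x))  # [0.01 1.] -> a small gradient still flows

# For active units the per-layer factor is exactly 1, so stacking 10 such
# layers leaves the gradient unchanged instead of shrinking it.
print(1.0 ** 10)  # 1.0
```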


You can visualize this by plotting these activation functions (and several more) using the options given below.

About the dataset and model

This example uses a fully connected neural network. The features used for each flower are the petal length and width as well as the sepal length and width. The data comes from the famous Iris flower data set.
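
The demo's own implementation is not reproduced here, but as a rough stand-in the sketch below loads the four Iris features with scikit-learn and builds a small fully connected network; the layer sizes and activation are placeholders for the options the demo exposes.

```python
import numpy as np
from sklearn.datasets import load_iris

# Four features per flower: sepal length/width and petal length/width.
X, y = load_iris(return_X_y=True)   # X: (150, 4), y: 3 classes

# A small fully connected network: 4 inputs -> hidden layers -> 3 outputs.
# The sizes stand in for the demo's num_layers / num_neurons_per_Layer options.
layer_sizes = [4, 8, 8, 3]

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x, activation=np.tanh):
    """Forward pass; the activation is configurable, as in the demo."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(h @ W + b)
    return h @ weights[-1] + biases[-1]   # raw class scores (logits)

print(forward(X[:5]).shape)  # (5, 3)
```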

Instructions

  1. Using the options below, you can set the activation function, num_layers, num_neurons_per_Layer, batch size, learning rate, and num_iterations.

  2. You can visualize the neural network of your choice and inspect the gradient with respect to each weight from the intensity of the links connecting the neurons. Positive gradients are represented by blue links and negative gradients by red links.
    Note that the gradient values from the final iteration are used when drawing the network architecture.

  3. In each iteration, a batch of the chosen size is randomly selected from 120 of the 150 examples provided by the Iris dataset, and the model parameters are then optimized using gradient descent; the remaining 30 examples are used for validation (a minimal sketch of this training procedure appears after these instructions). You can also see the value of the loss at each iteration in the console.

  4. A plot of loss vs. iteration will appear when you click the given button.

  5. It is strongly advised to keep the batch size greater than 1; otherwise you may encounter the exploding gradients problem, which shows up as black links throughout the architecture.

  6. Wait for some time after clicking the button.

  7. Change the parameters and press the button again to train the modified neural network.
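
To make the procedure in step 3 concrete, here is a minimal, illustrative training loop; the linear softmax classifier, random seed, and hyperparameter values are placeholders chosen for brevity, not the demo's actual settings.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
perm = rng.permutation(150)
train_idx, val_idx = perm[:120], perm[120:]   # 120 training / 30 validation examples

# A linear softmax classifier stands in for the configurable network here.
W = np.zeros((4, 3))
b = np.zeros(3)
batch_size, learning_rate, num_iterations = 16, 0.05, 200

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for it in range(num_iterations):
    # Randomly sample a batch of the chosen size from the 120 training examples.
    batch = rng.choice(train_idx, size=batch_size, replace=False)
    Xb, yb = X[batch], y[batch]

    probs = softmax(Xb @ W + b)
    loss = -np.log(probs[np.arange(batch_size), yb]).mean()

    # Cross-entropy gradient, then a plain gradient-descent step.
    probs[np.arange(batch_size), yb] -= 1.0
    W -= learning_rate * (Xb.T @ probs / batch_size)
    b -= learning_rate * probs.mean(axis=0)

    print(f"iteration {it}: loss = {loss:.4f}")   # loss logged each iteration, as in the demo

val_acc = (np.argmax(X[val_idx] @ W + b, axis=1) == y[val_idx]).mean()
print("validation accuracy:", val_acc)
```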

Controls

Train Model