Activation functions

Solomon
4 min read · Aug 23, 2022


Photo by Tara Winstead: https://www.pexels.com/photo/robot-s-hand-on-a-blue-background-8386437/

Activation functions are the building blocks of neural networks: they introduce non-linearity so the network can learn patterns in the data. In this blog, we will discuss a few activation functions.

1. Step

The step function is typically used to introduce the concept of an activation function, but it is rarely used in practice because it is not differentiable, so we cannot compute gradients with it. Think of the sigmoid as a smoothed, differentiable approximation of the step function.
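
A minimal sketch (assuming TensorFlow 2.x) contrasting a hand-rolled step function with its sigmoid approximation; the step is flat almost everywhere, so it gives no useful gradient, while the sigmoid does:

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])

# Hard step: 1 for x >= 0, else 0. Flat everywhere, so gradients are zero.
step = tf.where(x >= 0.0, 1.0, 0.0)

# Sigmoid: a smooth, differentiable approximation of the step.
with tf.GradientTape() as tape:
    tape.watch(x)
    sig = tf.keras.activations.sigmoid(x)

print(step.numpy())                   # [0. 0. 1. 1. 1.]
print(sig.numpy())                    # smooth values between 0 and 1
print(tape.gradient(sig, x).numpy())  # non-zero gradients, usable for backprop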

2. Sigmoid

Since its output falls between 0 and 1, sigmoid is used for binary classification, where the output can be interpreted as a probability. It is not widely used in hidden layers because of the vanishing gradient issue.

Tensorflow Method : tf.keras.activations.sigmoid
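
A minimal sketch (assuming TensorFlow 2.x; the layer sizes and input shape are made up) of a binary classifier with a sigmoid output layer, so the prediction can be read as P(class = 1):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output in (0, 1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])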

3. Linear

When the output is not restricted to a certain range, for example in regression, a linear activation function should be used.

Tensorflow Method : tf.keras.activations.linear
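
A minimal sketch (assuming TensorFlow 2.x; shapes are made up) of a regression model whose output layer keeps the linear activation, so predictions are not squashed into a fixed range:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),  # same as passing no activation
])
model.compile(optimizer="adam", loss="mse")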

4. ReLU

Since both the function and its gradient are very cheap to compute, ReLU is used in hidden layers for faster convergence. One of its disadvantages is dead neurons, which can be avoided by using Leaky ReLU.

Tensorflow Method : tf.keras.activations.relu
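
A minimal sketch (assuming TensorFlow 2.x) showing how ReLU zeroes out negative inputs, which is the mechanism behind dead neurons, and how the alpha argument gives the leaky variant a small negative slope:

import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 2.0])

# Standard ReLU: negative inputs become 0, so a neuron stuck in the
# negative region receives zero gradient and can "die".
print(tf.keras.activations.relu(x).numpy())             # [0. 0. 0. 2.]

# Leaky ReLU behaviour: a small slope keeps gradients flowing for x < 0.
print(tf.keras.activations.relu(x, alpha=0.1).numpy())  # [-0.3 -0.1  0.   2. ]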

5. Tanh

Tanh provides stronger gradients than sigmoid: the range of its derivative is larger, which mitigates the vanishing gradient problem. Experiments also show that tanh converges faster than sigmoid because it is zero centered (the mean of its output is close to zero).

Tensorflow Method : tf.keras.activations.tanh
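
A quick numerical illustration (assuming TensorFlow 2.x) of the zero-centered claim: over a symmetric range of inputs, tanh outputs average to roughly 0 while sigmoid outputs average to roughly 0.5:

import tensorflow as tf

x = tf.linspace(-3.0, 3.0, 101)
print(tf.reduce_mean(tf.keras.activations.tanh(x)).numpy())     # ~0.0
print(tf.reduce_mean(tf.keras.activations.sigmoid(x)).numpy())  # ~0.5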

6. Exponential Linear Unit (ELU)

ELU does not suffer from dead neurons and is less prone to exploding or vanishing gradients, and it has been observed to converge faster than ReLU [1]. We can try this activation in the hidden layers instead of ReLU and check which gives better performance.

Tensorflow Method : tf.keras.activations.elu
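
A minimal sketch (assuming TensorFlow 2.x; the architecture is made up) of that experiment: build the same network once with ReLU and once with ELU in the hidden layers, train both, and compare validation performance:

import tensorflow as tf

def build_model(hidden_activation):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation=hidden_activation),
        tf.keras.layers.Dense(64, activation=hidden_activation),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

relu_model = build_model("relu")
elu_model = build_model("elu")  # train both and compare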

7. Gaussian Error Linear Unit (GELU)

GELU is used in BERT, GPT-3, and other transformers because it outperforms ReLU/Leaky ReLU [3]. With Leaky ReLU, the output falls off along a straight line for negative inputs, whereas GELU smooths the negative region and tends toward zero.

Tensorflow Method : tf.keras.activations.gelu
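
A small comparison (assuming TensorFlow 2.x) of how the two handle negative inputs: leaky ReLU keeps a fixed linear slope, while GELU smoothly pushes larger negative values toward zero:

import tensorflow as tf

x = tf.constant([-3.0, -1.0, -0.1, 0.0, 1.0])

print(tf.keras.activations.gelu(x).numpy())             # smooth, tends to 0 for large negatives
print(tf.keras.activations.relu(x, alpha=0.2).numpy())  # leaky-ReLU: straight line for x < 0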

8. Hard Sigmoid

Hard sigmoid is faster than sigmoid because it is a piecewise-linear function. We can use it when we prefer computation speed over exactness, for example for a quick check of a complex neural network.

Tensorflow Method : tf.keras.activations.hard_sigmoid
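
A quick check (assuming TensorFlow 2.x) of the two shapes: hard sigmoid is a linear ramp clipped to [0, 1], so it is cheaper to evaluate than the smooth sigmoid curve while producing similar values:

import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])

print(tf.keras.activations.hard_sigmoid(x).numpy())  # piecewise-linear ramp clipped to [0, 1]
print(tf.keras.activations.sigmoid(x).numpy())       # smooth curve with similar values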

9. Scaled Exponential Linear Unit (SELU)

SELU performs implicit (self-)normalization: as the data, even noisy data, passes through the network, SELU pushes the activations toward mean 0 and variance 1, which acts as a form of regularization.

Tensorflow Method : tf.keras.activations.selu
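
A minimal sketch (assuming TensorFlow 2.x; the architecture is made up). SELU's self-normalizing behaviour relies on the weights being initialized with lecun_normal (and on AlphaDropout instead of regular Dropout, if dropout is needed):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])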

10. Softplus

Softplus is similar to ReLU but is a smooth, differentiable curve for negative inputs. The derivative of softplus is the sigmoid function. It is not widely used as of now.

Tensorflow Method : tf.keras.activations.softplus
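
A numerical check (assuming TensorFlow 2.x) that the derivative of softplus is indeed the sigmoid function:

import tensorflow as tf

x = tf.constant([-2.0, 0.0, 2.0])

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.keras.activations.softplus(x)

print(tape.gradient(y, x).numpy())              # gradient of softplus at x
print(tf.keras.activations.sigmoid(x).numpy())  # same values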

11. Softsign

Softsign looks like tanh, ranging between -1 and +1, but it is a gentler function that approaches its limits more slowly. It can be used as an alternative to tanh when it gives better performance, but it is not widely used because of its slow convergence.

Tensorflow Method : tf.keras.activations.softsign
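
A quick comparison (assuming TensorFlow 2.x): both squash inputs into (-1, 1), but softsign, x / (1 + |x|), approaches the limits much more slowly than tanh:

import tensorflow as tf

x = tf.constant([-5.0, -1.0, 0.0, 1.0, 5.0])

print(tf.keras.activations.softsign(x).numpy())  # e.g. softsign(5) ~ 0.83
print(tf.keras.activations.tanh(x).numpy())      # e.g. tanh(5) ~ 0.9999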

12. Swish

Swish has been observed to perform better than ReLU. In experiments where ReLU was simply replaced with Swish in the hidden layers, top-1 accuracy on ImageNet improved by 0.9% for Mobile NASNet-A and by 0.6% for Inception-ResNet-v2 [2].

Tensorflow Method : tf.keras.activations.swish
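
A minimal sketch (assuming TensorFlow 2.x) showing that swish(x) = x * sigmoid(x) and that it can be dropped into a hidden layer in place of ReLU:

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])

print(tf.keras.activations.swish(x).numpy())
print((x * tf.keras.activations.sigmoid(x)).numpy())  # identical values

# Drop-in replacement for ReLU in a hidden layer:
layer = tf.keras.layers.Dense(64, activation="swish")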

13. Exponential

ELUs show faster learning and significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers [1]. The exponential activation itself simply computes e^x for its input.

Tensorflow Method : tf.keras.activations.exponential

References

  1. https://arxiv.org/abs/1511.07289v5
  2. https://blog.paperspace.com/swish-activation-function/
  3. https://arxiv.org/pdf/1606.08415.pdf#page=6


Solomon

Passionate about Data Science and applying Machine Learning and Deep Learning algorithms