# Popular Activation Functions & Implementation

In this post we'll implement popular deep learning activation functions from the ground up using numpy.

To follow along you need two things:
- `numpy` and
- Some free time of yours.

If you have written any deep learning code before, you have likely used these activations:
- Softmax
- ReLU, LeakyReLU
- and the good-old Sigmoid activation.

I'll implement each of these activation functions with numpy, along with its derivative for use in back-propagation.

## ReLU
Let's get the easy one out of the way first. ReLU stands for rectified linear unit, which is defined as:
![Screen Shot 2020-10-22 at 10.04.36 PM.png](https://cdn.hashnode.com/res/hashnode/image/upload/v1603384480484/kbISErB8u.png)

```python
import numpy as np


class ReLU:
    """Applies the rectified linear unit function element-wise:
    ReLU(x) = max(0, x)
    """

    def __call__(self, x):
        return np.where(x >= 0, x, 0)

    def gradient(self, x):
        """
        Computes Gradient of ReLU
            Args:
                x: input tensor
            Returns:
                Gradient of ReLU with respect to x (1 where x >= 0, else 0)
        """
        # The derivative is undefined at exactly x = 0; this implementation uses 1 there.
        return np.where(x >= 0, 1, 0)
```

Usage:
```python
relu = ReLU()
z = np.array([0.1, -0.4, 0.7, 1])
print(relu(z))      # ---> array([0.1, 0. , 0.7, 1. ])
print(relu.gradient(z))   # ---> array([1, 0, 1, 1])
```
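
To show where `gradient` fits into back-propagation, here is a minimal sketch of the chain rule for a single ReLU layer. The `upstream` array is a made-up stand-in for the gradient arriving from the next layer, and the class is repeated in compact form so the snippet runs on its own.

```python
import numpy as np


class ReLU:
    def __call__(self, x):
        return np.where(x >= 0, x, 0)

    def gradient(self, x):
        return np.where(x >= 0, 1, 0)


relu = ReLU()
z = np.array([0.1, -0.4, 0.7, 1.0])          # pre-activation values
upstream = np.array([0.5, 0.5, -1.0, 2.0])   # hypothetical dL/d(relu(z)) from the next layer

# Chain rule: dL/dz = dL/d(relu(z)) * d(relu(z))/dz
dz = upstream * relu.gradient(z)
print(dz)  # values: 0.5, 0.0, -1.0, 2.0 (the negative input blocks its gradient)
```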

## Sigmoid
The sigmoid function is defined as follows:

![Screen Shot 2020-10-22 at 10.08.45 PM.png](https://cdn.hashnode.com/res/hashnode/image/upload/v1603384729593/86jxayAM8.png)

The main reason we use the sigmoid function is that its output lies in the range (0, 1). That makes it a natural choice for models that have to predict a probability, since probabilities also lie between 0 and 1. The function is differentiable, which means we can find the slope of the sigmoid curve at any point. The function is monotonic, but its derivative is not.

The logistic sigmoid can cause a neural network to get stuck during training: for inputs with large magnitude the curve flattens out, the gradient approaches zero, and the weights barely update. The softmax function is a generalization of the logistic function that is used for multi-class classification.

`np.exp` works element-wise, so the whole function can be applied to an array in one expression. If you work out the derivative, you find that `d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))`.

![Screen Shot 2020-10-22 at 10.35.21 PM.png](https://cdn.hashnode.com/res/hashnode/image/upload/v1603386331423/KaPNq9yvJ.png)

So let's write the sigmoid activation:

```python
class Sigmoid:
    """
    Applies the element-wise function sigmoid(x) = 1 / (1 + exp(-x))
    Shape:
        - Input: :math:`(N, *)` where `*` means, any number of additional
          dimensions
        - Output: :math:`(N, *)`, same shape as the input
    """

    def __call__(self, x):
        return 1 / (1 + np.exp(-x))

    def gradient(self, x):
        r"""Computes Gradient of Sigmoid
        .. math::
            \frac{\partial}{\partial x} \sigma(x) = \sigma(x)* \left (  1- \sigma(x)\right)
        Args:
            x: input tensor
        Returns:
            Gradient of X
        """

        s = self.__call__(x)  # evaluate the sigmoid once and reuse it
        return s * (1 - s)
```
Usage looks like this:
```python
import numpy as np

z = np.array([0.1, 0.4, 0.7, 1])
sigmoid = Sigmoid()
return_data = sigmoid(z)

print(return_data)          # -> array([0.52497919, 0.59868766, 0.66818777, 0.73105858])
print(sigmoid.gradient(z))  # -> array([0.24937604, 0.24026075, 0.22171287, 0.19661193])

```
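
A quick way to sanity-check the derivative formula is to compare it against a central finite difference. The sketch below uses plain functions (`sigmoid` and `sigmoid_grad` are names chosen for this example) so it runs on its own. It also shows the saturation mentioned earlier: the gradient peaks at 0.25 and is close to zero for inputs with large magnitude.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)


z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
eps = 1e-6

# Central difference approximation of the derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_grad(z)

print(np.allclose(numeric, analytic, atol=1e-6))  # True
print(analytic.max())                             # 0.25, reached at x = 0
print(analytic[[0, -1]])                          # roughly 4.5e-05 at x = -10 and x = 10
```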

## LeakyReLU
LeakyReLU, the leaky version of the rectified linear unit, is similar to ReLU. Instead of outputting zero for every negative input, it scales negative inputs by a small predefined slope such as 0.1 or 0.2.
You pick an `alpha`, and the output is `alpha * x` wherever `x < 0`.

The following image shows the difference between ReLU and LeakyReLU:

![1_A_Bzn0CjUgOXtPCJKnKLqA.jpg](https://cdn.hashnode.com/res/hashnode/image/upload/v1603385917652/v2BXM0WID.jpeg)

```python
class LeakyReLU:
    """Applies the element-wise function LeakyReLU(x) = x if x >= 0 else alpha * x
    Args:
        - alpha: Negative slope value: controls the angle of the negative slope in the :math:`-x` direction. Default: ``0.2``
    """
    def __init__(self, alpha=0.2):
        self.alpha = alpha

    def __call__(self, x):
        return np.where(x >= 0, x, self.alpha * x)

    def gradient(self, x):
        """
        Computes Gradient of LeakyReLU
            Args:
                x: input tensor
            Returns:
                Gradient of X
        """
        return np.where(x >= 0, 1, self.alpha)
```
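
Usage follows the same pattern as ReLU. The class is repeated in compact form here so the snippet runs on its own, and the input values are just an illustration:

```python
import numpy as np


class LeakyReLU:
    def __init__(self, alpha=0.2):
        self.alpha = alpha

    def __call__(self, x):
        return np.where(x >= 0, x, self.alpha * x)

    def gradient(self, x):
        return np.where(x >= 0, 1, self.alpha)


leaky = LeakyReLU(alpha=0.2)
z = np.array([0.1, -0.4, 0.7, 1.0])

out = leaky(z)            # approximately 0.1, -0.08, 0.7, 1.0
grad = leaky.gradient(z)  # 1.0, 0.2, 1.0, 1.0
print(out)
print(grad)
```

Unlike ReLU, the negative input still receives a non-zero gradient (`alpha`), so the corresponding weights keep updating.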

## `tanH` or hyperbolic tangent activation
The sigmoid maps its output to the range (0, 1), whereas `tanH` maps its output to the range (-1, 1). The advantage is that negative inputs are mapped to strongly negative outputs, and inputs near zero are mapped near zero.

The function is differentiable. It is also monotonic, while its derivative is not. The `tanH` function is mainly used for classification between two classes.

Let's implement this in code:
```python
class TanH:
    def __call__(self, x):
        return 2 / (1 + np.exp(-2 * x)) - 1

    def gradient(self, x):
        return 1 - np.power(self.__call__(x), 2)
```
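
The expression in `__call__` uses the identity `tanh(x) = 2 * sigmoid(2x) - 1`. As a sanity check, the sketch below compares it with numpy's built-in `np.tanh`; the class is repeated so the snippet runs on its own.

```python
import numpy as np


class TanH:
    def __call__(self, x):
        return 2 / (1 + np.exp(-2 * x)) - 1

    def gradient(self, x):
        return 1 - np.power(self.__call__(x), 2)


tanh = TanH()
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# Forward pass matches numpy's built-in implementation
print(np.allclose(tanh(z), np.tanh(z)))                    # True
# Derivative is 1 - tanh(x)^2
print(np.allclose(tanh.gradient(z), 1 - np.tanh(z) ** 2))  # True
```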

## Softmax
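Softmax is the activation used on the output layer for multi-class classification: it turns a vector of scores into a probability distribution that sums to 1. Subtracting the maximum before exponentiating keeps `np.exp` from overflowing and does not change the result. So here is the code: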

```python
class Softmax:
    def __call__(self, x):
        e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return e_x / np.sum(e_x, axis=-1, keepdims=True)

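    def gradient(self, x):
        # Note: this is only the diagonal of the softmax Jacobian (d p_i / d x_i).
        # The cross terms d p_i / d x_j = -p_i * p_j (for i != j) are not included.
        p = self.__call__(x)
        return p * (1 - p)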
```
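
The `gradient` method above returns `p * (1 - p)`, which is the diagonal of the softmax Jacobian. Because every output depends on every input, the full derivative for a single 1-D input is a matrix with entries `J[i, j] = p[i] * (delta_ij - p[j])`. Here is a minimal sketch of that matrix, where `softmax` and `softmax_jacobian` are helper names chosen for this example:

```python
import numpy as np


def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)


def softmax_jacobian(x):
    """Full Jacobian for a 1-D input: J[i, j] = p[i] * (delta_ij - p[j])."""
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)


z = np.array([1.0, 2.0, 3.0])
p = softmax(z)
J = softmax_jacobian(z)

print(np.isclose(p.sum(), 1.0))              # True: outputs form a probability distribution
print(np.allclose(np.diag(J), p * (1 - p)))  # True: the diagonal matches p * (1 - p)
print(np.allclose(J.sum(axis=0), 0))         # True: each column sums to zero
print(np.allclose(softmax(z + 1000), p))     # True: shifting the inputs does not change the output
```

When softmax is paired with a cross-entropy loss, the combined gradient with respect to the inputs simplifies to `p - y` for a one-hot target `y`, so the full Jacobian is often not built explicitly.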

