Activation Functions

Activation functions play a crucial role in neural networks. They are typically applied in hidden and output layers, but not in the input layer, and a layer without an activation function effectively uses a linear (identity) activation. Here is a closer look at several common types (short NumPy sketches of these formulas and their gradients appear after the list):

  1. Sigmoid: Characterized by its S-shaped curve, the sigmoid function maps any real input to a value between 0 and 1. While useful, it saturates for inputs far from zero, where its gradient approaches zero, making it prone to vanishing gradients; its outputs are also not zero-centered.

  2. Tanh (Hyperbolic Tangent): Similar to the sigmoid in shape but outputs values from -1 to 1, so its outputs are zero-centered. It offers stronger gradients than the sigmoid, making it more effective in some cases, but it still saturates and therefore suffers from the same vanishing gradient problem.

  3. ReLU (Rectified Linear Unit): This function addresses some of the drawbacks of sigmoid and tanh by outputting the input directly if it is positive and zero otherwise. Although it helps mitigate vanishing gradients, ReLU is not differentiable at zero and can lead to "dying neurons": a neuron whose pre-activation becomes consistently negative outputs zero and receives zero gradient, so it stops learning.

  4. Leaky ReLU: To avoid the dying neuron issue, Leaky ReLU modifies ReLU by allowing a small, non-zero output for negative inputs. Typically, f(x) = x if x > 0, otherwise f(x) = 0.01x. Variants such as randomized Leaky ReLU and parametric ReLU allow the negative slope to be randomized or learned during training, adding flexibility.

  5. Parametric ReLU (PReLU): This function generalizes Leaky ReLU by making the negative slope a parameter that is learned during training, rather than being a fixed value. This flexibility can lead to better model performance but increases model complexity.

  6. Sigmoid Linear Unit (SiLU): SiLU, also known as Swish, combines properties of the sigmoid and ReLU: f(x) = x * sigmoid(x). The result is a smooth, non-monotonic function whose output scales dynamically with the input value, which often improves model performance.

  7. Softplus: This function serves as a smooth approximation of ReLU, defined as f(x) = ln(1 + exp(x)). It smoothly transitions outputs based on the input but can be computationally intensive and may still encounter vanishing gradients for very negative inputs. 

  8. Gaussian Error Linear Unit (GELU): GELU is smooth but non-monotonic, and it performs well in many neural network layers. The function is defined as f(x) = x * Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian. Inputs slightly below zero produce a small negative output, while larger negative inputs tend toward zero. This introduces non-linearity in a way that generally benefits training, since inputs just below zero still receive a non-zero gradient.

  9. Exponential Linear Unit (ELU): ELU improves upon ReLU by outputting non-zero values for negative inputs, thereby mitigating the dying neuron problem common with ReLU. The formula is f(x) = x if x > 0, otherwise f(x) = α(exp(x) - 1). Here, α is a positive constant (often 1) that sets the value, -α, to which the function saturates for large negative inputs. Because negative inputs still produce non-zero outputs and gradients, backpropagation can flow through negative regions, which generally results in faster convergence. However, the function still saturates for large negative inputs, where very small gradients can slow learning.

  10. Scaled Exponential Linear Unit (SELU): Building on the advantages of ELU, SELU is designed to be self-normalizing: under suitable conditions it keeps the outputs of each layer near a mean of zero and a standard deviation of one, which helps stabilize gradients in deep networks. The function is expressed as f(x) = λ * x if x > 0, and f(x) = λ * α * (exp(x) - 1) otherwise, where λ ≈ 1.0507 and α ≈ 1.6733 are predefined constants chosen to ensure self-normalization. While SELU can significantly improve training stability, it is computationally more intensive than simpler functions.

  11. Activation Function Hierarchy for Performance: In practice, the performance hierarchy often observed is SELU > ELU > Leaky ReLU > ReLU > Tanh > Logistic (Sigmoid). This ranking is based on each function's ability to handle issues like vanishing gradients and neuron death, as well as their impact on the speed of convergence and generalization in different types of neural network architectures.
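
To make the formulas above concrete, here is a minimal NumPy sketch of each function as described in this post. The function names and default constants (0.01 for Leaky ReLU, α = 1 for ELU, and the approximate published SELU constants λ ≈ 1.0507 and α ≈ 1.6733) are illustrative choices, and scipy.special.erf is used only to compute the Gaussian CDF needed by GELU.

    import numpy as np
    from scipy.special import erf  # for the exact Gaussian CDF used by GELU

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))          # S-shaped, outputs in (0, 1)

    def tanh(x):
        return np.tanh(x)                         # zero-centered, outputs in (-1, 1)

    def relu(x):
        return np.maximum(0.0, x)                 # x if positive, else 0

    def leaky_relu(x, slope=0.01):
        return np.where(x > 0, x, slope * x)      # small slope for negative inputs

    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))       # saturates at -alpha

    def selu(x, lam=1.0507, alpha=1.6733):
        return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))  # scaled ELU

    def softplus(x):
        return np.log1p(np.exp(x))                # smooth approximation of ReLU

    def silu(x):
        return x * sigmoid(x)                     # SiLU / Swish: x scaled by sigmoid(x)

    def gelu(x):
        return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))  # x * Phi(x), Phi = Gaussian CDF

    x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
    for fn in (sigmoid, tanh, relu, leaky_relu, elu, selu, softplus, silu, gelu):
        print(f"{fn.__name__:>10}: {np.round(fn(x), 4)}")

Running this on a few sample inputs shows, for example, that ReLU zeroes out every negative value while Leaky ReLU, ELU, and SELU keep small negative outputs, which is the behavior behind the dying neuron discussion above.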
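
The vanishing gradient comparison in the sigmoid, tanh, and ReLU entries can also be checked numerically. The sketch below (again an illustration, not any framework's API) prints the analytic derivatives of the three functions at increasingly large inputs.

    import numpy as np

    def d_sigmoid(x):
        s = 1.0 / (1.0 + np.exp(-x))
        return s * (1.0 - s)              # peaks at 0.25, shrinks as |x| grows

    def d_tanh(x):
        return 1.0 - np.tanh(x) ** 2      # peaks at 1.0, shrinks as |x| grows

    def d_relu(x):
        return np.where(x > 0, 1.0, 0.0)  # exactly 1 for positive x, 0 otherwise

    x = np.array([0.0, 2.0, 5.0, 10.0])
    print("x        :", x)
    print("sigmoid' :", np.round(d_sigmoid(x), 6))
    print("tanh'    :", np.round(d_tanh(x), 6))
    print("relu'    :", np.round(d_relu(x), 6))

By x = 10 the sigmoid and tanh derivatives are essentially zero, which is the saturation behind the vanishing gradient problem, while ReLU's derivative stays at 1 for any positive input; the flip side, a derivative of exactly 0 for negative inputs, is what allows neurons to die.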



