Posts

Fine Tuning, Prompt Tuning, and Prompt Engineering

Differences Between Fine Tuning, Prompt Tuning, and Prompt Engineering

Fine Tuning: This involves re-training a pre-trained model to adapt it for a specific task. It adjusts the model's weights through additional training, which can require significant computational resources. The entire model is updated to optimize performance for the new task.

Prompt Tuning: Unlike fine tuning, prompt tuning involves adding a set of trainable parameters (or soft prompts) to the model's input without altering the original model's weights. This approach allows the model to adapt to new tasks while keeping the pre-existing weights fixed, making it less computationally intensive than fine tuning (see the sketch below).

Prompt Engineering: This method relies entirely on crafting effective input prompts for the model. It does not involve any computational training or modification of model parameters. Prompt engineering is about designing prompts that effectively guide the model to generate the desired output. Two Approa...
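To make the contrast concrete, here is a minimal PyTorch-style sketch of the prompt-tuning idea: a handful of trainable soft-prompt embeddings is prepended to the input while the backbone stays frozen. The `base_model` interface, embedding size, and number of soft tokens are illustrative assumptions, not any particular library's API.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Minimal prompt-tuning sketch: trainable soft-prompt embeddings are
    prepended to a frozen model's input embeddings. `base_model` is assumed
    to accept embeddings directly (a hypothetical interface)."""

    def __init__(self, base_model: nn.Module, embed_dim: int, n_soft_tokens: int = 20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # pre-trained weights stay fixed
        # The only parameters updated during prompt tuning.
        self.soft_prompt = nn.Parameter(torch.randn(n_soft_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.base_model(torch.cat([prompt, input_embeds], dim=1))
```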

Efficiency in Large Language Model Training: LoRA, QLoRA, and GaLore

Training large language models (LLMs) is a resource-intensive process, primarily due to the vast number of parameters involved. Various methods have been developed to improve the efficiency of this process, focusing on reducing memory usage without significantly sacrificing model performance.

LoRA: Low-Rank Adaptation

LoRA (Low-Rank Adaptation) introduces two low-rank matrices, A and B, into the training process. The pre-trained model weights are kept frozen, which sharply reduces the number of trainable parameters. By adapting only these low-rank matrices, LoRA enables efficient task-specific tuning while leveraging existing, well-optimized model architectures (see the sketch below).

QLoRA: Quantized Low-Rank Adaptation

Building on the foundation of LoRA, QLoRA combines a 4-bit quantized pre-trained model with low-rank adapters. This approach retains the efficiency benefits of LoRA while further reducing the memory footprint through quantization.

GaLore: Gradi...
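Below is a rough PyTorch sketch of the LoRA idea described above: the original linear weight is frozen, and only the low-rank factors A and B receive gradients. The rank, scaling, and initialization values here are illustrative, not the exact settings from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: a frozen pre-trained linear layer plus a trainable
    low-rank update. Only A and B are updated during fine-tuning."""

    def __init__(self, pretrained: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():
            p.requires_grad = False  # W0 stays frozen
        in_f, out_f = pretrained.in_features, pretrained.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(out_f, rank))        # factor B, zero-init so training starts from W0
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W0^T + scale * x A^T B^T
        return self.pretrained(x) + self.scale * (x @ self.A.T @ self.B.T)
```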

R-tuning

R-tuning: Large language models (LLMs) often generate incorrect or hallucinated content. Various methods have been proposed to address this challenge, including Retrieval-Augmented Generation (RAG) techniques. This paper introduces a novel approach called R-tuning, aimed at teaching LLMs to better handle uncertainty in questions. The authors applied a pre-trained model to a dataset of questions and their corresponding answers, and divided the dataset into two subsets based on whether the predicted answer matched the ground truth:

D0: The subset where the model's prediction does not match the ground truth.
D1: The subset where the model's prediction aligns with the ground truth.

In the D1 subset, where predictions were accurate, they prepended the phrase "I am sure" to the model's responses. Conversely, in the D0 subset, where predictions were incorrect, they used the padding "I am unsure" (a rough sketch of this data split appears below). This method of explici...
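The data construction can be sketched roughly as follows. `model_predict` and the (question, answer) pairs are hypothetical placeholders; the exact prompt templates and placement of the certainty phrases are those of the R-Tuning paper rather than this simplification.

```python
def build_rtuning_split(model_predict, qa_pairs):
    """Rough sketch of the refusal-aware data split described above."""
    d0, d1 = [], []  # D0: prediction != ground truth, D1: prediction == ground truth
    for question, answer in qa_pairs:
        prediction = model_predict(question)
        if prediction.strip().lower() == answer.strip().lower():
            # Model already answers correctly: mark the target with certainty.
            d1.append((question, f"I am sure. {answer}"))
        else:
            # Model gets it wrong: train it to express uncertainty instead.
            d0.append((question, f"I am unsure. {answer}"))
    return d0, d1
```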

Flash Attention

Advancements in Attention Mechanisms: Flash Attention vs. Vanilla Attention

Attention mechanisms are pivotal in modeling sequences in deep learning. Vanilla attention, with its O(n^2) complexity, involves multiplying queries with keys and values, which can be computationally expensive. To optimize this, methods like sparse attention and low-rank approximations have been introduced. However, these methods are mere approximations of the exact attention mechanism.

Flash Attention: A Breakthrough in Attention Mechanism Efficiency

Flash Attention emerges as a true game-changer by providing exact attention computations at significantly reduced cost. Unlike earlier optimizations, which focus primarily on reducing floating point operations (FLOPs) while often neglecting memory access overheads, Flash Attention addresses both aspects, boasting an attention complexity of O(n). This is a stark improvement over the O(n log n) ...
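The core trick can be illustrated outside a GPU kernel with a block-wise, online-softmax computation. This NumPy sketch only mimics the tiling idea; it captures none of Flash Attention's fused-kernel or memory-hierarchy details.

```python
import numpy as np

def blockwise_attention(Q, K, V, block_size=64):
    """Exact attention computed block by block with an online softmax, so the
    full n x n score matrix is never materialized at once."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)  # running max of scores per query row
    row_sum = np.zeros(n)          # running softmax denominator per row

    for start in range(0, K.shape[0], block_size):
        Kb, Vb = K[start:start + block_size], V[start:start + block_size]
        scores = (Q @ Kb.T) * scale                   # (n, block)
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)        # rescale what was accumulated so far
        probs = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + probs @ Vb
        row_sum = row_sum * correction + probs.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against the naive implementation that builds the full score matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
scores = Q @ K.T / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), reference)
```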

Activation Functions

Activation functions play a crucial role in neural networks, typically employed in hidden and output layers, but not in input layers. By default, the absence of an activation function implies a linear activation. Here's a closer look at several common types (minimal implementations follow below):

Sigmoid: Characterized by its S-shaped curve, the sigmoid function outputs values between 0 and 1 for any input ranging from negative to positive infinity. While useful, it is prone to vanishing gradient issues because it saturates at both extremes, and its outputs are not zero-centered.

Tanh (Hyperbolic Tangent): Similar to the sigmoid in shape but outputs values from -1 to 1. It offers stronger gradients than sigmoid, making it more effective in some cases. However, it still suffers from vanishing gradient problems like its sigmoid counterpart.

ReLU (Rectified Linear Unit): This function addresses some of the drawbacks of sigmoid and tanh by outputting the input directly if it is positive; otherwise, it outputs zero. Although i...
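For reference, here are minimal NumPy definitions of the three activations discussed above, with a small example input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # outputs in (0, 1), not zero-centered

def tanh(x):
    return np.tanh(x)                # outputs in (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)        # passes positives through, zeroes out negatives

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))  # values squashed into (0, 1)
print(tanh(x))     # values squashed into (-1, 1)
print(relu(x))     # negatives clipped to 0
```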

Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts

Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts

Retrieval-Augmented Generation (RAG) models often grapple with challenges stemming from the use of imperfect, irrelevant, or misleading information during the retrieval process. Despite the prevalence of these issues, there is scant research on the conflicts that arise between a large language model's (LLM) internal knowledge and the external sources it retrieves from. To address this gap, the authors introduce Astute RAG, a refined approach designed to enhance the synergy between LLMs and retrieval systems.

Astute RAG improves upon traditional RAG models by meticulously combining consistent information from both internal and external sources. It employs advanced mechanisms to identify and resolve conflicts between these sources, ensuring that only relevant and accurate information influences the generation process. By filtering out misleading or irrelevant content, Astute RAG significantly enhances the reliability a...
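A very rough, hypothetical sketch of the workflow described above might look like the following. The `llm` callable, the prompts, and the step ordering are assumptions made for illustration, not the paper's actual interface.

```python
def astute_rag_answer(llm, question: str, retrieved_passages: list[str]) -> str:
    """Hypothetical sketch: elicit internal knowledge, consolidate it with
    retrieved passages while discarding conflicting or irrelevant claims,
    then answer from the consolidated context."""
    # 1. Elicit what the model already "knows" about the question.
    internal = llm(f"From your own knowledge, write a short passage answering: {question}")

    # 2. Consolidate internal and external passages, dropping conflicting
    #    or misleading claims.
    consolidated = llm(
        "Combine the passages below, keeping only information that is consistent "
        "and relevant to the question; drop conflicting or misleading claims.\n"
        f"Question: {question}\n"
        f"Internal passage: {internal}\n"
        f"Retrieved passages: {retrieved_passages}"
    )

    # 3. Generate the final answer from the filtered, consolidated context.
    return llm(f"Answer the question using only this context.\n"
               f"Context: {consolidated}\nQuestion: {question}")
```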

KAN: Kolmogorov-Arnold Network

Introducing Kolmogorov-Arnold Networks (KANs): A Novel Approach to Deep Learning Architectures

While Multilayer Perceptrons (MLPs) have been foundational to the development of deep learning architectures, their design places activation functions directly on neurons. In this work, the authors propose a transformative approach called Kolmogorov-Arnold Networks (KANs), which repositions activation functions from neurons to the connections between them, specifically onto the weights. This change is not a minor tweak but is deeply rooted in mathematical approximation theory, and the research demonstrates that KANs offer improved accuracy and interpretability over traditional MLPs.

The approach is based on the Kolmogorov-Arnold representation theorem (KART), contrasting sharply with the universal approximation theorem (UAT) that inspires MLPs. While UAT posits that a network cannot achieve infinite accuracy with a fixed width, KART suggests the possibility under certain conditio...
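As a toy illustration of the idea (not the paper's implementation), the sketch below puts a small learnable univariate function on every input-output edge, here a radial-basis expansion standing in for the splines used in KANs, and lets each output node simply sum its incoming edges.

```python
import torch
import torch.nn as nn

class KANLayerSketch(nn.Module):
    """Toy KAN-style layer: each edge (i -> o) carries its own learnable
    univariate function, parameterized by coefficients over fixed Gaussian
    basis functions; each output node sums its incoming edge outputs."""

    def __init__(self, in_dim: int, out_dim: int, n_basis: int = 8):
        super().__init__()
        # Fixed basis centers on a grid; per-edge coefficients are learned.
        self.register_buffer("centers", torch.linspace(-2.0, 2.0, n_basis))
        self.coeffs = nn.Parameter(torch.randn(out_dim, in_dim, n_basis) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim) -> Gaussian basis responses: (batch, in_dim, n_basis)
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)
        # phi_{oi}(x_i) = sum_b coeffs[o, i, b] * basis_b(x_i); nodes sum over i.
        return torch.einsum("bif,oif->bo", basis, self.coeffs)

# Example: a tiny two-layer KAN-style network mapping 2 inputs to 1 output.
model = nn.Sequential(KANLayerSketch(2, 5), KANLayerSketch(5, 1))
print(model(torch.randn(4, 2)).shape)  # torch.Size([4, 1])
```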