Efficiency in Large Language Model Training: LoRA, QLoRA, and GaLore

Training large language models (LLMs) is a resource-intensive process, primarily due to the vast number of parameters involved. Various methods have been developed to improve the efficiency of this process, focusing on reducing memory usage without significantly sacrificing model performance.

LoRA: Low-Rank Adaptation

LoRA (Low-Rank Adaptation) freezes the pre-trained model weights and injects a pair of trainable low-rank matrices, A and B, into selected layers. The weight update is represented by the product BA, whose rank is far smaller than the dimensions of the original weight matrix, so only a small fraction of the parameters needs to be trained. This sharply reduces the number of trainable parameters and the associated optimizer memory while still leveraging the existing, well-optimized model architecture.
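
To make this concrete, here is a minimal sketch of a LoRA-style linear layer in PyTorch (a simplified illustration, not the reference implementation): the base weight is frozen, and only the low-rank factors A and B receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: frozen base weight plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # low-rank factor B, init to zero
        self.scale = alpha / r

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Only A and B are trainable: a 4096x4096 layer with r=8 trains ~65K
# parameters instead of ~16.8M.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536
```

Initializing B to zero means the adapted layer starts out identical to the pre-trained one, so training begins from the original model's behavior.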

QLoRA: Quantized Low-Rank Adaptation

Building on the foundation of LoRA, QLoRA quantizes the frozen pre-trained model to 4-bit precision (the NF4 data type) and backpropagates through it into low-rank adapters kept in higher precision. This preserves the efficiency benefits of LoRA while further reducing the memory footprint of the base model.
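
In practice this is commonly set up with the Hugging Face transformers, peft, and bitsandbytes libraries. The sketch below is only illustrative; the model name, target modules, and hyperparameters are placeholders rather than values from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model with 4-bit NF4 quantization (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model name
    quantization_config=bnb_config,
)

# Attach trainable LoRA adapters on top of the quantized, frozen weights.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # example target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```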

GaLore: Gradient Low-Rank Projection

Despite its advantages, LoRA has a limitation: it is designed for fine-tuning, and because the weight updates are confined to a low-rank subspace, it can lead to degraded accuracy, particularly for pre-training. GaLore addresses this by supporting both pre-training and fine-tuning. Rather than approximating the weight update as LoRA does, GaLore works on the gradients themselves. It periodically applies Singular Value Decomposition (SVD) to a layer's gradient matrix to obtain low-rank projection matrices (P and Q), projects the gradient into that low-rank subspace, keeps the optimizer states there, and projects the resulting update back to full rank before applying it to the weights. Because the projection is refreshed iteratively and only the low-rank optimizer states are stored, memory usage drops substantially while the full-rank weights are still updated.
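
The following toy sketch shows the projection idea for a single weight matrix (simplified to plain momentum instead of Adam, and not the official galore-torch optimizer): the SVD is refreshed only every few hundred steps, the optimizer state lives in the rank-r subspace, and the update is projected back to full rank.

```python
import torch

def galore_style_step(weight, grad, state, rank=4, update_gap=200, lr=1e-3, beta=0.9):
    """One simplified GaLore-style update for a single 2-D weight matrix.

    state holds: 'step', the projection 'P' (m x r), and the momentum 'm_lowrank' (r x n),
    so only rank-sized optimizer state is stored instead of a full m x n moment matrix.
    """
    step = state.get("step", 0)
    if step % update_gap == 0 or "P" not in state:
        # Refresh the subspace: top-r left singular vectors of the current gradient.
        U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                      # m x r projection
    P = state["P"]

    g_low = P.T @ grad                                # project the gradient: r x n
    m = state.get("m_lowrank", torch.zeros_like(g_low))
    m = beta * m + (1 - beta) * g_low                 # momentum kept in the low-rank space
    state["m_lowrank"] = m
    state["step"] = step + 1

    weight -= lr * (P @ m)                            # project the update back to full rank
    return weight

# Usage: optimizer state is r x n instead of m x n.
W = torch.randn(1024, 1024)
g = torch.randn(1024, 1024)
state = {}
W = galore_style_step(W, g, state, rank=4)
```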

Q-GaLore: Adaptive Quantization in Gradient Subspaces

To make GaLore even more efficient, Q-GaLore combines quantization with adaptive updates of the gradient subspaces. It stores the projection matrices in a 4-bit format and the model weights in 8-bit, in contrast to the 16-bit representations used in GaLore, and it refreshes a layer's projection only when that layer's gradient subspace has actually changed, skipping redundant SVD computations. This further cuts memory requirements while letting training adapt to how the gradient subspace evolves.
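
The sketch below illustrates these two ingredients in a heavily simplified form. It is not the released Q-GaLore code: the paper stores projections in INT4 and decides when to refresh the subspace from convergence statistics, whereas this toy version uses INT8 storage (which PyTorch supports natively) and a simple "energy captured" heuristic.

```python
import torch

def quantize_int8(t):
    """Symmetric per-tensor INT8 quantization; returns int8 values and a scale."""
    scale = t.abs().max() / 127.0 + 1e-12
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

def maybe_refresh_projection(grad, state, rank=4, threshold=0.4):
    """Recompute the projection only when the stored subspace no longer captures the gradient."""
    if "P_q" in state:
        P = dequantize(state["P_q"], state["P_scale"])
        # Fraction of gradient energy captured by the stored subspace (toy criterion).
        captured = (P.T @ grad).norm() / grad.norm()
        if captured > threshold:
            return P                                    # subspace still good: skip the SVD
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]
    state["P_q"], state["P_scale"] = quantize_int8(P)   # keep the projection in low-bit storage
    return P
```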

LoRA: https://arxiv.org/abs/2106.09685
QLoRA: https://arxiv.org/abs/2305.14314
GaLore: https://arxiv.org/abs/2403.03507
Q-GaLore: https://arxiv.org/abs/2407.08296
