Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning

1 Mohamed bin Zayed University of Artificial Intelligence, UAE
2 Georgia Institute of Technology, USA
3 Massachusetts Institute of Technology, USA
* Indicates Equal Contribution

Abstract

Low-rank adapters have become a standard approach for efficiently fine-tuning large language models (LLMs), but they often fall short of achieving the performance of full fine-tuning. We propose a method, LoRA Silver Bullet or LoRA-SB, that approximates full fine-tuning within low-rank subspaces using a carefully designed initialization strategy. We theoretically demonstrate that the architecture of LoRA-XS—which inserts a trainable r × r matrix between B and A while keeping other matrices fixed—provides the precise conditions needed for this approximation. We leverage its constrained update space to achieve optimal scaling for high-rank gradient updates while removing the need for hyperparameter tuning. We prove that our initialization offers an optimal low-rank approximation of the initial gradient and preserves update directions throughout training. Extensive experiments across mathematical reasoning, commonsense reasoning, and language understanding tasks demonstrate that our approach exceeds the performance of standard LoRA while using 27–90× fewer parameters, and comprehensively outperforms LoRA-XS. Our findings establish that it is possible to simulate full fine-tuning in low-rank subspaces and achieve significant efficiency gains without sacrificing performance.

LoRA-SB Illustration

LoRA-XS reduces the parameter count relative to LoRA by inserting a trainable r × r matrix R between B and A while keeping the other matrices fixed, giving W = W0 + sBRA. Our method, LoRA-SB, uses the same architecture. We observe that updating R with its gradient gR is equivalent to updating the full fine-tuning matrix W with an equivalent gradient gSB = sBgRA. We initialize B, R, and A so that the equivalent gradient gSB optimally approximates the full fine-tuning gradient g in low-rank subspaces at every training step. In essence, we optimally simulate the entire full fine-tuning process within low-rank subspaces, using only the initial gradient g1 (shown in green) from full fine-tuning.
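
To make the gradient equivalence concrete, here is a minimal PyTorch sketch (toy shapes, a toy squared loss, and variable names of our choosing) that numerically checks the chain-rule identity gR = s B^T g A^T underlying the equivalent gradient gSB = sBgRA:

    import torch

    torch.manual_seed(0)
    d_out, d_in, r, s = 16, 12, 4, 1.0

    W0 = torch.randn(d_out, d_in)              # frozen pretrained weight
    B = torch.randn(d_out, r)                  # frozen (stand-in values)
    A = torch.randn(r, d_in)                   # frozen (stand-in values)
    R = torch.zeros(r, r, requires_grad=True)  # the only trainable matrix

    x = torch.randn(8, d_in)

    # Loss through the adapted weight W = W0 + s * B R A.
    loss = (x @ (W0 + s * B @ R @ A).T).pow(2).mean()
    loss.backward()

    # The same loss evaluated at W directly gives the full fine-tuning gradient g.
    W = (W0 + s * B @ R.detach() @ A).requires_grad_(True)
    (x @ W.T).pow(2).mean().backward()
    g = W.grad

    # Chain rule: g_R = s * B^T g A^T, so updating R moves W by the
    # equivalent gradient g_SB = s * B g_R A described in the caption above.
    assert torch.allclose(R.grad, s * B.T @ g @ A.T, atol=1e-5)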

Contributions

  • We theoretically formalize the limitations of LoRA-XS, showing how its constrained update space leads to suboptimal gradient approximation, initialization sensitivity, and hyperparameter dependence.
  • We propose a principled initialization strategy derived from approximating the first step of full fine-tuning, proving that it provides an optimal low-rank approximation of the initial gradient and preserves update directions throughout training (see the sketch following this list).
  • We prove that our initialization makes gradient optimization hyperparameter-independent and guarantees convergence by maintaining orthonormal bases, eliminating the need for any tuning of the scaling factor.
  • Through extensive experiments on 4 models across 16 datasets covering mathematical reasoning, commonsense reasoning, and language understanding tasks, we demonstrate that our method surpasses the performance of LoRA while using 27–90× fewer parameters, and comprehensively outperforms LoRA-XS.
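
As referenced in the list above, the initialization idea can be illustrated in a few lines. By the Eckart-Young theorem, the truncated SVD of the initial gradient g1 yields its optimal rank-r approximation; the particular assignment below (B = U_r, A = V_r^T, R = S_r / s) is one natural choice consistent with the illustration caption, and is meant as a sketch rather than the exact recipe from the paper:

    import torch

    def lora_sb_init(g1: torch.Tensor, r: int, s: float = 1.0):
        # Truncated SVD of the first full fine-tuning gradient: by the
        # Eckart-Young theorem, U_r S_r V_r^T is the best rank-r
        # approximation of g1 in Frobenius norm.
        U, S, Vh = torch.linalg.svd(g1, full_matrices=False)
        B = U[:, :r]               # frozen, orthonormal columns
        A = Vh[:r, :]              # frozen, orthonormal rows
        R = torch.diag(S[:r]) / s  # trainable r x r core
        return B, R, A

    g1 = torch.randn(64, 32)  # stand-in for the first full fine-tuning gradient
    s = 1.0
    B, R, A = lora_sb_init(g1, r=4, s=s)
    approx = s * B @ R @ A    # optimal rank-4 approximation of g1
    err = torch.linalg.norm(g1 - approx) / torch.linalg.norm(g1)
    print(f"relative rank-4 approximation error: {err:.3f}")

The orthonormal bases B and A are then frozen; only the r × r core R is trained.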

Method

Please refer to our paper for a detailed explanation of our method and the associated proofs. We provide pseudocode for our algorithm below.

LoRA-SB Algorithm
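
For readers who prefer code to a figure, the following self-contained PyTorch sketch mirrors the pipeline described above for a single weight matrix: estimate the first full fine-tuning gradient g1 on one batch, initialize B, R, and A from its truncated SVD, then train R alone. All names (train_lora_sb, loss_fn, loader) are illustrative; please refer to the paper and code for the actual algorithm.

    import torch

    def train_lora_sb(W0, loss_fn, loader, r, s=1.0, lr=1e-4, steps=1000):
        # Step 1: one full fine-tuning backward pass to estimate g1.
        W = W0.clone().requires_grad_(True)
        x, y = next(iter(loader))
        loss_fn(x @ W.T, y).backward()
        g1 = W.grad.detach()

        # Step 2: initialize B, R, A from the truncated SVD of g1, so that
        # s * B @ R @ A equals the optimal rank-r approximation of g1.
        U, S, Vh = torch.linalg.svd(g1, full_matrices=False)
        B, A = U[:, :r], Vh[:r, :]                        # frozen orthonormal bases
        R = (torch.diag(S[:r]) / s).requires_grad_(True)  # only R is trained

        # Step 3: ordinary training, updating R alone.
        opt = torch.optim.AdamW([R], lr=lr)
        for _, (x, y) in zip(range(steps), loader):
            opt.zero_grad()
            loss_fn(x @ (W0 + s * B @ R @ A).T, y).backward()
            opt.step()
        return B, R.detach(), A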

Main Results

Please refer to our paper for detailed results across all models and datasets.

BibTeX


    @misc{ponkshe2024initializationusingupdateapproximation,
      title={Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning},
      author={Kaustubh Ponkshe and Raghav Singhal and Eduard Gorbunov and Alexey Tumanov and Samuel Horvath and Praneeth Vepakomma},
      year={2024},
      eprint={2411.19557},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.19557},
    }