
Commit ed00f24

DOC Update arxiv links to hf
1 parent 30a19a0 commit ed00f24


10 files changed, +12 −12 lines changed


docs/source/conceptual_guides/oft.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.

  # Orthogonal Finetuning (OFT and BOFT)

- This conceptual guide gives a brief overview of [OFT](https://huggingface.co/papers/2306.07280), [OFTv2](https://www.arxiv.org/abs/2506.19847) and [BOFT](https://huggingface.co/papers/2311.06243), a parameter-efficient fine-tuning technique that utilizes orthogonal matrix to multiplicatively transform the pretrained weight matrices.
+ This conceptual guide gives a brief overview of [OFT](https://huggingface.co/papers/2306.07280), [OFTv2](https://huggingface.co/papers/2506.19847) and [BOFT](https://huggingface.co/papers/2311.06243), a parameter-efficient fine-tuning technique that utilizes orthogonal matrix to multiplicatively transform the pretrained weight matrices.

  To achieve efficient fine-tuning, OFT represents the weight updates with an orthogonal transformation. The orthogonal transformation is parameterized by an orthogonal matrix multiplied to the pretrained weight matrix. These new matrices can be trained to adapt to the new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive any further adjustments. To produce the final results, both the original and the adapted weights are multiplied together.
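The guide text above describes OFT as multiplying a trainable orthogonal matrix into the frozen pretrained weight. The following is a minimal, illustrative sketch of that idea only, not PEFT's OFT implementation: it keeps the matrix orthogonal with a Cayley transform of a trainable skew-symmetric parameter, and all names are hypothetical.

```python
# Minimal sketch of the multiplicative orthogonal update described above
# (illustrative only, not PEFT's OFT code). R is kept orthogonal via the
# Cayley transform of a trainable skew-symmetric matrix; the pretrained
# weight W stays frozen, and the adapted weight is R @ W.
import torch

d_out, d_in = 8, 4
W = torch.randn(d_out, d_in)                         # frozen pretrained weight

# Trainable parameter: an unconstrained matrix, made skew-symmetric below.
# Initialized to zero so the initial R is the identity (no change to W).
Q = torch.zeros(d_out, d_out, requires_grad=True)

def orthogonal_R(Q):
    S = Q - Q.T                                      # skew-symmetric
    I = torch.eye(Q.shape[0])
    return (I + S) @ torch.linalg.inv(I - S)         # Cayley transform -> orthogonal

x = torch.randn(2, d_in)
W_adapted = orthogonal_R(Q) @ W                      # multiplicative update of the frozen weight
y = x @ W_adapted.T
print(y.shape)  # torch.Size([2, 8])
```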

docs/source/package_reference/road.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.

  # RoAd

- [RoAd](https://arxiv.org/pdf/2409.00119) is a parameter‑efficient fine‑tuning technique that adapts large language models by learning a small set of 2×2 rotation matrices (and optional scaling factors) applied to pairs of hidden dimensions. RoAd achieves competitive or superior performance compared to other PEFT methods with under 0.1% trainable parameters. Unlike LoRA’s batched low‑rank updates, RoAd’s sparse rotations reformulate to simple element‑wise operations, yielding significantly higher serving throughput when handling heterogeneous requests in the same batch, i.e. serving multiple adapters simultaneously. Moreover, RoAd integrates seamlessly into a distributed interchange intervention framework, interpreting its sparse 2D rotations as task-specific interventions within learned subspaces of hidden representations. These orthogonal subspaces can be composed to merge multiple task-specific behaviors—like multilingual capabilities or instruction following—without additional fine-tuning, enabling modular, interpretable adaptations in LLMs.
+ [RoAd](https://huggingface.co/papers/2409.00119) is a parameter‑efficient fine‑tuning technique that adapts large language models by learning a small set of 2×2 rotation matrices (and optional scaling factors) applied to pairs of hidden dimensions. RoAd achieves competitive or superior performance compared to other PEFT methods with under 0.1% trainable parameters. Unlike LoRA’s batched low‑rank updates, RoAd’s sparse rotations reformulate to simple element‑wise operations, yielding significantly higher serving throughput when handling heterogeneous requests in the same batch, i.e. serving multiple adapters simultaneously. Moreover, RoAd integrates seamlessly into a distributed interchange intervention framework, interpreting its sparse 2D rotations as task-specific interventions within learned subspaces of hidden representations. These orthogonal subspaces can be composed to merge multiple task-specific behaviors—like multilingual capabilities or instruction following—without additional fine-tuning, enabling modular, interpretable adaptations in LLMs.

  Finetuning with RoAd typically requires a higher learning rate than LoRA or similar methods, around 1e-3. Currently RoAd only supports linear layers and it can be used on models quantized with bitsandbytes (4-bit or 8-bit).
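The paragraph above notes that RoAd's 2×2 rotations reduce to element-wise operations. Below is a hedged sketch of that reformulation applied to an activation vector, assuming one learned angle and one optional scale per dimension pair; it is illustrative only and not PEFT's RoAd code.

```python
# Illustrative sketch of the per-pair 2x2 rotation described above (not PEFT's RoAd code).
# Each pair of hidden dimensions (h[2i], h[2i+1]) is rotated by a learned angle and
# optionally scaled; written out, the rotation is just element-wise multiplies and adds.
import torch

hidden = 8                                              # must be even: dimensions are paired
theta = torch.zeros(hidden // 2, requires_grad=True)    # one angle per pair
scale = torch.ones(hidden // 2, requires_grad=True)     # optional per-pair scaling

def road_rotate(h):
    h1, h2 = h[..., 0::2], h[..., 1::2]                 # split into dimension pairs
    cos, sin = scale * torch.cos(theta), scale * torch.sin(theta)
    r1 = cos * h1 - sin * h2                            # [[cos, -sin],
    r2 = sin * h1 + cos * h2                            #  [sin,  cos]] per pair, element-wise
    return torch.stack((r1, r2), dim=-1).flatten(-2)    # re-interleave the pairs

h = torch.randn(3, hidden)
print(road_rotate(h).shape)   # torch.Size([3, 8])
```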

docs/source/package_reference/shira.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.

  # Sparse High Rank Adapters

- Sparse High Rank Adapters or [SHiRA](https://arxiv.org/abs/2406.13175) is an alternate type of adapter and has been found to have significant advantages over the low rank adapters. Specifically, SHiRA achieves better accuracy than LoRA for a variety of vision and language tasks. It also offers simpler and higher quality multi-adapter fusion by significantly reducing concept loss, a common problem faced by low rank adapters. SHiRA directly finetunes a small number of the base model's parameters to finetune the model on any adaptation task.
+ Sparse High Rank Adapters or [SHiRA](https://huggingface.co/papers/2406.13175) is an alternate type of adapter and has been found to have significant advantages over the low rank adapters. Specifically, SHiRA achieves better accuracy than LoRA for a variety of vision and language tasks. It also offers simpler and higher quality multi-adapter fusion by significantly reducing concept loss, a common problem faced by low rank adapters. SHiRA directly finetunes a small number of the base model's parameters to finetune the model on any adaptation task.

  SHiRA currently has the following constraint:
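Since the paragraph above says SHiRA directly finetunes a small number of the base model's parameters, here is a minimal sketch of that sparse-update idea, assuming a fixed random mask; it is illustrative only, not PEFT's SHiRA implementation, and all names are hypothetical.

```python
# Illustrative sketch of the sparse high-rank idea described above (not PEFT's SHiRA code).
# Only a small, fixed set of weight entries receives a trainable delta; the rest of the
# base weight stays frozen. Because the delta is not constrained to be low rank, the
# resulting update can be high rank.
import torch

torch.manual_seed(0)
W = torch.randn(16, 16)                          # frozen base weight
mask = (torch.rand_like(W) < 0.02).float()       # ~2% of entries are marked trainable
delta = torch.zeros_like(W, requires_grad=True)  # trainable sparse update

def adapted_weight():
    return W + mask * delta                      # only masked entries get non-zero gradients

x = torch.randn(4, 16)
y = x @ adapted_weight().T
print(y.shape, int(mask.sum()), "trainable entries")
```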

docs/source/package_reference/waveft.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.

  # WaveFT: Wavelet Fine-Tuning

- [WaveFT](https://arxiv.org/abs/2505.12532) is a novel parameter-efficient fine-tuning (PEFT) method that introduces sparse updates in the **wavelet domain** of residual matrices. Unlike LoRA, which is constrained by discrete low-rank choices, WaveFT enables fine-grained control over the number of trainable parameters by directly learning a sparse set of coefficients in the transformed space. These coefficients are then mapped back to the weight domain via the Inverse Discrete Wavelet Transform (IDWT), producing high-rank updates without incurring inference overhead.
+ [WaveFT](https://huggingface.co/papers/2505.12532) is a novel parameter-efficient fine-tuning (PEFT) method that introduces sparse updates in the **wavelet domain** of residual matrices. Unlike LoRA, which is constrained by discrete low-rank choices, WaveFT enables fine-grained control over the number of trainable parameters by directly learning a sparse set of coefficients in the transformed space. These coefficients are then mapped back to the weight domain via the Inverse Discrete Wavelet Transform (IDWT), producing high-rank updates without incurring inference overhead.

  WaveFT currently has the following constraint:
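The paragraph above says WaveFT learns a sparse set of coefficients in the wavelet domain and maps them back to the weight domain with the IDWT. The sketch below illustrates only that mapping, using PyWavelets with a Haar wavelet, a random sparsity pattern, and an arbitrary packing of the sub-bands; it is not PEFT's WaveFT implementation.

```python
# Illustrative sketch of the wavelet-domain update described above (not PEFT's WaveFT code).
# A sparse coefficient matrix is defined in the wavelet domain and mapped back to the
# weight domain with a 2D inverse discrete wavelet transform (here via PyWavelets).
import numpy as np
import pywt

rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d))                      # frozen base weight

# Sparse coefficients (~2% non-zero), packed here into one (d, d) array and split
# into the four sub-bands that pywt.idwt2 expects; the packing is an assumption.
coeffs = rng.standard_normal((d, d)) * (rng.random((d, d)) < 0.02)

h = d // 2
cA, cH, cV, cD = coeffs[:h, :h], coeffs[:h, h:], coeffs[h:, :h], coeffs[h:, h:]
delta_W = pywt.idwt2((cA, (cH, cV, cD)), "haar")     # back to the weight domain

W_adapted = W + delta_W                              # additive high-rank residual update
print(W_adapted.shape, int((coeffs != 0).sum()), "non-zero coefficients")
```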

examples/road_finetuning/README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@
33

44
## Introduction
55

6-
[RoAd](https://arxiv.org/pdf/2409.00119) is a novel method that adapts LLMs using simple 2D rotations. It is highly parameter-efficient,
6+
[RoAd](https://huggingface.co/papers/2409.00119) is a novel method that adapts LLMs using simple 2D rotations. It is highly parameter-efficient,
77
achieving strong performance with less than 0.1% trainable parameters.
88
RoAd also supports efficient serving of mixed-adapter requests within a batch, incurring only element-wise computation overhead rather than costly batch matrix multiplications.
99
Additionally, it improves model interpretability through structured and composable transformations.
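The introduction above highlights serving mixed-adapter requests within one batch using only element-wise computation. Building on the per-pair rotation sketch shown earlier, the hedged example below assumes per-request adapter ids indexing into stacked per-adapter angles; it is illustrative only, not PEFT's serving code, and all names are hypothetical.

```python
# Illustrative sketch of mixed-adapter batching with 2D rotations (not PEFT's serving code).
# Each request in the batch selects its own adapter's angles, and applying them is still
# a single element-wise computation, with no per-adapter batched matrix multiplications.
import torch

hidden, n_adapters = 8, 3
theta = torch.randn(n_adapters, hidden // 2)        # one set of pair-angles per adapter

def apply_mixed(h, adapter_ids):
    t = theta[adapter_ids]                          # (batch, hidden/2): per-request angles
    cos, sin = torch.cos(t), torch.sin(t)
    h1, h2 = h[..., 0::2], h[..., 1::2]
    return torch.stack((cos * h1 - sin * h2,
                        sin * h1 + cos * h2), dim=-1).flatten(-2)

h = torch.randn(4, hidden)                          # 4 heterogeneous requests in one batch
adapter_ids = torch.tensor([0, 2, 1, 0])            # each request picks its own adapter
print(apply_mixed(h, adapter_ids).shape)            # torch.Size([4, 8])
```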

examples/shira_finetuning/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
  # Sparse High Rank Adapters

  ## Introduction
- Sparse High Rank Adapters or [SHiRA](https://arxiv.org/abs/2406.13175) is an alternate type of adapter and has been found to have significant advantages over the low rank adapters. Specifically, SHiRA achieves better accuracy than LoRA for a variety of vision and language tasks. It also offers simpler and higher quality multi-adapter fusion by significantly reducing concept loss, a common problem faced by low rank adapters. SHiRA directly finetunes a small number of the base model's parameters to finetune the model on any adaptation task.
+ Sparse High Rank Adapters or [SHiRA](https://huggingface.co/papers/2406.13175) is an alternate type of adapter and has been found to have significant advantages over the low rank adapters. Specifically, SHiRA achieves better accuracy than LoRA for a variety of vision and language tasks. It also offers simpler and higher quality multi-adapter fusion by significantly reducing concept loss, a common problem faced by low rank adapters. SHiRA directly finetunes a small number of the base model's parameters to finetune the model on any adaptation task.

  ## Quick start
  ```python

examples/waveft_finetuning/README.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
  # WaveFT: Wavelet Fine-Tuning

  ## Introduction
- [WaveFT](https://arxiv.org/abs/2505.12532) is a novel parameter-efficient fine-tuning (PEFT) method that introduces sparse updates in the **wavelet domain** of residual matrices. Unlike LoRA, which is constrained by discrete low-rank choices, WaveFT enables fine-grained control over the number of trainable parameters by directly learning a sparse set of coefficients in the transformed space. These coefficients are then mapped back to the weight domain via the Inverse Discrete Wavelet Transform (IDWT), producing high-rank updates without incurring inference overhead.
+ [WaveFT](https://huggingface.co/papers/2505.12532) is a novel parameter-efficient fine-tuning (PEFT) method that introduces sparse updates in the **wavelet domain** of residual matrices. Unlike LoRA, which is constrained by discrete low-rank choices, WaveFT enables fine-grained control over the number of trainable parameters by directly learning a sparse set of coefficients in the transformed space. These coefficients are then mapped back to the weight domain via the Inverse Discrete Wavelet Transform (IDWT), producing high-rank updates without incurring inference overhead.

  ## Quick start
  ```python

src/peft/tuners/lora/arrow.py

Lines changed: 2 additions & 2 deletions
@@ -164,8 +164,8 @@ def gen_know_sub(self, lora_A, lora_B):
      This function performs General Knowledge Subtraction. It takes an average of provided general_adapters, and
      subtract it from each task_adapter. This subtraction tries to purify the task adapters, based on
      "forgetting-via-negation" principle. Forgetting-via-negation is a task-arithmetic operation, explained in:
-     https://arxiv.org/abs/2212.04089 The task adapters will be more focused and isolated, enhancing the performance
-     on new tasks.
+     https://huggingface.co/papers/2212.04089 The task adapters will be more focused and isolated, enhancing the
+     performance on new tasks.

      Args:
          lora_A : Matrices A in LoRA layer.
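The docstring above describes averaging the general adapters and subtracting that average from each task adapter. The sketch below shows only that arithmetic on the A matrices, with hypothetical tensors; it is not the actual gen_know_sub implementation, which also handles the B matrices and per-layer structure.

```python
# Illustrative sketch of the subtraction described in the docstring above (not the actual
# gen_know_sub code). The general adapters are averaged and that average is subtracted
# from every task adapter, leaving a more task-specific ("purified") component.
import torch

rank, d = 4, 16
general_A = [torch.randn(rank, d) for _ in range(3)]   # A matrices of "general" adapters
task_A = [torch.randn(rank, d) for _ in range(2)]      # A matrices of task adapters

general_mean = torch.stack(general_A).mean(dim=0)      # average general knowledge
purified_A = [A - general_mean for A in task_A]        # forgetting-via-negation step

print(purified_A[0].shape)   # torch.Size([4, 16])
```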

src/peft/tuners/lora/config.py

Lines changed: 2 additions & 2 deletions
@@ -74,8 +74,8 @@ class ArrowConfig:
      """
      This is the sub-configuration class to store the configuration for Arrow and GenKnowSub algorithm. Arrow is a
      routing algorithm to combine the trained LoRA modules to solve new tasks, proposed in
-     'https://arxiv.org/pdf/2405.11157'. GenKnowSub is a refinement on the trained modules before being combined via
-     Arrow, introduced in 'https://aclanthology.org/2025.acl-short.54/'
+     'https://huggingface.co/papers/2405.11157'. GenKnowSub is a refinement on the trained modules before being combined
+     via Arrow, introduced in 'https://aclanthology.org/2025.acl-short.54/'
      """

      top_k: int = field(
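The docstring above describes Arrow as a routing algorithm that combines trained LoRA modules for new tasks, with a top_k field controlling how many modules are mixed. The sketch below is one hedged reading of such top-k routing: the use of each module's top right singular direction as a prototype and the softmax weighting are assumptions made for illustration, and this is not PEFT's Arrow implementation.

```python
# Hedged sketch of token-level top-k routing over trained LoRA modules, in the spirit of
# the Arrow description above (illustrative assumptions: prototypes are each module's top
# right singular vector of delta_W = B @ A, and the selected scores are softmax-weighted).
import torch

n_modules, rank, d, k = 4, 8, 32, 2
lora_A = [torch.randn(rank, d) for _ in range(n_modules)]
lora_B = [torch.randn(d, rank) for _ in range(n_modules)]

# One prototype direction per module, taken from the SVD of its LoRA update.
protos = torch.stack([torch.linalg.svd(B @ A).Vh[0] for A, B in zip(lora_A, lora_B)])

def arrow_route(x):                                    # x: (tokens, d)
    scores = (x @ protos.T).abs()                      # alignment of each token with each module
    top_vals, top_idx = scores.topk(k, dim=-1)         # keep only the top_k modules per token
    weights = top_vals.softmax(dim=-1)                 # per-token mixing weights
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                        # combine the selected modules' outputs
        for w, i in zip(weights[t], top_idx[t].tolist()):
            out[t] += w * (lora_B[i] @ (lora_A[i] @ x[t]))
    return out

x = torch.randn(5, d)
print(arrow_route(x).shape)    # torch.Size([5, 32])
```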

src/peft/tuners/road/config.py

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@
  class RoadConfig(PeftConfig):
      """
      This is the configuration class to store the configuration of a [`RoadModel`]. RoAd adapter is proposed in
-     https://arxiv.org/pdf/2409.00119.
+     https://huggingface.co/papers/2409.00119.

      Args:
          variant (Union[`RoadVariant`, `str`]):
