`docs/source/dpo_trainer.mdx` (2 lines changed: 2 additions & 0 deletions)
@@ -119,6 +119,8 @@ The [NCA](https://arxiv.org/abs/2402.05369) authors show that NCA optimizes the
The [TR-DPO](https://arxiv.org/pdf/2404.09656) paper suggests syncing the reference model weights after every `ref_model_sync_steps` steps of SGD with weight `ref_model_mixup_alpha` during DPO training. To toggle this callback, use the `sync_ref_model` flag in the `DPOConfig`.
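The soft sync described above can be sketched with plain Python lists standing in for model parameters. The function name `maybe_sync_ref` is illustrative, not TRL's API; `alpha` and `sync_steps` mirror the `ref_model_mixup_alpha` and `ref_model_sync_steps` config fields:

```python
# Illustrative sketch of TR-DPO-style reference-model syncing (not TRL's API).
# Every `sync_steps` optimizer steps, the reference weights are pulled toward
# the policy weights: ref <- alpha * policy + (1 - alpha) * ref.

def maybe_sync_ref(policy_weights, ref_weights, step, alpha=0.6, sync_steps=512):
    """Return the (possibly updated) reference weights after optimizer `step`."""
    if step % sync_steps != 0:
        return ref_weights  # no sync on this step
    return [alpha * p + (1 - alpha) * r for p, r in zip(policy_weights, ref_weights)]

# Example: with alpha=0.5, the synced weight is the midpoint of policy and ref.
policy = [1.0, 2.0]
ref = [0.0, 0.0]
ref = maybe_sync_ref(policy, ref, step=512, alpha=0.5, sync_steps=512)
print(ref)  # [0.5, 1.0]
```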
The [RPO](https://arxiv.org/abs/2404.19733) paper implements an iterative preference tuning algorithm using a loss related to the RPO loss in this [paper](https://arxiv.org/abs/2405.16436), which essentially consists of the SFT loss on the chosen preferences together with a weighted DPO loss. To use this loss, set `rpo_alpha` in the `DPOConfig` to an appropriate value.
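A minimal sketch of such a combined objective, with scalar log-probabilities standing in for batched model outputs. The DPO loss form, `beta`, and the exact placement of the `rpo_alpha` weight are assumptions for illustration; TRL's implementation operates on tensors and its weighting may differ in detail:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO: -log sigmoid(beta * (policy margin - reference margin)).
    margin = (logp_chosen - logp_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def rpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
             beta=0.1, rpo_alpha=1.0):
    # DPO loss plus the SFT (negative log-likelihood) loss on the chosen
    # response, weighted by rpo_alpha (one common weighting; illustrative).
    nll_chosen = -logp_chosen
    return (dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta)
            + rpo_alpha * nll_chosen)
```

With `rpo_alpha = 0` this reduces to the plain DPO loss.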
## Logging
While training and evaluating, we record the following reward metrics:
     """Compute the log probabilities of the given labels under the given logits.

     Args:
         logits: Logits of the model (unnormalized). Shape: (batch_size, sequence_length, vocab_size)
         labels: Labels for which to compute the log probabilities. Label tokens with a value of label_pad_token_id are ignored. Shape: (batch_size, sequence_length)
-        average_log_prob: If True, return the average log probability per (non-masked) token. Otherwise, return the sum of the log probabilities of the (non-masked) tokens.
         label_pad_token_id: The label pad token id.
         is_encoder_decoder: Whether the model is an encoder-decoder model.

     Returns:
-        A tensor of shape (batch_size,) containing the average/sum log probabilities of the given labels under the given logits.
+        A tuple of two tensors of shape ((batch_size,), (batch_size,)) containing the sum of log probabilities of the given labels under the given logits in the first tensor and the number of non-masked tokens in the second tensor.
     """
     if logits.shape[:-1] != labels.shape:
         raise ValueError("Logits (batch and sequence length dim) and labels must have the same shape.")
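The post-change behavior can be sketched outside the trainer with NumPy standing in for torch. The helper name is hypothetical, and the label shift that decoder-only models need is omitted for brevity:

```python
import numpy as np

def get_batch_logps_sketch(logits, labels, label_pad_token_id=-100):
    """Sketch: per-sequence sum of label log-probs and non-masked token counts."""
    # Numerically stable log-softmax over the vocabulary dimension.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

    mask = labels != label_pad_token_id
    safe_labels = np.where(mask, labels, 0)  # dummy index for padded positions
    per_token_logps = np.take_along_axis(
        log_probs, safe_labels[..., None], axis=-1
    ).squeeze(-1)

    sum_logps = (per_token_logps * mask).sum(axis=-1)  # shape (batch_size,)
    n_tokens = mask.sum(axis=-1)                       # shape (batch_size,)
    return sum_logps, n_tokens
```

A caller that wants the average (as the removed `average_log_prob=True` path provided) can divide the first tensor by the second.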