
Commit e30300b

fix docs (#3514)
1 parent 1fcc3f4 commit e30300b

4 files changed (+645, -0)

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
```diff
@@ -46,6 +46,7 @@ Swift DOCUMENTATION
    BestPractices/GRPO完整流程.md
    BestPractices/GRPO多模态训练.md
    BestPractices/GRPO代码训练.md
+   BestPractices/Embedding训练.md
    BestPractices/NPU支持.md
    BestPractices/更多最佳实践.md
```

Lines changed: 343 additions & 0 deletions
@@ -0,0 +1,343 @@

# Complete Multimodal GRPO Experiment Workflow

This document explains how to use SWIFT GRPO to train multimodal models, with the goal of improving accuracy on several multimodal tasks. The task definitions, training parameters, and so on follow [R1-V](https://github.com/Deep-Agent/R1-V.git) and [open-r1-multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal.git).

---

## **ClevrCount Task**

### **Task and Dataset Definition**

This task is based on the `clevr_cogen_a_train` dataset. The model's goal is to output the number of objects in the image. Therefore, we define the dataset as follows:

```python
from typing import Any, Dict

# Registration utilities from ms-swift's public API.
from swift.llm import DatasetMeta, ResponsePreprocessor, SubsetDataset, register_dataset


class ClevrPreprocessor(ResponsePreprocessor):

    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
        query = row.get('query', '')
        # Append the <think>/<answer> formatting instruction to every query.
        query = f"""{query} Output the thinking process in <think> </think> and
 final answer (number) in <answer> </answer> tags."""
        row.update({'query': query})
        return super().preprocess(row)


register_dataset(
    DatasetMeta(
        ms_dataset_id='okwinds/clevr_cogen_a_train',
        subsets=[
            SubsetDataset(
                name='default',
                subset='default',
                split=['train'],
            ),
        ],
        preprocess_func=ClevrPreprocessor(),
        tags=['qa', 'math']))
```

The dataset preprocessor is redefined here only to modify the query. A sample entry is shown below; it contains the `messages`, `images`, and `solution` fields. `solution` is consumed by the reward function, while `messages` and `images` serve as the model input.

```json
{
    'images': [{'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\xe0\x00\x00\x01@\x08\x06\x00\x00\x00d\xc8\xafB`\x82 ...', 'path': 'CLEVR_trainA_000000.png'}],
    'messages': [{'role': 'user', 'content': 'How many items are there in the image? Output the thinking process in <think> </think> and\n final answer (number) in <answer> </answer> tags.'}, {'role': 'assistant', 'content': '<answer> 3 </answer>'}],
    'solution': '<answer> 3 </answer>'
}
```
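
To sanity-check the registration and the query rewriting, you can load a sample locally. The following is a minimal sketch, assuming the registration code above has already been imported (for example from the plugin file) and that `swift.llm` exposes `load_dataset`; adapt it to your setup.

```python
# Optional check: load the registered dataset and inspect one preprocessed sample.
from swift.llm import load_dataset

if __name__ == '__main__':
    # load_dataset returns (train_dataset, val_dataset); no validation split is requested here.
    train_dataset, _ = load_dataset(['okwinds/clevr_cogen_a_train'])
    sample = train_dataset[0]
    print(sample['messages'])  # query with the <think>/<answer> instruction appended
    print(sample['solution'])  # ground truth later consumed by the reward function
```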

---

### **Reward Function Definition**

This task uses two reward functions: the format reward function described in `DeepSeek-R1`, and an accuracy reward function for ClevrCount. The former is built into SWIFT and can be enabled with `--reward_funcs format`. The latter has to be custom-defined; here we use the `external_plugin` mechanism and place the code in `swift/examples/train/grpo/plugin/plugin.py`.
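
The built-in format reward checks that a completion follows the `<think> </think><answer> </answer>` template described in DeepSeek-R1. The snippet below is only an illustration of that idea, not SWIFT's actual implementation:

```python
import re
from typing import List


# Illustrative only: a DeepSeek-R1-style format check. In SWIFT the built-in
# reward is selected with `--reward_funcs format`; this sketch just shows the idea.
def format_reward(completions: List[str]) -> List[float]:
    pattern = re.compile(r'^<think>.*?</think>\s*<answer>.*?</answer>$', re.DOTALL)
    return [1.0 if pattern.match(c.strip()) else 0.0 for c in completions]
```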

The reward function receives `completions` and `solution`, the model-generated texts and the ground truths, respectively. Both are lists, so multiple completions can be scored in one call. Note that the `solution` field is passed through unchanged from the dataset definition, so if the task changes, the dataset and the reward function can be adjusted accordingly.

```python
import re
from typing import List

from swift.plugin import ORM, orms


class MultiModalAccuracyORM(ORM):

    def __call__(self, completions, solution, **kwargs) -> List[float]:
        """
        Reward function that checks if the completion is correct.
        Args:
            completions (list[str]): Generated outputs
            solution (list[str]): Ground Truths.

        Returns:
            list[float]: Reward scores
        """
        rewards = []
        from math_verify import parse, verify
        for content, sol in zip(completions, solution):
            reward = 0.0
            # Try symbolic verification first
            try:
                answer = parse(content)
                if float(verify(answer, parse(sol))) > 0:
                    reward = 1.0
            except Exception:
                pass  # Continue to next verification method if this fails

            # If symbolic verification failed, try string matching
            if reward == 0.0:
                try:
                    # Extract answer from solution if it has think/answer tags
                    sol_match = re.search(r'<answer>(.*?)</answer>', sol)
                    ground_truth = sol_match.group(1).strip() if sol_match else sol.strip()

                    # Extract answer from content if it has think/answer tags
                    content_match = re.search(r'<answer>(.*?)</answer>', content)
                    student_answer = content_match.group(1).strip() if content_match else content.strip()

                    # Compare the extracted answers
                    if student_answer == ground_truth:
                        reward = 1.0
                except Exception:
                    pass  # Keep reward as 0.0 if both methods fail
            rewards.append(reward)
        return rewards


orms['external_r1v_acc'] = MultiModalAccuracyORM
```
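
Before launching a full training run, the reward function can be exercised on a couple of hand-written examples. This is an optional sanity check, run in the same file or session as the definition above, and it assumes `math_verify` is installed:

```python
# Quick local check of the accuracy reward defined above.
acc_reward = MultiModalAccuracyORM()

completions = [
    '<think>I count three objects.</think> <answer>3</answer>',  # correct
    '<think>Maybe five?</think> <answer>5</answer>',             # wrong
]
solution = ['<answer> 3 </answer>', '<answer> 3 </answer>']

print(acc_reward(completions, solution))  # expected: [1.0, 0.0]
```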

---

### **GRPO Training Experiment Log**

#### **Training Parameters**

We selected `Qwen2.5-VL-3B-Instruct` as the base model for training; the main reason for choosing the `Instruct` model over the base model is to obtain the format reward quickly. The experiments were run on 8 GPUs. SWIFT GRPO supports deploying the rollout model on dedicated GPUs to accelerate sampling, so we set `num_infer_workers` to 2 and `NPROC_PER_NODE` to 6 (2 GPUs for vLLM deployment, 6 GPUs for training). If you encounter deployment errors for `qwen2.5-vl` on `vllm`, refer to [this issue](https://github.com/vllm-project/vllm/issues/13285).

Since the task is simple, we set `max_completion_length` to 1024 and use `external_r1v_acc` and `format` as the reward functions. The learning rate and beta are set to `1e-6` and `0.001`, respectively. The remaining configuration is shown below; for how to set `batch_size` and `num_generations`, see [GRPO Full Workflow](./GRPO完整流程.md).

```shell
WANDB_API_KEY=your_wandb_api_key \
NPROC_PER_NODE=6 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--external_plugins examples/train/grpo/plugin/plugin.py \
--reward_funcs external_r1v_acc format \
--use_vllm true \
--vllm_device auto \
--vllm_gpu_memory_utilization 0.6 \
--train_type full \
--torch_dtype bfloat16 \
--dataset 'okwinds/clevr_cogen_a_train' \
--max_length 8192 \
--max_completion_length 1024 \
--num_train_epochs 1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 2 \
--save_strategy 'steps' \
--eval_strategy 'steps' \
--eval_steps 1000 \
--save_steps 1000 \
--save_total_limit 10 \
--logging_steps 1 \
--output_dir output/GRPO_CLEVR_COUNTDOWN \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--num_generations 24 \
--temperature 1.0 \
--system 'examples/train/grpo/prompt.txt' \
--deepspeed zero3 \
--log_completions true \
--vllm_max_model_len 1024 \
--report_to wandb \
--num_iterations 1 \
--num_infer_workers 2 \
--async_generate false \
--beta 0.001
```

#### **Experimental Observations**

![image.png](../../resources/grpo_clevr_count.png)

- Given how simple the dataset and task are, the model converged after roughly 500 training steps. Key observations:
    1. The custom accuracy reward (`external_r1v_acc`) increased steadily, showing that the model learned how to complete the task; the task success rate climbed from an initial 0.4 to nearly 1.
    2. The `Format Reward` stayed stable at 1, likely because all dataset samples share the same query format.
    3. The `reward_std` stabilized below 0.1.
    4. The `completion length` eventually settled between 60 and 80 tokens, with the model learning a fixed output pattern of counting the items one by one.

---

The following sections cover two additional tasks: Geometric QA and the multimodal Open R1 dataset.

## **Geometric QA Task**

### **Task and Dataset Definition**

In this geometric QA task, the model is given a geometric figure and must answer mathematical questions about it. The original data comes from [this paper](https://arxiv.org/pdf/2312.11370), and [R1-V](https://github.com/Deep-Agent/R1-V.git) has already preprocessed it into a `problem-solution` format while keeping the images in the `image` field. We therefore do not need to redefine the dataset and can use `--dataset AI-ModelScope/GEOQA_R1V_Train_8K` directly.

---

### **Reward Function**

Since this is also a mathematical problem and the answers are already processed into final results, we directly reuse the `MultiModalAccuracyORM` reward function defined above.

---

### **GRPO Training Experiment Log**

#### **Training Parameters**

The model and most hyperparameters are the same as in the previous experiment, with two main differences:
1. **SWIFT now supports the `--num_iterations` parameter**, which allows multiple policy updates per rollout. We set it to 2.
2. During the experiments we found that training on mathematical problems can become unstable and the model may collapse: all rewards drop sharply, while the loss, `grad_norm`, and KL divergence rise rapidly without recovering. To prevent this, we set `--max_grad_norm 0.5` to keep training stable (see the short clipping sketch below). Note that whether this instability occurs is somewhat random.
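
For reference, `--max_grad_norm` controls standard global-norm gradient clipping. The toy sketch below illustrates the idea on plain floats; it is purely conceptual and not how DeepSpeed or SWIFT implement it internally.

```python
from typing import List


def clip_by_global_norm(grads: List[float], max_norm: float = 0.5) -> List[float]:
    """Conceptual global-norm clipping: rescale gradients whose norm exceeds max_norm."""
    global_norm = sum(g * g for g in grads) ** 0.5
    scale = min(1.0, max_norm / (global_norm + 1e-6))
    return [g * scale for g in grads]


# A gradient spike with norm 5.0 is rescaled to norm ~0.5, which dampens collapse-inducing updates.
print(clip_by_global_norm([3.0, 4.0], max_norm=0.5))  # -> [~0.3, ~0.4]
```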

```shell
WANDB_API_KEY=your_wandb_api_key \
MAX_PIXELS=401408 \
NPROC_PER_NODE=6 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--external_plugins examples/train/grpo/plugin/plugin.py \
--reward_funcs external_r1v_acc format \
--use_vllm true \
--vllm_device auto \
--vllm_gpu_memory_utilization 0.6 \
--train_type full \
--torch_dtype bfloat16 \
--dataset 'AI-ModelScope/GEOQA_R1V_Train_8K' \
--max_length 8192 \
--max_completion_length 1024 \
--num_train_epochs 1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 2 \
--save_strategy 'steps' \
--eval_strategy 'steps' \
--eval_steps 400 \
--save_steps 400 \
--save_total_limit 10 \
--logging_steps 1 \
--output_dir output/GRPO_GEOQA \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--num_generations 8 \
--temperature 1.0 \
--repetition_penalty 1.1 \
--system 'examples/train/grpo/prompt.txt' \
--deepspeed zero3 \
--log_completions true \
--report_to wandb \
--num_iterations 2 \
--num_infer_workers 2 \
--async_generate false \
--beta 0.001 \
--max_grad_norm 0.5
```

---

#### **Experimental Observations**

![image.png](../../resources/grpo_geoqa.png)

The training curve is shown above.

1. Compared to the **Count task**, the **Geometric QA task** is significantly harder, so the **Accuracy Reward** converges much more slowly and oscillates more.
2. The `completion_length` shows a clear trend and eventually stabilizes at around 250 tokens.
3. The reward converges, with the reward standard deviation stabilizing at around 0.2 after a total of 1300 steps.

Below is an example of the model's completion after training:

```plaintext
User:
"If AB is the tangent to circle O with the tangent point A, and BO intersects circle O at point C, and point D is located on circle O, what is the measure of angle ADC if the degree of angle ABO is 32.0?"
Assistant:
"<think>\nTo find the measure of angle ADC, we first need to understand the properties of the tangent and the circle. Since AB is a tangent to the circle at point A, the line segment OA (which is perpendicular to the tangent AB) is perpendicular to AB. This means that angle OAB is 90 degrees.\n\nGiven that angle ABO is 32 degrees, we can find angle BAO by subtracting angle ABO from 90 degrees:\n∠BAO = 90° - ∠ABO = 90° - 32° = 58°\n\nSince angle BAO is an angle in the minor arc AD, the angle ADC, which is an angle in the minor arc AD, is half the angle BAO. Therefore, angle ADC is:\n∠ADC = 1/2 × ∠BAO = 1/2 × 58° = 29°\n\nSo, the measure of angle ADC is 29 degrees.\n</think>\n<answer>\nThe measure of angle ADC is 29 degrees.\n</answer>"
```

---

## **Multimodal Open R1 Dataset Experiment**

### **Task and Dataset Definition**

This task is based on the experiments in [open-r1-multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal.git) and uses the [lmms-lab/multimodal-open-r1-8k-verified](https://www.modelscope.cn/datasets/lmms-lab/multimodal-open-r1-8k-verified) dataset. It focuses on multimodal mathematical reasoning, with data generated by GPT-4o from the `Math360K` and `Geo170K` datasets, including reasoning paths and verifiable answers. The dataset already contains `image`, `problem`, and `solution` fields, so no additional prompt modifications are required and there is no need to redefine the dataset.

---

### **Reward Function**

We directly reuse the previously defined `MultiModalAccuracyORM` reward function.

---

### **GRPO Training Experiment Log**

#### **Training Parameters**

The model and most hyperparameters are the same as in the previous experiment. Due to an **OOM (out of memory)** issue, we set `MAX_PIXELS=262144` (i.e. 512×512) to reduce memory usage.

```shell
WANDB_API_KEY=your_wandb_api_key \
MAX_PIXELS=262144 \
MASTER_PORT=29600 \
NPROC_PER_NODE=6 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--external_plugins examples/train/grpo/plugin/plugin.py \
--reward_funcs external_r1v_acc format \
--use_vllm true \
--vllm_device auto \
--vllm_gpu_memory_utilization 0.6 \
--train_type full \
--torch_dtype bfloat16 \
--dataset 'lmms-lab/multimodal-open-r1-8k-verified' \
--max_length 8192 \
--max_completion_length 1024 \
--num_train_epochs 1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 2 \
--save_strategy 'steps' \
--eval_strategy 'steps' \
--eval_steps 400 \
--save_steps 400 \
--save_total_limit 10 \
--logging_steps 1 \
--output_dir output/GRPO_GEOQA \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--num_generations 8 \
--temperature 1.0 \
--repetition_penalty 1.1 \
--system 'examples/train/grpo/prompt.txt' \
--deepspeed zero3 \
--log_completions true \
--report_to wandb \
--num_iterations 2 \
--num_infer_workers 2 \
--async_generate false \
--beta 0.001 \
--max_grad_norm 0.5
```

---

#### **Experimental Observations**

![image.png](../../resources/grpo_openr1_multimodal.png)

The training curve is shown above.

- With a total of 8k samples, training ran for **1268 steps**, with the **Accuracy Reward** converging to around 0.5.
- The `completion_length` stabilized around 200 tokens.
- The `reward_std` eventually oscillated around 0.2.

Below is an example of the model's completion after training:

```plaintext
User:
"Based on the map, which state falls into the lowest percentage range of lots, and what is that percentage range?"
Assistant:
"<think>\nThe image provided shows a color-coded map of the USA indicating different ranges of lot percentages across various states. According to the legend at the bottom, the lowest percentage range (6.0% - 6.5%) is represented by white. In the image, Alabama (AL) is shaded in white, which corresponds to the 6.0% - 6.5% category. Therefore, based on the map, the state that falls into the lowest percentage range of lots is Alabama, with the percentage range of 6.0% - 6.5%.\nTherefore, the answer is 6.0% - 6.5%.\n</think>\n<answer>Alabama</answer>"
```
