Description
Hi guys,
I am using GRPO colocate mode to train Qwen2.5-VL 7B. During training I observed a continuous increase in system memory usage, so I suspect some kind of memory leak is happening here.
(In the monitoring screenshot, the GREEN line is system memory usage.)
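To make the growth easier to correlate with training steps, here is a minimal host-memory logging sketch of my own (not part of ms-swift; it assumes `psutil` is installed) that can run alongside the training job:

```python
# Minimal host-memory logger (assumption: psutil is installed; not part of ms-swift).
import os
import time

import psutil


def log_host_memory(tag: str = "") -> None:
    """Print system-wide used memory and the current process RSS in GiB."""
    vm = psutil.virtual_memory()
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"[mem]{tag} system used = {vm.used / 2**30:.2f} GiB, "
          f"process RSS = {rss / 2**30:.2f} GiB", flush=True)


if __name__ == "__main__":
    # Example: sample every 60 seconds while training runs in another process.
    while True:
        log_host_memory()
        time.sleep(60)
```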
Here is my shell command:
```shell
NPROC_PER_NODE=8 \
MAX_PIXELS=1280000 \
swift rlhf \
    --rlhf_type grpo \
    --model /mnt2/models/Qwen2.5_VL_7B_Instruct \
    --train_type full \
    --dataset /ossfs/workspace/xxx.json \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --max_length 4096 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --eval_steps 1000 \
    --save_steps 1000 \
    --eval_strategy 'no' \
    --learning_rate 1e-6 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --output_dir /mnt2/user//outputs/ \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_completion_length 1024 \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_ui_acc uiformat \
    --num_generations 16 \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.2 \
    --vllm_max_model_len 5120 \
    --deepspeed zero3_offload \
    --temperature 1.1 \
    --log_completions true \
    --num_infer_workers 8 \
    --tensor_parallel_size 4 \
    --async_generate false \
    --sleep_level 1 \
    --report_to swanlab
```
Here are the related library versions:
```text
vllm          0.7.3
trl           0.16.0.dev0
transformers  4.49.0
torch         2.5.1+cu121
```
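(For anyone trying to reproduce, a quick way to dump the same version list, assuming these packages are installed in the active environment:)

```python
# Print the installed versions of the relevant packages (assumes they are installed).
from importlib.metadata import PackageNotFoundError, version

for pkg in ("vllm", "trl", "transformers", "torch"):
    try:
        print(f"{pkg:<14}{version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg:<14}not installed")
```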
By the way, I also found that someone encountered system memory leak issues in another open-source project when using vllm==0.7.3, so I suspect something is going wrong in this specific version of vLLM:
hiyouga/EasyR1#50
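To check whether the growth comes from the vLLM engine / inference workers rather than the trainer itself, a per-process RSS breakdown of the whole process tree might help. This is only a diagnostic sketch of my own (assumes `psutil`; it takes the PID of the main `swift rlhf` process as an argument, and the listed child processes will differ per setup):

```python
# Dump RSS for the training process and all of its children
# (e.g. vLLM / DeepSpeed / dataloader workers) to see which one keeps growing.
# Assumption: pass the PID of the main `swift rlhf` process; requires psutil.
import sys

import psutil


def dump_process_tree_rss(root_pid: int) -> None:
    root = psutil.Process(root_pid)
    for proc in [root] + root.children(recursive=True):
        try:
            rss_gib = proc.memory_info().rss / 2**30
            cmd = " ".join(proc.cmdline()[:3])
            print(f"pid={proc.pid:<8} rss={rss_gib:6.2f} GiB  cmd={cmd}")
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue


if __name__ == "__main__":
    dump_process_tree_rss(int(sys.argv[1]))
```

Running this a few times over the course of training should show whether the leaked memory is attributed to the main trainer process or to one of its workers.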