
Conversation

ywang96
Member

@ywang96 ywang96 commented Sep 11, 2025

Purpose

This PR adds support for rednote-hilab/dots.ocr. This model is currently supported via OOT registration, but we might as well bring it into vLLM so that users don't need to set it up with additional steps.

Most of the code is taken from the model repo, but this PR also cleans up some logic that is no longer needed since this model does not support the video modality.

FIXES #24581

Co-authored-by @yinz-aizip

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Roger Wang <[email protected]>
@mergify mergify bot added the documentation (Improvements or additions to documentation) and new-model (Requests to new models) labels Sep 11, 2025
Signed-off-by: Roger Wang <[email protected]>
@casper-hansen
Contributor

Thanks for taking a stab at upstreaming this @ywang96. We need more and better OCR models, and this would be a great step forward.

@ywang96
Member Author

ywang96 commented Sep 13, 2025

Thanks for taking a stab at upstreaming this @ywang96. We need more and better OCR models, and this would be a great step forward.

@casper-hansen No problem. BTW there are still some performance issues I need to debug, but correctness-wise this branch should actually be ready to go with vllm serve rednote-hilab/dots.ocr --trust-remote-code.
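
For anyone who wants to try the branch, here is a minimal sketch of querying the server once it is up, assuming the standard OpenAI-compatible chat endpoint that vllm serve exposes; the port, image URL, and prompt text are placeholders, not taken from this PR:

from openai import OpenAI

# Assumes `vllm serve rednote-hilab/dots.ocr --trust-remote-code` is running
# on the default port; the API key is unused by a local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="rednote-hilab/dots.ocr",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample_page.png"}},  # placeholder image
            {"type": "text", "text": "Extract the text from this document image."},
        ],
    }],
)
print(response.choices[0].message.content)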

@yinz-aizip yinz-aizip mentioned this pull request Sep 14, 2025
@ywang96
Member Author

ywang96 commented Sep 17, 2025

@yinz-aizip I made some changes to use vLLM internal layers - could you help verify the correctness of this implementation (similar to what you did in your PR)? Thanks!

@ywang96 ywang96 marked this pull request as ready for review September 17, 2025 10:21
@ywang96 ywang96 requested a review from hmellor as a code owner September 17, 2025 10:21
@yinz-aizip
Contributor

Summary

This report compares performance and evaluation metrics between two commits of this PR:
• Commit 96cc9bc
• Commit c830194

The model was served with the following configuration:

hf_model_path='/path/to/dots.ocr'
export CUDA_VISIBLE_DEVICES=4,5,6,7
vllm serve $hf_model_path \
    --host 127.0.0.1 \
    --port 8126 \
    --data-parallel-size 4 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --chat-template-content-format string \
    --served-model-name model \
    --trust-remote-code

Efficiency

Throughput was estimated by running 1,000 concurrent requests on a single image (higher is better):
• Commit 96cc9bc... → 3980.84
• Commit c830194... → 3743.34

The newer commit is slightly slower, though the difference is relatively minor.
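
For reference, the sketch below shows what such a concurrency benchmark could look like, using the AsyncOpenAI client against the server configuration above; this is an illustrative reconstruction, not the exact script behind the numbers, and the image URL and prompt are placeholders:

import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://127.0.0.1:8126/v1", api_key="EMPTY")

async def one_request(image_url: str) -> None:
    # One chat completion carrying a single image.
    await client.chat.completions.create(
        model="model",  # matches --served-model-name above
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "OCR this page."},
            ],
        }],
    )

async def main() -> None:
    image_url = "https://example.com/page.png"  # placeholder single image
    start = time.perf_counter()
    # Fire 1,000 requests concurrently, as described above.
    await asyncio.gather(*(one_request(image_url) for _ in range(1000)))
    elapsed = time.perf_counter() - start
    print(f"{1000 / elapsed:.2f} req/s over {elapsed:.1f}s")

asyncio.run(main())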

Effectiveness

Evaluated on OmniDocBench (lower is better):
• Commit 96cc9bc...
  • overall_EN: 0.12578
  • overall_CH: 0.16556
• Commit c830194...
  • overall_EN: 0.12412
  • overall_CH: 0.16285

The newer commit shows a small improvement in accuracy across both English and Chinese benchmarks.

Conclusion

• Efficiency: slightly reduced in the newer commit.
• Effectiveness: marginally improved results on OmniDocBench.

Overall, the trade-off seems acceptable, with minor throughput loss balanced by better benchmark performance.

@casper-hansen
Contributor

@ywang96 I tested this PR:

  • Correctness issues fixed (no more endless generations)
  • Performance issues also fixed, now scales with number of concurrent images.

1x H100 concurrency benchmark:

  • 1.81 images/second for 16 images.
  • 2.36 images/second for 30 images.

@casper-hansen
Contributor

One potential performance issue: when I pass in 30 images, I see the message below 30 times before the images are actually processed. This seems to add significant latency for this model.

Fetching 1 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7869.24it/s]

@ywang96
Member Author

ywang96 commented Sep 18, 2025

One potential performance issue: when I pass in 30 images, I see the message below 30 times before the images are actually processed. This seems to add significant latency for this model.

Fetching 1 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7869.24it/s]

Hmm - FYI @DarkLight1337, I'm wondering if this has something to do with an out-of-tree processor (i.e., one with trust_remote_code=True) inheriting from an HF one?

"ChameleonForConditionalGeneration": ("chameleon", "ChameleonForConditionalGeneration"), # noqa: E501
"Cohere2VisionForConditionalGeneration": ("cohere2_vision", "Cohere2VisionForConditionalGeneration"), # noqa: E501
"DeepseekVLV2ForCausalLM": ("deepseek_vl2", "DeepseekVLV2ForCausalLM"),
"DotsOCRForCausalLM": ("dots_ocr", "DotsOCRForCausalLM"),

def forward(self, x: torch.Tensor) -> torch.Tensor:
x1, _ = self.fc1(x)
x3, _ = self.fc3(x)
x = F.silu(x1) * x3
Collaborator

Can we use MergedColumnParallelLinear here?
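
A sketch of what that could look like, assuming the fc1/fc3 pair above implements a SwiGLU gate/up projection; the class and attribute names are illustrative, and the down projection is omitted:

import torch
import torch.nn as nn

from vllm.model_executor.layers.activation import SiluAndMul
from vllm.model_executor.layers.linear import MergedColumnParallelLinear

class DotsSwiGLUFFN(nn.Module):
    # Illustrative fused variant of the fc1/fc3 pair; fc2 (down proj) omitted.

    def __init__(self, dim: int, hidden_dim: int, bias: bool = True):
        super().__init__()
        # Fuses fc1 (gate) and fc3 (up) into a single column-parallel GEMM,
        # so one matmul replaces two and both weights shard together under TP.
        self.gate_up_proj = MergedColumnParallelLinear(
            input_size=dim,
            output_sizes=[hidden_dim] * 2,
            bias=bias,
        )
        self.act_fn = SiluAndMul()  # computes silu(gate) * up on the fused output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_up, _ = self.gate_up_proj(x)
        return self.act_fn(gate_up)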

num_heads, self.tp_size)

# qkv/proj follow Qwen2-VL style; bias controlled by arg
self.qkv = ColumnParallelLinear(input_size=dim,
Collaborator

Maybe we can use QKVParallelLinear.
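
For illustration, a sketch of what the QKVParallelLinear replacement could look like for this vision attention; the wrapper class is hypothetical and only shows the fused projection:

import torch
import torch.nn as nn

from vllm.model_executor.layers.linear import QKVParallelLinear

class VisionAttentionQKV(nn.Module):
    # Hypothetical wrapper showing only the fused QKV projection.

    def __init__(self, dim: int, num_heads: int, qkv_bias: bool = True):
        super().__init__()
        self.qkv = QKVParallelLinear(
            hidden_size=dim,
            head_size=dim // num_heads,
            total_num_heads=num_heads,
            bias=qkv_bias,
        )

    def forward(self, x: torch.Tensor):
        qkv, _ = self.qkv(x)
        # Plain MHA here (num_kv_heads == num_heads), so an even 3-way chunk
        # recovers q, k, v with heads already sharded per TP rank.
        return qkv.chunk(3, dim=-1)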

@jeejeelee
Collaborator

One potential performance issue: when I pass in 30 images, I see the message below 30 times before the images are actually processed. This seems to add significant latency for this model.

Fetching 1 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7869.24it/s]

Hmm - FYI @DarkLight1337, I'm wondering if this has something to do with an out-of-tree processor (i.e., one with trust_remote_code=True) inheriting from an HF one?

I tested it with vision_language.py and didn't see similar output, which is a bit strange.

@casper-hansen
Contributor

@ywang96 @yinz-aizip is it possible to avoid --trust-remote-code? I believe this is the root cause of the high latency.

@ywang96
Member Author

ywang96 commented Sep 18, 2025

I believe this is the root cause of the high latency.

@casper-hansen Yeah, I think so too. But I don't think it should be removed; instead, we should cache the object that fetches the remote file (instead of fetching it over and over). This should not happen, and I need to debug why it does 😅
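
To sketch the caching idea (not the actual fix, which landed separately), something like memoizing the processor load would make the hub fetch happen once per process; the helper name here is hypothetical:

from functools import lru_cache

from transformers import AutoProcessor

@lru_cache(maxsize=None)
def get_cached_processor(model_path: str, trust_remote_code: bool = True):
    # The first call downloads and loads the remote processor code; repeated
    # calls with the same arguments return the already-constructed object.
    return AutoProcessor.from_pretrained(
        model_path, trust_remote_code=trust_remote_code)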

@DarkLight1337
Member

The repeated loading issue should be fixed by #25341

@ywang96
Member Author

ywang96 commented Sep 21, 2025

The correctness of this PR has been verified by our contact on the rednote engineering team, so I'm just going to add the ready label to it.

@ywang96 ywang96 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Sep 21, 2025

@jeejeelee jeejeelee left a comment


Overall LGTM; some improvements can be completed in subsequent PRs.

@ywang96 ywang96 enabled auto-merge (squash) September 21, 2025 19:40
Signed-off-by: Roger Wang <[email protected]>
@ywang96 ywang96 merged commit 7b57a43 into vllm-project:main Sep 22, 2025
49 checks passed
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Sep 22, 2025
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: yinz-aizip <[email protected]>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: yinz-aizip <[email protected]>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: yinz-aizip <[email protected]>
Signed-off-by: charlifu <[email protected]>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: yinz-aizip <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: yinz-aizip <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: yinz-aizip <[email protected]>