[Model] Support Dots OCR #24645
Conversation
Thanks for taking a stab at upstreaming this @ywang96. We need more and better OCR models, and this would be a great step forward.
@casper-hansen No problem. BTW there are still some performance issues I need to debug, but correctness-wise, this branch should actually be ready to go.
@yinz-aizip I made some changes to use vLLM internal layers - could you help verify the correctness of this implementation (similarly to what you did in your PR)? Thanks!
Summary

This PR compares performance and evaluation metrics between two commits. The model was served with the following configuration:

```bash
hf_model_path='/path/to/dots.ocr'
export CUDA_VISIBLE_DEVICES=4,5,6,7
vllm serve $hf_model_path \
    --host 127.0.0.1 \
    --port 8126 \
    --data-parallel-size 4 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --chat-template-content-format string \
    --served-model-name model \
    --trust-remote-code
```

Efficiency

Throughput was estimated by running 1,000 concurrent requests on a single image. The newer commit is slightly slower, though the difference is relatively minor.

Effectiveness

Evaluated on OmniDocBench (lower is better), the newer commit shows a small improvement in accuracy across both English and Chinese benchmarks.

Conclusion

Overall, the trade-off seems acceptable, with minor throughput loss balanced by better benchmark performance.
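For reference, a minimal sketch (not from the PR) of how such a concurrency test could be driven against the server configured above. The base URL and served model name are taken from the serve command; the image path and prompt are hypothetical:

```python
# Sketch: fire N concurrent OCR requests at the OpenAI-compatible endpoint
# started by the `vllm serve` command above. The image file is hypothetical.
import asyncio
import base64

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://127.0.0.1:8126/v1", api_key="EMPTY")


def image_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


async def ocr_request(data_url: str) -> str:
    resp = await client.chat.completions.create(
        model="model",  # matches --served-model-name above
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": "Extract the text from this image."},
            ],
        }],
    )
    return resp.choices[0].message.content


async def main() -> None:
    data_url = image_data_url("sample_page.png")  # hypothetical test image
    results = await asyncio.gather(*(ocr_request(data_url) for _ in range(1000)))
    print(f"completed {len(results)} requests")


asyncio.run(main())
```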
@ywang96 I tested this PR with a 1x H100 concurrency benchmark.
One potential performance issue: when I pass in 30 images, I see the message below 30 times before the images are actually processed. This seems to add significant latency for this model.
Hmm - FYI @DarkLight1337, I'm wondering if this has something to do with it being an out-of-tree processor (i.e., one loaded with `trust_remote_code`).
"ChameleonForConditionalGeneration": ("chameleon", "ChameleonForConditionalGeneration"), # noqa: E501 | ||
"Cohere2VisionForConditionalGeneration": ("cohere2_vision", "Cohere2VisionForConditionalGeneration"), # noqa: E501 | ||
"DeepseekVLV2ForCausalLM": ("deepseek_vl2", "DeepseekVLV2ForCausalLM"), | ||
"DotsOCRForCausalLM": ("dots_ocr", "DotsOCRForCausalLM"), |
We also need to add this model in https://github.com/vllm-project/vllm/blob/main/tests/models/registry.py
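For illustration, a sketch of what that registry entry could look like; the exact `_HfExamplesInfo` arguments are an assumption based on how other remote-code models are listed in that file:

```python
# tests/models/registry.py -- hypothetical entry; the kwargs are an
# assumption mirroring neighboring trust-remote-code models.
"DotsOCRForCausalLM": _HfExamplesInfo("rednote-hilab/dots.ocr",
                                      trust_remote_code=True),
```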
```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    x1, _ = self.fc1(x)
    x3, _ = self.fc3(x)
    x = F.silu(x1) * x3
```
Can we use MergedColumnParallelLinear here?
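For reference, a minimal sketch of the fused variant, following the pattern other vLLM MLP blocks use; the class and argument names here are illustrative, not the PR's actual code:

```python
# Sketch: fuse fc1/fc3 into one MergedColumnParallelLinear so the gate and
# up projections run as a single GEMM; SiluAndMul then splits the fused
# output and computes silu(x1) * x3, matching the forward pass above.
import torch
from vllm.model_executor.layers.activation import SiluAndMul
from vllm.model_executor.layers.linear import (MergedColumnParallelLinear,
                                               RowParallelLinear)


class DotsMLP(torch.nn.Module):

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_up_proj = MergedColumnParallelLinear(
            hidden_size, [intermediate_size] * 2, bias=False)
        self.down_proj = RowParallelLinear(intermediate_size,
                                           hidden_size,
                                           bias=False)
        self.act_fn = SiluAndMul()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_up, _ = self.gate_up_proj(x)
        x = self.act_fn(gate_up)
        x, _ = self.down_proj(x)
        return x
```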
```python
num_heads, self.tp_size)

# qkv/proj follow Qwen2-VL style; bias controlled by arg
self.qkv = ColumnParallelLinear(input_size=dim,
```
Maybe we can use QKVParallelLinear.
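A sketch of what that swap could look like, continuing the snippet above; the constructor arguments are an assumption based on `QKVParallelLinear`'s signature:

```python
# Sketch: replace the plain ColumnParallelLinear with QKVParallelLinear,
# which shards attention heads across tensor-parallel ranks for the fused
# q/k/v GEMM. `dim`, `num_heads`, and `qkv_bias` come from the snippet above.
from vllm.model_executor.layers.linear import QKVParallelLinear

self.qkv = QKVParallelLinear(
    hidden_size=dim,
    head_size=dim // num_heads,
    total_num_heads=num_heads,
    bias=qkv_bias,
)
```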
I tested it.
@ywang96 @yinz-aizip Is it possible to avoid --trust-remote-code? I believe this is the root cause of the high latency.
@casper-hansen Yeah, I think so too. But rather than removing it, we should cache the object that fetches the remote file instead of fetching it over and over. This should not happen and I need to debug why it does 😅
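A minimal sketch of that caching idea; the loader function here is hypothetical, not vLLM's actual code path:

```python
# Sketch: memoize the remote processor load so that 30 images trigger one
# fetch instead of 30. The function name and arguments are hypothetical.
from functools import lru_cache

from transformers import AutoProcessor


@lru_cache(maxsize=None)
def get_hf_processor(model_path: str, trust_remote_code: bool = True):
    # The first call downloads/loads the remote processor code; later calls
    # with the same arguments return the cached instance.
    return AutoProcessor.from_pretrained(
        model_path, trust_remote_code=trust_remote_code)
```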
The repeated loading issue should be fixed by #25341.
The correctness of this PR has been verified by our contact from the rednote engineering team, so I'm just going to turn on
Overall LGTM; some improvements can be completed in subsequent PRs.
Purpose
This PR adds support for rednote-hilab/dots.ocr. This model is currently supported via OOT registration, but we might as well bring it into vLLM so that users don't need to set it up with additional steps.
Most of the code is taken from the model repo, but this PR also cleans up some logic that is no longer needed since this model does not support the video modality.
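For context, a sketch of the out-of-tree (OOT) registration this PR makes unnecessary; the import path for the model class is an assumption based on the dots.ocr repo layout:

```python
# Sketch of the manual OOT setup users needed before this PR: registering
# the model class with vLLM's ModelRegistry themselves. The module path
# `dots_ocr.modeling_dots_ocr_vllm` is an assumption.
from vllm import ModelRegistry


def register() -> None:
    from dots_ocr.modeling_dots_ocr_vllm import DotsOCRForCausalLM
    ModelRegistry.register_model("DotsOCRForCausalLM", DotsOCRForCausalLM)
```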
FIXES #24581
Co-authored-by @yinz-aizip
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.