Implement block copy kernel to optimize beam search #32

WoosukKwon · 2023-04-07T21:01:12Z

This PR implements a block copy kernel. Using this kernel reduces the number of kernel invocations needed to copy blocks from `2 * num_layers * num_copying_blocks` to 1.
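The launch-count arithmetic above can be illustrated with a small Python sketch. It is not the PR's CUDA code: plain lists stand in for the per-layer cache tensors, a counter stands in for kernel launches, and every name (`copy_blocks_naive`, `copy_blocks_batched`, `make_caches`, `block_mapping`) is hypothetical. It also assumes the factor of 2 comes from copying a key cache and a value cache per layer, which the description does not state explicitly.

```python
# Illustrative sketch only: Python lists emulate GPU cache blocks and an
# integer counter emulates kernel launches. All names are hypothetical.

def copy_blocks_naive(key_caches, value_caches, block_mapping):
    """Issue one copy per (layer, key-or-value cache, block pair)."""
    launches = 0
    for key_cache, value_cache in zip(key_caches, value_caches):
        for src, dst in block_mapping:
            key_cache[dst] = list(key_cache[src])
            launches += 1
            value_cache[dst] = list(value_cache[src])
            launches += 1
    return launches


def copy_blocks_batched(key_caches, value_caches, block_mapping):
    """Handle every layer and every block pair in a single call.

    On a GPU, the loops below would be spread across thread blocks inside
    one kernel launch instead of running sequentially.
    """
    for cache in list(key_caches) + list(value_caches):
        for src, dst in block_mapping:
            cache[dst] = list(cache[src])
    return 1


def make_caches(num_layers, num_blocks, block_size):
    # Block b of layer l is filled with the marker value l * 100 + b.
    return [
        [[layer * 100 + block] * block_size for block in range(num_blocks)]
        for layer in range(num_layers)
    ]


num_layers, num_blocks, block_size = 4, 8, 16
block_mapping = [(0, 5), (1, 6), (2, 7)]  # (source block, destination block)

k1 = make_caches(num_layers, num_blocks, block_size)
v1 = make_caches(num_layers, num_blocks, block_size)
k2 = make_caches(num_layers, num_blocks, block_size)
v2 = make_caches(num_layers, num_blocks, block_size)

naive_launches = copy_blocks_naive(k1, v1, block_mapping)
batched_launches = copy_blocks_batched(k2, v2, block_mapping)

print(naive_launches, batched_launches)
```

Both functions leave the caches in the same final state; only the number of simulated launches differs, which is the quantity the description refers to.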

* add gaudi installation readme
* readme writeup
* Create README_GAUDI.md
* Update README.md
* Update README_GAUDI.md
* Update README.md
* Update readmes

Adds support for multi-lora adapters. Passing tests added over in this PR: https://github.ibm.com/ai-foundation/tgis-deploy-tests/pull/25/files

Signed-off-by: Joe Runde <[email protected]>

A warning will be printed out if this case is triggered:

```
WARNING 02-20 22:21:27 sparse_w16a16.py:32] Unstructured sparse kernels are not optimized for NVIDIA SM < 8.0. Naive decompress kernels will be used and can be slower than dense models
```

Works on a T4 with:

```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/opt-125m-pruned2.4",
    sparsity="sparse_w16a16",
    enforce_eager=True,
    dtype="float16",
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
outputs[0].outputs[0].text
```

Test within colab: https://colab.research.google.com/drive/15xRvWX5gNaTb00BcaXhxwMm6yxavIKGN?usp=sharing

WoosukKwon added 4 commits April 7, 2023 20:57

Replace cudamemcpy with custom copy kernel

264b698

Add test

cdc3ded

Minor optimization for beam search

73aa5ab

Add sampling-related arguments to latency benchmark

b84108e

WoosukKwon merged commit 0f40557 into main Apr 8, 2023

WoosukKwon deleted the copy branch April 8, 2023 00:45

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Implement block copy kernel to optimize beam search (vllm-project#32)

c3c6d98

slyalin pushed a commit to slyalin/vllm that referenced this pull request May 6, 2024

Merge pull request vllm-project#32 from luo-cheng2021/luocheng/pa-kv-…

2e5648a

…u8-desc [CPU] Add comment for u8 kvcache layout

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request Jun 12, 2024

Merge pull request vllm-project#32 from ROCm/gshtras-patch-1

69ce080

Update linear.py

ZHJ19970917 mentioned this pull request Jul 14, 2024

[Bug]: When using qwen-32b-chat-awq with multi-threaded access, errors occur after approximately several hundred visits.”vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.“ #6421

Closed

bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Jul 31, 2024

Merge pull request vllm-project#32 from intel-sandbox/jianan/enable_c…

0415460

…ache_weight Enable AWQ to use int8 weight first token as default

alixiaodi mentioned this pull request Aug 2, 2024

[Bug]: #7072

Closed

sgsdxzy mentioned this pull request May 4, 2025

[Bug]: Unable to run Qwen3 on Turing GPUs after upgrading to torch 2.7.0 #17639

Open

1 task

hao-cold mentioned this pull request May 13, 2025

[Bug]: CUDA error: an illegal instruction was encountered #18045

Closed

1 task

markmc mentioned this pull request May 21, 2025

[Bug][Failing Test]: Distributed Comm Ops - distributed/test_shm_broadcast.py #18492

Closed

1 task

zerosurplus mentioned this pull request Jun 16, 2025

[Bug]: torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (172.17.0.9, 46229). #19670

Open

1 task

tlrmchlsmth mentioned this pull request Jul 10, 2025

[Model] New model support for microsoft/Phi-4-mini-flash-reasoning #20702

Merged

4 tasks

xiaomofang mentioned this pull request Jul 31, 2025

[Bug]: There is an issue with speculative inference in Eagle mode, where the context length of vLLM inference is constrained by the draft model. #21986

Open

1 task

zyongye pushed a commit to zyongye/vllm that referenced this pull request Aug 5, 2025

Fix truncapted output for Responses API (vllm-project#32)

a99f3c1

zyongye pushed a commit to zyongye/vllm that referenced this pull request Aug 6, 2025

Fix truncapted output for Responses API (vllm-project#32)

436be97
