
Conversation

WoosukKwon
Collaborator

This PR implements a block copy kernel. By using this kernel, we can reduce the number of kernel invocations from 2 * num_layers * num_copying_blocks to 1.
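For illustration, here is a minimal PyTorch sketch of the idea (illustrative shapes and names only, not the actual kernel, which is written in CUDA):

```python
import torch

# Assumed paged KV cache layout: [num_blocks, block_size, head_dim] per layer.
num_layers, num_blocks, block_size, head_dim = 4, 16, 8, 64
key_caches = [torch.randn(num_blocks, block_size, head_dim) for _ in range(num_layers)]
value_caches = [torch.randn(num_blocks, block_size, head_dim) for _ in range(num_layers)]

# Blocks to copy (e.g. copy-on-write for beam search): src block -> dst block.
block_mapping = {0: 5, 1: 6, 2: 7}

# Before: one copy per cache per (src, dst) pair, i.e.
# 2 * num_layers * num_copying_blocks kernel launches
# (the factor of 2 covers the separate key and value caches).
for key_cache, value_cache in zip(key_caches, value_caches):
    for src, dst in block_mapping.items():
        key_cache[dst].copy_(key_cache[src])
        value_cache[dst].copy_(value_cache[src])

# After: batch the (src, dst) pairs into index tensors so each copy becomes
# a single indexed operation; the PR's CUDA kernel goes further and handles
# all layers and both caches in one launch.
src = torch.tensor(list(block_mapping.keys()))
dst = torch.tensor(list(block_mapping.values()))
for key_cache, value_cache in zip(key_caches, value_caches):
    key_cache[dst] = key_cache[src]
    value_cache[dst] = value_cache[src]
```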

@WoosukKwon merged commit 0f40557 into main Apr 8, 2023
@WoosukKwon deleted the copy branch April 8, 2023 00:45
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
slyalin pushed a commit to slyalin/vllm that referenced this pull request May 6, 2024
…u8-desc

[CPU] Add comment for u8 kvcache layout
tianyil1 pushed a commit to tianyil1/vllm that referenced this pull request Jun 5, 2024
* add gaudi installation readme

* readme writeup

* Create README_GAUDI.md

* Update README.md

* Update README_GAUDI.md

* Update README.md

* Update readmes
fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request Jun 12, 2024
joerunde added a commit to joerunde/vllm that referenced this pull request Jun 17, 2024
Adds support for multi-LoRA adapters; a usage sketch follows the commit message below.

Passing tests were added in this PR:
https://github.ibm.com/ai-foundation/tgis-deploy-tests/pull/25/files

---------

Signed-off-by: Joe Runde <[email protected]>
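For context, a minimal sketch of multi-LoRA usage against the upstream vLLM interface (`enable_lora` and `LoRARequest`); the base model, adapter names, and paths are hypothetical, and the fork this commit targets may expose a different API:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Hypothetical base model and adapter paths.
llm = LLM("meta-llama/Llama-2-7b-hf", enable_lora=True)
params = SamplingParams(max_tokens=64, temperature=0)

# Each request can name its own adapter; the engine batches across adapters.
sql_outputs = llm.generate(
    "Translate to SQL: list all users",
    params,
    lora_request=LoRARequest("sql-lora", 1, "/path/to/sql-lora"),
)
chat_outputs = llm.generate(
    "Hello, how are you?",
    params,
    lora_request=LoRARequest("chat-lora", 2, "/path/to/chat-lora"),
)
```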
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
A warning will be printed out if this case is triggered:
```
WARNING 02-20 22:21:27 sparse_w16a16.py:32] Unstructured sparse kernels are not optimized for NVIDIA SM < 8.0. Naive decompress kernels will be used and can be slower than dense models
```
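A minimal sketch of the kind of capability check behind this warning (a hypothetical helper, not vLLM's actual code; `torch.cuda.get_device_capability` is the real PyTorch call):

```python
import logging
import torch

logger = logging.getLogger("sparse_w16a16")

def sparse_kernels_supported() -> bool:
    # Optimized unstructured-sparse kernels need Ampere (SM 8.0) or newer;
    # e.g. a T4 is SM 7.5 and falls back to naive decompression.
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (8, 0):
        logger.warning(
            "Unstructured sparse kernels are not optimized for NVIDIA SM < 8.0. "
            "Naive decompress kernels will be used and can be slower than dense models"
        )
        return False
    return True
```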

Works on a T4 with:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/opt-125m-pruned2.4", 
    sparsity="sparse_w16a16",
    enforce_eager=True,
    dtype="float16",
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```

Test within Colab:
https://colab.research.google.com/drive/15xRvWX5gNaTb00BcaXhxwMm6yxavIKGN?usp=sharing
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Jul 31, 2024
…ache_weight

Enable AWQ to use int8 weights for the first token by default
@alixiaodi mentioned this pull request Aug 2, 2024
zyongye pushed a commit to zyongye/vllm that referenced this pull request Aug 5, 2025
zyongye pushed a commit to zyongye/vllm that referenced this pull request Aug 6, 2025