[core][optimization] use a pool of numpy ndarray to hold seq data #5942
Conversation
        output_token_ids: Optional[List[int]] = None,
    ) -> None:
        self.tokens = _SEQUENCE_DATA_POOL.alloc_array()
        self.prompt_token_ids_list = prompt_token_ids
Is there any opportunity to get rid of this list (and the output token ids list)? It completely duplicates the numpy arrays, and we should avoid that if possible.
I want to delete it, too. However, sometimes we need the prompt token ids as a list of ints because that is what users expect. If we don't store it here, we would have to create a copy from the numpy array, which is expensive.
Fortunately, this is just a reference, so performance-wise it is fine.
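To make the trade-off concrete, here is a minimal sketch (illustrative names, not the PR's actual code) of why returning the stored list reference is cheap while rebuilding it from the ndarray is not:

import numpy as np

prompt_token_ids = [1, 5, 42, 7]                 # list produced upstream by the tokenizer
tokens = np.zeros(128 * 1024, dtype=np.int64)    # pooled backing array (assumed size/dtype)
tokens[:len(prompt_token_ids)] = prompt_token_ids

def get_prompt_token_ids_ref() -> list:
    # O(1): hands back the reference stored at construction time.
    return prompt_token_ids

def get_prompt_token_ids_copy() -> list:
    # Allocates a new list and converts every element from int64 to Python int,
    # which is the copy cost called expensive above.
    return tokens[:len(prompt_token_ids)].tolist()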
I searched the code base and it seems like only batch expansion uses get_prompt_token_ids() and get_output_token_ids(), so it should be possible, as batch expansion is going to be removed by @LiuXiaoxuanPKU.
good to know.
def append_token_id(self, token_id: int, logprob: float) -> None:
    self.output_token_ids.append(token_id)
    self.tokens[self.num_prompt_tokens + self.num_output_tokens] = token_id
Ideally we should have an assertion to check the boundary, even though 128k should always be sufficient at the moment. Let's add an assert if it doesn't hurt performance; otherwise we could just comment that we assume the context length won't go beyond 128k.
I think numpy array indexing already has a bounds check.
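For what it's worth, out-of-bounds scalar assignment on an ndarray does raise an IndexError, so an overflow here would fail loudly rather than silently (a standalone illustration, not code from this PR):

import numpy as np

tokens = np.zeros(4, dtype=np.int64)
tokens[3] = 7       # last valid index, fine
try:
    tokens[4] = 9   # one past the end
except IndexError as exc:
    print(exc)      # index 4 is out of bounds for axis 0 with size 4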
I want to somehow know the max seq length in seqdata, but don't know how to pass that information across so many levels.
Setting a fixed length makes sense to me for simplicity's sake. Hmm, maybe it's ok to keep the current implementation then. If someone really hits the boundary and sees the numpy error, we would know what's going on...
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
The remaining part of #5877 after separating #5882 out.
The same benchmark command:
python benchmarks/benchmark_throughput.py --output-len 256 --input 256 --model meta-llama/Llama-2-7b-hf -tp 8
The same machine: 8× H100
Before (current main): Throughput: 38.89 requests/s, 19909.29 tokens/s
After (this PR): Throughput: 40.12 requests/s, 20541.11 tokens/s — roughly a 3% throughput improvement (40.12 / 38.89 ≈ 1.03).
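For readers who haven't opened the diff, below is a minimal sketch of the kind of ndarray pool this PR relies on. _SEQUENCE_DATA_POOL and alloc_array() appear in the diff hunks above; the internals and the 128k capacity are assumptions drawn from the review discussion, not the actual implementation:

from typing import List

import numpy as np

# Assumed per-sequence capacity; the review thread above discusses a fixed 128k bound.
_MAX_SEQ_LEN = 128 * 1024

class _NdArrayPool:
    """Reuses fixed-size int64 arrays instead of allocating a fresh one per sequence."""

    def __init__(self) -> None:
        self._free: List[np.ndarray] = []

    def alloc_array(self) -> np.ndarray:
        # Hand out a recycled array if one is available, otherwise allocate a new one.
        if self._free:
            return self._free.pop()
        return np.zeros(_MAX_SEQ_LEN, dtype=np.int64)

    def free_array(self, arr: np.ndarray) -> None:
        # Return the array to the pool so a later sequence can reuse it.
        self._free.append(arr)

_SEQUENCE_DATA_POOL = _NdArrayPool()

# Usage mirroring the diff: a sequence grabs a pooled array to hold its token ids.
tokens = _SEQUENCE_DATA_POOL.alloc_array()
tokens[0] = 42
_SEQUENCE_DATA_POOL.free_array(tokens)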