[core][optimization] use a pool of numpy ndarray to hold seq data #5877
similar to #5584
the same benchmark command:
python benchmarks/benchmark_throughput.py --output-len 256 --input 256 --model meta-llama/Llama-2-7b-hf -tp 8
the same machine: 8*H100
before (current main): Throughput: 38.07 requests/s, 19493.23 tokens/s
after (this PR): Throughput: 38.94 requests/s, 19939.65 tokens/s
Let's see if it breaks anything. We need to make sure we only use Python lists when receiving or sending a user's request. Everywhere else, we should keep numpy arrays, where slicing is only a view operation. Never copy the whole sequence.
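A minimal sketch of the idea, not the code in this PR: basic slicing of an ndarray returns a view that shares memory, while slicing a Python list copies the elements. The names `ArrayPool`, `acquire`, `release`, and `to_user_output` are hypothetical and only illustrate how a pool of preallocated buffers and a list-only-at-the-boundary rule could fit together.

```python
import numpy as np


class ArrayPool:
    """Hypothetical pool of preallocated token buffers (illustration only)."""

    def __init__(self, max_len: int, size: int) -> None:
        self._max_len = max_len
        self._free = [np.empty(max_len, dtype=np.int64) for _ in range(size)]

    def acquire(self) -> np.ndarray:
        # Reuse a free buffer if one exists, otherwise allocate a new one.
        if self._free:
            return self._free.pop()
        return np.empty(self._max_len, dtype=np.int64)

    def release(self, buf: np.ndarray) -> None:
        # Return the buffer so a later sequence can reuse it.
        self._free.append(buf)


def to_user_output(ids: np.ndarray) -> list:
    # Convert to a Python list only at the user-facing boundary.
    return ids.tolist()


pool = ArrayPool(max_len=16, size=2)
buf = pool.acquire()
buf[:10] = np.arange(10)      # pretend these are the sequence's token ids
token_ids = buf[:10]          # view over the pooled buffer, no copy

prompt_view = token_ids[:4]   # slicing again is still a view
assert np.shares_memory(prompt_view, buf)

token_list = to_user_output(token_ids)
prompt_copy = token_list[:4]  # list slicing copies the elements
prompt_copy[0] = -1
assert token_list[0] == 0     # the original list is unaffected

pool.release(buf)
```

Because views alias the pooled buffer, a buffer should only be released once nothing still reads from views into it.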