In principle you would get better performance by processing the prompts in a batch. The CUDA backend already has a FlashAttention optimization that skips the fully masked-out tails of sequences. In practice, however, you will not see much of a speedup unless the prompts are very short. To get a feel for this, try running llama-bench with a constant number of prompt tokens while varying the batch size. Also keep in mind the trade-off: the more prompt tokens you process per batch, the less total time you need, but the longer the delivery of new tokens to pre-existing requests is interrupted at a time.
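
For example, a sweep along these lines (a rough sketch; the model path is a placeholder and the exact flags may differ between llama.cpp builds, so check `./llama-bench --help`) keeps the prompt length fixed and varies the batch size, so any throughput differences come from batching alone:

```sh
# Sketch of the suggested experiment (model path and flag values are assumptions):
# fixed 2048-token prompt, no generation (-n 0), FlashAttention enabled (-fa 1),
# batch size swept over several values.
./llama-bench -m ./models/model.gguf -fa 1 -p 2048 -n 0 -b 256,512,1024,2048
```

The reported t/s column then shows how prompt-processing throughput scales with batch size; if it plateaus early, batching the prompts will not buy you much.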
