Spans block duplication #92

nataxcan · 2025-09-23T08:54:37Z

This PR implements block duplication.
Previously, a span cache hit was handled the same way as a prefix-cache hit: the new request was served from the same memory reference.
That reference holds the same semantic content (the KV vectors of the same input tokens), but it cannot hold two different positional encodings at once.
So, instead of sharing the memory reference, we let vLLM allocate new blocks and, rather than prefilling them, copy the KV vectors into them, adjusting the positional encodings if needed.
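To make the idea concrete, here is a minimal, self-contained sketch of "allocate a new block and copy with adjusted positions". It is not the code in this PR and does not use vLLM's API: `ToyBlockPool`, `rope_rotate`, and `duplicate_span_block` are hypothetical names, and it assumes rotary positional embeddings (RoPE) applied to keys only, so that moving a span by `delta` tokens amounts to one extra rotation by `delta`.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by RoPE angles for `position`.

    Assumption: standard RoPE with frequency base**(-i/dim) for the pair
    starting at index i. `vec` must have even length.
    """
    dim = len(vec)
    out = list(vec)
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


class ToyBlockPool:
    """Hypothetical stand-in for a KV-cache block allocator (not vLLM's)."""

    def __init__(self):
        self.blocks = {}  # block_id -> (keys, values)
        self._next_id = 0

    def allocate(self):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = None
        return block_id


def duplicate_span_block(pool, src_block_id, src_start, dst_start):
    """Copy a cached block into a freshly allocated one.

    Keys are re-rotated by (dst_start - src_start) so they carry the
    positions of the new location; values are copied unchanged because
    RoPE does not touch them. The source block is left intact.
    """
    keys, values = pool.blocks[src_block_id]
    delta = dst_start - src_start
    if delta:
        new_keys = [rope_rotate(k, delta) for k in keys]
    else:
        new_keys = [list(k) for k in keys]
    new_values = [list(v) for v in values]
    dst_block_id = pool.allocate()
    pool.blocks[dst_block_id] = (new_keys, new_values)
    return dst_block_id


# Usage: cache a two-token span at positions 0..1, then reuse it at 10..11.
pool = ToyBlockPool()
raw_keys = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -1.0, 2.0]]
src = pool.allocate()
pool.blocks[src] = (
    [rope_rotate(k, pos) for pos, k in enumerate(raw_keys)],
    [[1.0], [2.0]],
)
dst = duplicate_span_block(pool, src, src_start=0, dst_start=10)
```

Because rotations in the same plane add, the duplicated keys match what a fresh prefill at positions 10..11 would have produced, while the original block still serves requests that use the span at its original position.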

Co-authored-by: Nathan Ordonez <[email protected]> Signed-off-by: Thomas Parnell <[email protected]> Signed-off-by: Nathan Ordonez <[email protected]>

Signed-off-by: Nathan Ordonez <[email protected]>

An initial implementation of span semantics in vLLM. Please note that this has a known bug when concurrent sequences re-use the same span in different locations. We are working on a solution for this, but in the meantime accuracy may be negatively affected.

Signed-off-by: Thomas Parnell <[email protected]> Signed-off-by: Nathan Ordonez <[email protected]> Co-authored-by: Nathan Ordonez <[email protected]>


Signed-off-by: Nathan Ordonez <[email protected]>

nataxcan · 2025-09-24T10:47:24Z

currently running evals to check accuracy...

Signed-off-by: Nathan Ordonez <[email protected]>

starpit · 2025-09-24T17:25:02Z

i see that this addresses the regression in SPANS_DEBUG, but the hashes printed out aren't useful. to turn the bytes into a readable string, i think we need something like this:

# just for example...
from binascii import hexlify

b = bytes.fromhex("abcd1234")

# turn b back into the string "abcd":
def pretty(b):
    # hex-encode the bytes, then keep the first 4 hex characters
    return hexlify(b).decode('utf-8')[:4]

print(pretty(b))  # prints: abcd

whereas this PR uses str(b)[:4], which truncates the repr of the bytes object (here it gives b'\x) and so produces pretty much garbage output.

nataxcan force-pushed the spans-block-duplication branch from 5b31ef2 to 919e73c · September 24, 2025 10:10

tdoublep and others added 7 commits September 24, 2025 06:17

Initial supports for spans/block-attention.

ef59a8f

Co-authored-by: Nathan Ordonez <[email protected]> Signed-off-by: Thomas Parnell <[email protected]> Signed-off-by: Nathan Ordonez <[email protected]>

initial impl (runs, but accuracy dropped)

6491b42

Signed-off-by: Nathan Ordonez <[email protected]>

bug fix (block duplication seems to work)

7a4e46b

Signed-off-by: Nathan Ordonez <[email protected]>

bugfix repositioning

92812ef

Signed-off-by: Nathan Ordonez <[email protected]>

bugfix, benefits now show up (and including benchmark that shows said…

4f5c00f

… benefits) Signed-off-by: Nathan Ordonez <[email protected]>

development folder

3998c6f

Signed-off-by: Nathan Ordonez <[email protected]>

nataxcan force-pushed the spans-block-duplication branch from 919e73c to 3998c6f · September 24, 2025 10:27

nataxcan added 2 commits September 24, 2025 06:33

Merge branch 'main' into spans-block-duplication

3d84e53

Signed-off-by: Nathan Ordonez <[email protected]>

Merge branch 'main' into spans-block-duplication

ffcd788

Signed-off-by: Nathan Ordonez <[email protected]>

bugfix

cdae9f9

Signed-off-by: Nathan Ordonez <[email protected]>

speed optimizations (from 6x to 1.3x overhead)

116b457
