Replies: 2 comments
Here is an initial implementation that I believe should match the algorithm from the paper: #4810. I'm still not 100% sure it's correct, but the passkey test works with a big context. I haven't done PPL tests yet, but the implementation is very simple and it avoids the double dot product that is performed in the implementation from the paper, so it should be more efficient. Also added to
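For anyone who wants to reproduce a passkey-style check, here is a minimal sketch of how such a prompt could be built. It is not the tool used in this thread: the function name `build_passkey_prompt`, the filler sentence, and the wording of the instructions are all assumptions made for illustration.

```python
import random


def build_passkey_prompt(n_filler, seed=0):
    """Build a long prompt that hides a random 5-digit passkey in filler text.

    Hypothetical helper: returns (prompt, passkey). The model under test is
    expected to answer with the passkey after reading the whole prompt.
    """
    if n_filler < 1:
        raise ValueError("n_filler must be at least 1")
    rng = random.Random(seed)
    passkey = rng.randint(10000, 99999)
    insert_at = rng.randrange(n_filler)  # filler sentence the key is placed before

    filler = "The grass is green. The sky is blue. The sun is yellow."
    parts = ["There is important information hidden in a lot of irrelevant text. Find it and memorize it."]
    for i in range(n_filler):
        if i == insert_at:
            parts.append(f"The pass key is {passkey}. Remember it. {passkey} is the pass key.")
        parts.append(filler)
    parts.append("What is the pass key? The pass key is")
    return " ".join(parts), passkey


prompt, passkey = build_passkey_prompt(n_filler=200, seed=42)
print(len(prompt), passkey)
```

Raising `n_filler` pushes the passkey further from the question, so the test passes only if the model can still attend to tokens far outside its trained context.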
Interesting, the PPL looks even better than SWA.
Nvm, it's done by non-overlapping blocks.
https://arxiv.org/abs/2401.01325
The proposed algorithm
Perplexity results
It appears pretty simple to implement and promises good results, so perhaps it's worth giving it a try. The authors limited themselves to 16k context lengths due to hardware constraints, so it's not clear how well this scales beyond 16k.
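To make the idea concrete, here is a rough sketch of the relative-position mapping as I understand it from the paper: nearby tokens keep their exact relative positions, while distant tokens use floor-divided ("grouped") positions, shifted so the two ranges join up. The names `self_extend_rel_pos`, `group_size`, and `neighbor_window` are my own, and this only shows the position arithmetic, not an attention implementation.

```python
def self_extend_rel_pos(q_pos, k_pos, group_size, neighbor_window):
    """Relative position used for a (query, key) pair under Self-Extend.

    Sketch based on my reading of the paper, not a reference implementation.
    """
    if k_pos > q_pos:
        raise ValueError("causal attention: key must not come after query")
    distance = q_pos - k_pos
    if distance < neighbor_window:
        # Neighbor tokens: ordinary attention with the exact relative position.
        return distance
    # Distant tokens: grouped positions, with the query shifted so that the
    # grouped range starts where the neighbor window ends.
    shift = neighbor_window - neighbor_window // group_size
    return (q_pos // group_size + shift) - k_pos // group_size


def max_rel_pos(seq_len, group_size, neighbor_window):
    """Largest relative position seen when attending over seq_len tokens."""
    return self_extend_rel_pos(seq_len - 1, 0, group_size, neighbor_window)


# With group size 4 and a neighbor window of 8:
print(self_extend_rel_pos(100, 95, 4, 8))  # inside the window: exact distance
print(self_extend_rel_pos(100, 0, 4, 8))   # far away: compressed position
print(max_rel_pos(16384, 4, 8))            # stays well below 16384
```

The point is that the largest relative position grows roughly as `seq_len / group_size`, so a model trained on a shorter context never sees positions outside the range it was trained on.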