Interesting paper - LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning #4785

Galunid (Collaborator) · 2024-01-05T15:05:55Z

We propose Self-Extend to elicit LLMs' inherent long context capabilities. To overcome the positional O.O.D. issue, Self-Extend uses the simple FLOOR (//) operation as the mapping function to map unseen large relative positions to those encountered during pretraining. This idea stems from two intuitions: 1) For texts with a long distance between words, the exact position does not need to be precise. It is sufficient to understand the overall meaning of the text as long as the relative ordering of the different parts is maintained. When answering a question about information from a lengthy text, we never remember the precise position of each word, just the general position and order of the relevant information. Since natural language texts tend to have similar semantics within a short range (e.g. a paragraph), close or even equal position encodings should be adequate for maintaining the relative ordering of useful information. This aligns with the floor operation. 2) In natural language texts, most of the time, while a small bag of words (n-grams) appears together in one area, all the tokens in that bag have only one possible order due to the conventions of the language grammar. Although theoretically a bag of tokens could appear in any order, in practice it is rare for a small set of words to have more than one sensible ordering. For example, "unnecessary encodings" can be tokenized as "unn", "ecessary", " enc" and "odings", but these tokens can only meaningfully appear in that order. This suggests that maintaining precise position information is unnecessary in a small region, which also aligns with the floor operation.
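To make the floor mapping in the quote concrete, here is a minimal Python sketch. It only illustrates the `//` grouping of positions described above; the function names and the `group_size` parameter are made up for this example, and it is not the paper's full algorithm or the llama.cpp code.

```python
def grouped_position(pos: int, group_size: int) -> int:
    """Map an absolute token position to a group index with floor division."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return pos // group_size


def grouped_relative_position(query_pos: int, key_pos: int, group_size: int) -> int:
    """Relative distance between a query and a key after grouping both positions."""
    return grouped_position(query_pos, group_size) - grouped_position(key_pos, group_size)


# Twelve consecutive tokens collapse into three groups of four.
print([grouped_position(p, 4) for p in range(12)])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# The largest relative distance in a 16384-token sequence shrinks to 4095,
# which fits inside a hypothetical 4096-token pretraining window.
print(grouped_relative_position(16383, 0, 4))
# 4095
```

Tokens inside one group share a position, so their order relative to each other is lost, but the order of the groups is kept. That is the trade-off the two intuitions in the quote argue is acceptable.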

[Figure: The proposed algorithm]

[Figure: Perplexity results]

It appears pretty simple to implement and promises good results, so perhaps it's worth giving it a try. The authors limited themselves to 16k context lengths due to hardware constraints, so it's not clear how well this scales beyond 16k.

ggerganov (Maintainer) · 2024-01-07T15:34:17Z

Here is an initial implementation that I believe should match the algorithm from the paper: #4810

Still not 100% sure if it's correct, but the passkey test works with a big context. Haven't done PPL tests yet, but the implementation is very simple and it avoids the double dot product that is performed in the implementation from the paper, so it should be more efficient.

Also added to the `main` example: #4815
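For readers unfamiliar with the passkey test mentioned above: the idea is to hide a short number inside a long run of filler text and ask the model to repeat it. Below is a rough Python sketch of building such a prompt. The wording of the filler and the function name `build_passkey_prompt` are assumptions for illustration, not the exact prompt used in #4810.

```python
def build_passkey_prompt(passkey: int, n_filler: int, insert_at: int) -> str:
    """Build a long prompt with a passkey hidden among repeated filler sentences.

    n_filler  -- number of filler chunks (controls the total context length)
    insert_at -- index of the filler chunk before which the passkey is placed
    """
    if not 0 <= insert_at <= n_filler:
        raise ValueError("insert_at must be between 0 and n_filler")

    # Hypothetical wording; any irrelevant, repetitive text would do.
    intro = ("There is an important pass key hidden inside a lot of "
             "irrelevant text. Find it and memorize it.")
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again.")
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key."
    question = "What is the pass key? The pass key is"

    chunks = [filler] * n_filler
    chunks.insert(insert_at, needle)
    return " ".join([intro, *chunks, question])


prompt = build_passkey_prompt(passkey=12345, n_filler=100, insert_at=40)
print(len(prompt.split()), "words")
```

The test passes if the model's continuation contains the passkey. Raising `n_filler` pushes the prompt past the training context, which is where a context-extension method has to prove itself.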


bullno1 · 2024-01-08T02:39:32Z

Interesting, the PPL looks even better than SWA.

~~Intuitively, does this mean further tokens are less important but still kinda-sorta attended to?~~
~~Since the cache keeps getting compressed.~~

~~SWA feels like it works that way too.~~

Nvm, it's done by non-overlapping blocks.
