@CharlieFRuan commented Nov 8, 2023

This PR adds support for Mistral. The implementation follows the Mistral paper, specifically including sliding window attention (SWA), the rolling buffer cache, and chunking, as discussed in Section 2 of the paper.

This PR is largely analogous to the changes in llm_chat.cc in mlc-llm's PR mlc-ai/mlc-llm#1087.

Unlike the approach in #202, this implementation takes advantage of SWA, so there is no longer a maximum window size, which is one of the main benefits of Mistral.
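For illustration only, here is a minimal TypeScript sketch of the idea behind the rolling buffer cache and chunked prefill. This is not the code in this PR; the helper names (`cacheSlot`, `attentionSpan`, `prefillChunks`) are hypothetical, and the 4096/1024 sizes simply match one of the configurations tested below.

```ts
// Illustrative sketch (not the actual web-llm implementation).

const slidingWindowSize = 4096; // SWA window size
const prefillChunkSize = 1024;  // prefill chunk size

// Rolling buffer cache: token at absolute position p is stored at slot
// p % slidingWindowSize, so the KV cache never grows beyond the window.
function cacheSlot(position: number): number {
  return position % slidingWindowSize;
}

// Each new token attends to at most the last slidingWindowSize tokens,
// no matter how long the prompt is.
function attentionSpan(position: number): number {
  return Math.min(position + 1, slidingWindowSize);
}

// Chunked prefill: feed the prompt in fixed-size chunks so peak memory
// stays bounded; the output is unchanged, only speed/memory differ.
function* prefillChunks(promptTokens: number[]): Generator<number[]> {
  for (let i = 0; i < promptTokens.length; i += prefillChunkSize) {
    yield promptTokens.slice(i, i + prefillChunkSize);
  }
}

// Example: a 10000-token prompt is prefilled in 10 chunks of <=1024 tokens,
// while the cache only ever holds the most recent 4096 tokens.
const prompt = Array.from({ length: 10000 }, (_, i) => i);
let processed = 0;
for (const chunk of prefillChunks(prompt)) {
  processed += chunk.length; // model.forward(chunk) would run here
}
console.log(processed, cacheSlot(9999), attentionSpan(9999));
// -> 10000, 1807, 4096
```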

Tested:

  • Works well with:
    • 4096 sliding window size, with 1024 chunk size
    • Note that a smaller chunk size does not affect the generated output; it only reduces memory requirements at the cost of slower prefill
    • 2048 sliding window size with 2048 chunk size
  • Tested with prompts several times longer than the sliding window size / chunk size

Unrelatedly, this PR also makes the Wizard models reuse the Llama model libraries, given the dynamic vocab size support (updated just now due to shuffle support).

cc @tqchen

@tqchen merged commit a9efc67 into mlc-ai:main on Nov 8, 2023
@tqchen commented Nov 8, 2023

Great, @CharlieFRuan. Let us make a new npm release and update the demo.

@CharlieFRuan deleted the pr-1106-shuffle branch on November 9, 2023