
Add Mixture of Experts: Mixtral 8x7B release #1991

@casper-hansen

Description

Mistral AI released their new model, Mixtral, a Mixture of Experts (MoE) architecture based on MegaBlocks. It consists of 8 experts of 7 billion parameters each.

Here is the model configuration (a rough sketch of the routing it implies follows the list):

  • dim: 4096
  • n_layers: 32
  • head_dim: 128
  • hidden_dim: 14336
  • n_heads: 32
  • n_kv_heads: 8
  • norm_eps: 1e-05
  • vocab_size: 32000
  • moe:
    • num_experts_per_tok: 2
    • num_experts: 8
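
To make the `moe` block concrete, here is a minimal sketch in plain PyTorch (not vLLM code) of top-2 routing over 8 experts as the config suggests: a linear gate scores all experts per token, the top `num_experts_per_tok` are selected, and their outputs are combined with softmax-normalized weights. The expert MLP shape (SwiGLU with `hidden_dim` 14336) is assumed to match the dense Mistral 7B feed-forward; exact Mixtral internals are an assumption until the reference implementation is available.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertMLP(nn.Module):
    """SwiGLU feed-forward block, same shape as the dense Mistral 7B MLP (assumed)."""

    def __init__(self, dim: int = 4096, hidden_dim: int = 14336):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class SparseMoE(nn.Module):
    """Token-level top-k routing over a set of expert MLPs (illustrative sketch)."""

    def __init__(self, dim: int = 4096, hidden_dim: int = 14336,
                 num_experts: int = 8, num_experts_per_tok: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            ExpertMLP(dim, hidden_dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.top_k = num_experts_per_tok

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        logits = self.gate(x)                               # (num_tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1, dtype=torch.float).to(x.dtype)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            # Tokens whose top-k selection includes expert i.
            token_idx, slot = torch.where(indices == i)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
        return out


if __name__ == "__main__":
    moe = SparseMoE()
    y = moe(torch.randn(4, 4096))  # 4 tokens -> (4, 4096)
    print(y.shape)
```

Per token only `num_experts_per_tok` of the `num_experts` feed-forward blocks are evaluated, so per-token compute stays close to a dense 7B-scale model even though the total parameter count is much larger.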

Weights: https://twitter.com/MistralAI/status/1733150512395038967
Paper: https://arxiv.org/pdf/2211.15841.pdf
Code: https://github.com/stanford-futuredata/megablocks

CC: @WoosukKwon @zhuohan123 for visibility.
