
Add Mixture of Experts: Mixtral 8x7B release #1991

@casper-hansen

Description

Mistral AI released their new model, Mixtral, a Mixture of Experts (MoE) architecture based on MegaBlocks. It consists of 8 experts of 7 billion parameters each.

Here is the model configuration (a rough sketch of the routing it implies follows the list):

  • dim: 4096
  • n_layers: 32
  • head_dim: 128
  • hidden_dim: 14336
  • n_heads: 32
  • n_kv_heads: 8
  • norm_eps: 1e-05
  • vocab_size: 32000
  • moe:
    • num_experts_per_tok: 2
    • num_experts: 8
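
To make the `moe` block concrete, here is a minimal sketch in plain PyTorch (not vLLM code) of top-2 routing over 8 experts as the config suggests: a linear gate scores all experts per token, the top `num_experts_per_tok` are selected, and their outputs are combined with softmax-normalized weights. The expert MLP shape (SwiGLU with `hidden_dim` 14336) is assumed to match the dense Mistral 7B feed-forward; exact Mixtral internals are an assumption until the reference implementation is available.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertMLP(nn.Module):
    """SwiGLU feed-forward block, same shape as the dense Mistral 7B MLP (assumed)."""

    def __init__(self, dim: int = 4096, hidden_dim: int = 14336):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class SparseMoE(nn.Module):
    """Token-level top-k routing over a set of expert MLPs (illustrative sketch)."""

    def __init__(self, dim: int = 4096, hidden_dim: int = 14336,
                 num_experts: int = 8, num_experts_per_tok: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            ExpertMLP(dim, hidden_dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.top_k = num_experts_per_tok

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        logits = self.gate(x)                               # (num_tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1, dtype=torch.float).to(x.dtype)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            # Tokens whose top-k selection includes expert i.
            token_idx, slot = torch.where(indices == i)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
        return out


if __name__ == "__main__":
    moe = SparseMoE()
    y = moe(torch.randn(4, 4096))  # 4 tokens -> (4, 4096)
    print(y.shape)
```

Per token only `num_experts_per_tok` of the `num_experts` feed-forward blocks are evaluated, so per-token compute stays close to a dense 7B-scale model even though the total parameter count is much larger.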

Weights: https://twitter.com/MistralAI/status/1733150512395038967
Paper: https://arxiv.org/pdf/2211.15841.pdf
Code: https://github.com/stanford-futuredata/megablocks

CC: @WoosukKwon @zhuohan123 for visibility.
