
[Feature]: Tensor Parallelism with non-divisible number of attention heads #5003

@NadavShmayo

Description


🚀 The feature, motivation and pitch

I am trying to run a 70B model on a node with 3x A100-80GB GPUs.
2x A100-80GB do not have enough VRAM to hold the model, and when I run vLLM with a tensor parallel size of 3, it raises an error saying that the number of attention heads is not divisible by 3.
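For context, the failure is a divisibility check of roughly this shape. The snippet below is a minimal sketch of that kind of validation, not vLLM's actual code; the names `check_tp_divisibility`, `num_attention_heads`, and `tensor_parallel_size` are assumptions for illustration.

```python
# Illustrative sketch of the kind of validation that rejects TP=3 here.
# Not vLLM's actual implementation; names are hypothetical.

def check_tp_divisibility(num_attention_heads: int, tensor_parallel_size: int) -> None:
    if num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"Total number of attention heads ({num_attention_heads}) "
            f"must be divisible by tensor parallel size ({tensor_parallel_size})."
        )

# Example: a Llama-style 70B model typically has 64 query heads.
check_tp_divisibility(64, 2)  # OK: 32 heads per GPU
check_tp_divisibility(64, 3)  # raises ValueError: 64 % 3 != 0
```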

I looked into changing the tensor parallelism implementation so that it supports an uneven division of the tensors between GPUs, but I might be missing something here, as there are many validations in the codebase that prevent this scenario.
Is it possible to implement tensor parallelism this way?
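To make the idea concrete, here is a minimal sketch of one way attention heads could be partitioned unevenly across tensor parallel ranks, with the first `num_heads % tp_size` ranks taking one extra head. This assumes nothing about vLLM's internals; `partition_heads` is a hypothetical helper, not an existing API.

```python
# Hypothetical sketch of an uneven head partition across TP ranks.
# Not vLLM code; it only illustrates the proposed "non-divisible" split.

def partition_heads(num_heads: int, tp_size: int) -> list[range]:
    """Return the head index range owned by each rank, allowing a remainder."""
    base, remainder = divmod(num_heads, tp_size)
    ranges = []
    start = 0
    for rank in range(tp_size):
        count = base + (1 if rank < remainder else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges

# 64 heads over 3 GPUs -> ranks own 22, 21, and 21 heads respectively.
print(partition_heads(64, 3))
# [range(0, 22), range(22, 43), range(43, 64)]
```

With such a split the per-rank shard sizes differ, so any collective over the head dimension would presumably need variable-size variants (an all-gather-v style operation), which is likely why the current code validates for an even split throughout.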

Alternatives

No response

Additional context

No response

Labels: feature request, stale (over 90 days of inactivity)
