Replies: 3 comments 2 replies
-
| Bumping.
assert self.total_num_heads % tp_size == 0
# ...
assert self.total_num_kv_heads % tp_size == 0
# or
assert tp_size % self.total_num_kv_heads == 0
And I also believe serving requires the vocab size to be divisible by tp? And hidden size? And hidden layers? The issue is that
total_num_heads = 40
total_num_kv_heads = 10
vocab_size = 32064
hidden_size = 5120
num_hidden_layers = 40
Again, not sure about the vocab size, but if that is the case, only rigs with 2 GPUs would work. If vocab size doesn't matter, we'd have to have either 2, 5, 10, 20, or 40 GPUs |
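The divisibility conditions quoted above can be checked by brute force. This is a minimal sketch, not vLLM code: `valid_tp_sizes` is a hypothetical helper, and the optional vocab-size check is only the unconfirmed assumption raised in this comment.

```python
def valid_tp_sizes(num_heads, num_kv_heads, max_gpus, vocab_size=None):
    """List tensor-parallel sizes that pass the asserts quoted in this thread."""
    sizes = []
    for tp in range(1, max_gpus + 1):
        # Query heads must split evenly across GPUs.
        if num_heads % tp != 0:
            continue
        # KV heads must either split evenly, or tp must be a multiple of them.
        if num_kv_heads % tp != 0 and tp % num_kv_heads != 0:
            continue
        # Assumption from the comment (unconfirmed): vocab size divisible by tp.
        if vocab_size is not None and vocab_size % tp != 0:
            continue
        sizes.append(tp)
    return sizes


# Phi-3-medium values listed in the comment above.
print(valid_tp_sizes(40, 10, 40))                    # [1, 2, 5, 10, 20, 40]
print(valid_tp_sizes(40, 10, 40, vocab_size=32064))  # [1, 2]
```

Under these checks alone, a size of 4 fails only on the KV-head condition, which matches the error reported in the original post.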
-
| According to https://docs.vllm.ai/en/stable/serving/distributed_serving.html#multi-node-inference-and-serving 
 
 So I think you would need to use  | 
-
| @ccruttjr did you ever find a solution to this issue? I am running into the same issue with Phi-3-medium-128k-instruct. I have 4 Nvidia Tesla T4s, which have 16 GB RAM each. Using dtype float16, the model should only be 26.0 GB, so it should fit on these GPUs easily. tensor-parallel-size 4 gives me this error | 
-
Hello there :)
I'm trying to deploy microsoft/Phi-3-medium-128k-instruct on NVIDIA L4 GPUs with the latest version of vLLM (0.5.0).
I tried with 4 GPUs using the CLI command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model microsoft/Phi-3-medium-128k-instruct --trust-remote-code --port 8000 --tensor-parallel-size 4
But this throws an error because the number of KV heads (10) is not a multiple of 4.
Error details:
From my understanding, Phi-3-medium uses the Phi3ForCausalLM architecture, which vLLM treats as a Llama model.
The attention layer of this model throws this error when it tries to distribute the KV heads across the tensor-parallel GPUs.
I can't use only 2 GPUs, as they are too low on memory, and I only have access to L4 GPUs for the moment.
If anyone has an idea on how to make this work, I'm all ears :)
Thanks
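To illustrate why 4 GPUs fail here, the sketch below computes how many query and KV heads each GPU would hold. It is an illustration based only on the asserts quoted in this thread, not vLLM's actual implementation: `heads_per_gpu` is a hypothetical name, and treating the `tp_size % total_num_kv_heads == 0` case as "each GPU holds one replicated KV head" is an assumption.

```python
def heads_per_gpu(total_num_heads, total_num_kv_heads, tp_size):
    """Return (query_heads, kv_heads) held by each GPU, or raise ValueError."""
    if total_num_heads % tp_size != 0:
        raise ValueError(
            f"{total_num_heads} query heads cannot be split across {tp_size} GPUs"
        )
    if total_num_kv_heads >= tp_size:
        # Enough KV heads to partition them evenly across GPUs.
        if total_num_kv_heads % tp_size != 0:
            raise ValueError(
                f"{total_num_kv_heads} KV heads is not a multiple of {tp_size}"
            )
        kv_heads = total_num_kv_heads // tp_size
    else:
        # Fewer KV heads than GPUs: assume each KV head is replicated.
        if tp_size % total_num_kv_heads != 0:
            raise ValueError(
                f"{tp_size} GPUs is not a multiple of {total_num_kv_heads} KV heads"
            )
        kv_heads = 1
    return total_num_heads // tp_size, kv_heads


print(heads_per_gpu(40, 10, 2))  # (20, 5)
try:
    heads_per_gpu(40, 10, 4)
except ValueError as err:
    print("tp=4 rejected:", err)
```

With 40 query heads and 10 KV heads, 2 GPUs get 20 query heads and 5 KV heads each, while 4 GPUs would need 2.5 KV heads each, so the check rejects that size.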