Mixtral GPTQ with TP=2 not generating output #2728
Can you try adding `disable_custom_all_reduce=True`?
I have the same issue with AWQ and this config:

```python
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.75,
    max_model_len=256,
    disable_custom_all_reduce=True,
    enforce_eager=True,
)
```
This works fine for me:

```python
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    dtype=torch.float16,
    tensor_parallel_size=2,
    max_model_len=16384,
    revision="gptq-4bit-32g-actorder_True",
    gpu_memory_utilization=0.75,
    disable_custom_all_reduce=True,
    enforce_eager=True,
)
```

EDIT: Not all generations return even in this configuration. Example:

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
```
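Putting the two snippets above together, a minimal end-to-end repro might look like this; the imports and the final generate/print loop are assumptions based on the standard vLLM API, not part of the original comment:

```python
import torch
from vllm import LLM, SamplingParams

# Working configuration reported above: GPTQ Mixtral, TP=2, eager mode,
# custom all-reduce disabled.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    dtype=torch.float16,
    tensor_parallel_size=2,
    max_model_len=16384,
    revision="gptq-4bit-32g-actorder_True",
    gpu_memory_utilization=0.75,
    disable_custom_all_reduce=True,
    enforce_eager=True,
)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

# If the bug is present, this call hangs or some prompts never return.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(repr(output.prompt), "->", repr(output.outputs[0].text))
```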
I think there is an issue with that AWQ quant specifically. Try this one instead: https://huggingface.co/casperhansen/mixtral-instruct-awq. It worked for me! Just be sure to set enforce_eager=True like you did before.
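For reference, a hedged sketch of that suggestion: the model id comes from the link above, while the remaining parameters mirror the earlier configs and are assumptions rather than a verified setup.

```python
from vllm import LLM

# Suggested alternative AWQ quant; enforce_eager=True is the setting the
# commenter calls out as required.
llm = LLM(
    model="casperhansen/mixtral-instruct-awq",
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.75,
    disable_custom_all_reduce=True,
    enforce_eager=True,
)
```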
@hanzhi713 Thanks! It works with disable_custom_all_reduce enabled.
@SebastianBodza Which GPU model are you using? |
@hanzhi713 2x RTX 3090 with CUDA 12.3
@SebastianBodza Can you try this potential fix? #2760 Remember to rebuild vLLM from source afterwards.
Still not working for me. The GPUs are stuck at around 150 W and generation never returns. I don't know if it is relevant, but I have to run with NCCL_P2P_DISABLE=1 to load the model at all.
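For context, a minimal sketch of how the NCCL_P2P_DISABLE=1 workaround can be applied from Python, assuming the common single-node case where vLLM starts Ray itself; if you attach to an existing Ray cluster, set the variable in the shell that launched the cluster instead.

```python
import os

# Must be set before the NCCL communicators are created, i.e. before the
# LLM (and its workers) are constructed. Whether it propagates to the
# workers depends on how Ray was launched; prefixing the launch command
# with NCCL_P2P_DISABLE=1 in the shell is the more reliable option.
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,
    enforce_eager=True,
)
```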
> it worked for me! just be sure to set enforce_eager=True like you did before

For me, this also resolved the issue.
@SebastianBodza What error do you observe when you don't set NCCL_P2P_DISABLE=1? This might be relevant: if you can't let NCCL use P2P, custom all-reduce shouldn't use it either, and custom all-reduce can't function without P2P.
When NCCL_P2P_DISABLE=1 is not set, model loading freezes, just like in #1801.
@SebastianBodza I see. Then it's expected that it will also freeze with custom all-reduce enabled. The underlying cause might be the same as in #1801, where the driver is buggy.
@hanzhi713 Thanks for the fixes. I am not sure whether they were actually necessary; however, SinanAkkoyun seems to approve them, sorry for that. Could we add a check for P2P support and either fall back to disable_custom_all_reduce or throw an error? For me, vLLM 0.3.0 works with NVIDIA driver 535.154.05.
@SebastianBodza Hi, I didn't approve any changes; I only confirmed that the other AWQ model plus enforce_eager=True works for me. Does it for you?
@SebastianBodza Unfortunately, I don't think there's a good way to detect that. We already check P2P support via the CUDA runtime API (NCCL likely uses the same check), and the check passed. Basically, the problem is that the runtime reports that it supports P2P, but the underlying implementation is buggy.
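For illustration, a hedged sketch of what such a runtime-level check looks like through PyTorch's wrapper over the CUDA runtime API; as noted above, it only reflects what the driver claims, not whether P2P actually works.

```python
import torch

def p2p_reported_supported(dev_a: int = 0, dev_b: int = 1) -> bool:
    # Wraps cudaDeviceCanAccessPeer: returns what the runtime reports,
    # which on buggy driver setups can be True even though P2P copies
    # silently fail.
    return (torch.cuda.can_device_access_peer(dev_a, dev_b)
            and torch.cuda.can_device_access_peer(dev_b, dev_a))

print("P2P reported:", p2p_reported_supported())
```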
I ran into this with exllamav2 as well. The maintainer added a fix that does a quick check to see whether it is safe to move tensors between devices, and uses that to decide whether to move them directly or via the CPU. The NVIDIA drivers seem to have consistent issues properly reporting this on 4090s.
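A hedged sketch of that kind of sanity check (not the actual exllamav2 code): copy a known tensor between two GPUs and verify it arrives intact, falling back to CPU-staged transfers if it does not.

```python
import torch

def direct_copy_is_safe(src: int = 0, dst: int = 1, n: int = 1024) -> bool:
    # Copy a known pattern from one GPU to the other and compare it against
    # a reference built directly on the destination device.
    x = torch.arange(n, device=f"cuda:{src}", dtype=torch.float32)
    y = x.to(f"cuda:{dst}")
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    ref = torch.arange(n, device=f"cuda:{dst}", dtype=torch.float32)
    return bool(torch.equal(y, ref))

if direct_copy_is_safe():
    print("direct GPU-to-GPU copies look fine")
else:
    print("falling back to transfers staged through the CPU")
```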
In the new vLLM 0.3 release, Mixtral with GPTQ does not generate any output anymore. Loading the model works fine, but when calling llm.generate it gets stuck.

Currently using:

The worker seems to be stuck in the llm_engine.step() function / _run_workers call.