Description
I have two MI60s that perform poorly during prompt evaluation. What could be the reason?
Model Llama3-70B Q6:
llama_print_timings: prompt eval time = 3722.63 ms / 18 tokens ( 206.81 ms per token, 4.84 tokens per second)
llama_print_timings: eval time = 4274.60 ms / 35 runs ( 122.13 ms per token, 8.19 tokens per second)
compile:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 16
ROCk module version 6.7.0
When using an 8B Q8 model, it does this:
llama_print_timings: prompt eval time = 200.58 ms / 18 tokens ( 11.14 ms per token, 89.74 tokens per second)
llama_print_timings: eval time = 1819.74 ms / 94 runs ( 19.36 ms per token, 51.66 tokens per second)
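The gap is easiest to see as prompt-eval throughput: both runs process the same 18-token prompt, but the 70B Q6 run manages only about 4.84 t/s versus about 89.74 t/s for the 8B Q8 run, an ~18x difference. A quick sketch (plain Python, not part of llama.cpp) recomputes those figures from the timings above:

```python
def tokens_per_second(total_ms: float, tokens: int) -> float:
    """Throughput implied by a llama_print_timings line."""
    return tokens / (total_ms / 1000.0)

# 70B Q6 prompt eval: 3722.63 ms for 18 tokens
print(round(tokens_per_second(3722.63, 18), 2))  # ~4.84 t/s
# 8B Q8 prompt eval: 200.58 ms for 18 tokens
print(round(tokens_per_second(200.58, 18), 2))   # ~89.74 t/s
```

For prompt eval (a batched, compute-bound pass) the expected slowdown from 8B to 70B should be roughly the parameter ratio, not 18x, which suggests something beyond model size is limiting the larger run.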
I also applied the hack from #3772 (comment), which fixed the garbled-output issue, but I don't know whether it is related.
Now I am wondering if it is a 6-bit quantization issue.
Thank you!