Skip to content

HIPBLAS / ROCm low prompt eval performance #7533

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
etemiz opened this issue May 25, 2024 · 2 comments
Closed

HIPBLAS / ROCm low prompt eval performance #7533

etemiz opened this issue May 25, 2024 · 2 comments

Comments

@etemiz
Copy link

etemiz commented May 25, 2024

I have two MI60's that don't perform well during prompt evaluation. What could be the reason?

Model Llama3-70B Q6:

llama_print_timings: prompt eval time = 3722.63 ms / 18 tokens ( 206.81 ms per token, 4.84 tokens per second)
llama_print_timings: eval time = 4274.60 ms / 35 runs ( 122.13 ms per token, 8.19 tokens per second)

compile:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 16

ROCk module version 6.7.0

When using an 8B model Q8, it does this:

llama_print_timings: prompt eval time = 200.58 ms / 18 tokens ( 11.14 ms per token, 89.74 tokens per second)
llama_print_timings: eval time = 1819.74 ms / 94 runs ( 19.36 ms per token, 51.66 tokens per second)

I also did this hack #3772 (comment) which fixed the garbled output issue but I don't know if it is related.

Now I am wondering if it is a 6 bit quantization issue..

Thank you!

@Engininja2
Copy link
Contributor

The mul_mat_q kernels have tuning configurations for a few different GPU architectures. Right now anything with a smaller ISA version than RDNA2 gets the settings for RDNA1. RDNA1/gfx1010 has more VGPRs than gfx906 does so when compiled for gfx906 the kernels end up spilling a lot of VGPRs to scratch memory, which is slow.

For example mul_mat_q6_K() has a vgpr_spill_count of 200 registers compiled for gfx906, where gfx1010 has 0. When compiling with make you can set HIPFLAGS="-save-temps" to get a *.s file for each HIP source file that contains the resulting assembly and stats like the spill count and the max occupancy of a kernel on the GPU.

Someone with a Vega GPU will need to try new values for x, y, and nwarps for each mmq kernel, and I think that __launch_bounds__() will need to be set for Vega cards to get the compiler to limit occupancy to 2. The default looks like it's 4. The advantage of limiting occupancy is that this makes more VGPRs available, but one disadvantage is that the GPU has more trouble hiding memory latency with fewer waves in flight, so it needs testing to see which is better.

@github-actions github-actions bot added the stale label Jul 28, 2024
Copy link
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants