
Multi-GPU support for AMD? #3051


Closed
ElliottDyson opened this issue Sep 7, 2023 · 28 comments

@ElliottDyson

Do you have multi-GPU support for AMD? If not, do you see it as something you might add in the future?

@ccbadd

ccbadd commented Sep 7, 2023

If you compile with hipBLAS, you can use multiple AMD GPUs. I have it working fine on Linux and am working on getting it to compile under Windows.
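
A build along these lines should do it (the compiler paths and job count are just examples; adjust for your ROCm install):

# Makefile build with the ROCm/hipBLAS backend
make clean && LLAMA_HIPBLAS=1 make -j

# or the CMake equivalent, using ROCm's clang
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -H. -Bbuild -DLLAMA_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -- -j 16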

@jart
Contributor

jart commented Jan 27, 2024

Multiple AMD GPU support isn't working for me. @ccbadd Have you tried it? I checked out llama.cpp from early Sept. 2023 and it isn't working for me there either. I don't think it's ever worked.

I have a Linux system with 2x Radeon RX 7900 XTX. Both of them are recognized by llama.cpp, but the LLM just prints a bunch of # tokens.

I have workarounds. For starters, I can say export HIP_VISIBLE_DEVICES=0 to force the HIP SDK to only show the first GPU to llama.cpp. Alternatively, I can say -ts 1,0 or -ts 0,1 so that tensor splitting favors one GPU or the other, and both of those flags work. But the moment the split touches multiple GPUs the LLM starts outputting gibberish. Has anyone got any ideas of how I might troubleshoot and solve this for you?
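
Spelled out as commands, the two workarounds look like this (the model path is purely an example):

# Workaround 1: hide the second GPU from the HIP runtime entirely
export HIP_VISIBLE_DEVICES=0
./main -m ~/weights/mistral-7b-instruct-v0.2.Q5_K_M.gguf -p 'hello world' --temp 0 -ngl 33

# Workaround 2: keep both GPUs visible but put the whole tensor split on one of them
./main -m ~/weights/mistral-7b-instruct-v0.2.Q5_K_M.gguf -p 'hello world' --temp 0 -ngl 33 -ts 1,0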

@ccbadd

ccbadd commented Jan 27, 2024

@jart Depending on the version you have pulled, multi-GPU support for both AMD and NVIDIA has been a little unstable. I would pull a more recent build and try again. Also be sure to update ROCm: v6 is available right now and has better support for your GPUs.

@mjkpolo

mjkpolo commented Feb 1, 2024

I got near 100% utilization across 8 AMD MI100 GPUs (gfx908), but I had to pass -ngl, and -sm row gave better performance (I need to learn what the difference is between layer and row). For -ngl I made a guess, and then it told me

llm_load_tensors: offloaded X/81 layers to GPU

so I used -ngl 81. Is there a way to have it offload the maximum number of layers possible? It feels weird to have to make a guess first, since it seems Facebook's llama framework does this by default.
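
(One thing I might try, assuming -ngl is simply clamped to the real layer count rather than rejected, which I haven't verified, is to pass an intentionally oversized value:)

./main -sm row -m models/ggml-model-f16.gguf -ngl 999 -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e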

@jart
Contributor

jart commented Feb 1, 2024

But does it actually work? Am I correct in my understanding that llama.cpp is able to run an LLM on multiple AMD GPUs so long as they're the AMD Instinct HPC enterprise cards? I have 2x Radeon RX 7900 XTX because they're the cheapest cards AMD supports using on Linux.

$ ./main -m ~/weights/mistral-7b-instruct-v0.2.Q5_K_M.gguf -p 'hello world' --temp 0 -ngl 33
Log start
main: build = 2039 (d71ac909)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 1706805764
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
  Device 1: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
...
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  2495.28 MiB
llm_load_tensors:      ROCm1 buffer size =  2311.77 MiB
llm_load_tensors:        CPU buffer size =    85.94 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =    34.00 MiB
llama_kv_cache_init:      ROCm1 KV buffer size =    30.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  ROCm_Host input buffer size   =     9.01 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =    80.30 MiB
llama_new_context_with_model:      ROCm1 compute buffer size =    89.10 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 5

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 hello world▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

@mjkpolo

mjkpolo commented Feb 1, 2024

It worked for me, but I was using a Llama 2 70B model.

$ salloc -p mi1008x -N 1 -t 00:30:00
salloc: ---------------------------------------------------------------
salloc: AMD HPC Fund Job Submission Filter
salloc: ---------------------------------------------------------------
salloc: --> ok: runtime limit specified
salloc: --> ok: using default qos
salloc: --> ok: Billing account-> sinclair/mkurzynski
salloc: --> checking job limits...
salloc:     --> requested runlimit = 0.5 hours (ok)
salloc: --> checking partition restrictions...
salloc:     --> ok: partition = mi1008x
salloc: --> checking job size restrictions...
salloc:     --> requested nodes = 1 (ok)
salloc: Granted job allocation 18456
mkurzynski@t007-009:/work1/sinclair/mkurzynski/github/llama.cpp$ ./main -sm row -m models/ggml-model-f16.gguf -ngl 61 -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Log start
main: build = 2006 (fbf1ddec)
main: built with cc (GCC) 12.2.0 for x86_64-pc-linux-gnu
main: seed  = 1706805963
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 ROCm devices:
  Device 0: , compute capability 9.0, VMM: no
  Device 1: , compute capability 9.0, VMM: no
  Device 2: , compute capability 9.0, VMM: no
  Device 3: , compute capability 9.0, VMM: no
  Device 4: , compute capability 9.0, VMM: no
  Device 5: , compute capability 9.0, VMM: no
  Device 6: , compute capability 9.0, VMM: no
  Device 7: , compute capability 9.0, VMM: no
llama_model_loader: loaded meta data with 15 key-value pairs and 723 tensors from models/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 80
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type  f16:  562 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 128.48 GiB (16.00 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.83 MiB
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloaded 61/81 layers to GPU
llm_load_tensors:      ROCm0 buffer size =     3.81 MiB
llm_load_tensors:        CPU buffer size = 32009.22 MiB
llm_load_tensors: ROCm_Split buffer size = 99552.00 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   122.00 MiB
llama_kv_cache_init:  ROCm_Host KV buffer size =    38.00 MiB
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_new_context_with_model:  ROCm_Host input buffer size   =    17.01 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   177.10 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =   158.40 MiB
llama_new_context_with_model: graph splits (measure): 5

system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Building a website can be done in 10 simple steps:
Step 1: Determine the purpose of your website. What do you want to achieve with it? Who is your target audience?
Step 2: Choose a domain name and web hosting service. Your domain name should be easy to remember and reflect the purpose of your website. You’ll also need to choose a web host, which provides space on their servers for your site files.
Step 3: Design or purchase a template for your website. There are many free templates available online that you can customize to fit your needs. Alternatively, you could hire a designer to create something unique for you.
Step 4: Write content for each page of your website using HTML code (or use an editor like Dreamweaver). This includes both text and images/graphics as appropriate. Make sure all pages have titles and meta tags so they’ll show up correctly in search engines when people are searching for information about topics related to what you offer on your site!
Step 5: Add functionality such as forms, contact pages, etc., if desired (this may require additional programming knowledge).
Step 6: Test everything thoroughly before launching publicly – make sure all links work properly and there aren’t any broken images/graphics anywhere.
Step 7: Launch your finished product! Promote it via social media channels like Facebook and Twitter so people know about it right away.
Step 8: Monitor traffic statistics regularly using analytics tools such as Google Analytics – this will help give insight into how well things are working overall plus identify areas where improvement might be needed down the road.
Step 9: Keep updating content regularly with fresh material relevant to current events/trends related specifically towards whatever niche market(s) you’re targeting through promotion efforts mentioned earlier above point number seven here listed below now underlined for emphasis purposes only please remember that these steps must be followed carefully in order for success online so don’t
llama_print_timings:        load time =   60713.42 ms
llama_print_timings:      sample time =      79.83 ms /   400 runs   (    0.20 ms per token,  5010.77 tokens per second)
llama_print_timings: prompt eval time =    1048.03 ms /    19 tokens (   55.16 ms per token,    18.13 tokens per second)
llama_print_timings:        eval time =  125884.01 ms /   399 runs   (  315.50 ms per token,     3.17 tokens per second)
llama_print_timings:       total time =  127151.14 ms /   418 tokens
Log end

@jart
Contributor

jart commented Feb 1, 2024

@mjkpolo In your case it only appears to be assigning work to the first GPU.

@mjkpolo

mjkpolo commented Feb 1, 2024

@jart How can you tell? I accidentally used -ngl 61 instead of -ngl 81, but when I spam rocm-smi I see this:

========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    42.0c           161.0W  1502Mhz  1200Mhz  0%   auto  290.0W   54%   99%
1    41.0c           151.0W  1502Mhz  1200Mhz  0%   auto  290.0W   53%   99%
2    43.0c           165.0W  1502Mhz  1200Mhz  0%   auto  290.0W   53%   99%
3    41.0c           161.0W  1502Mhz  1200Mhz  0%   auto  290.0W   53%   99%
4    39.0c           157.0W  1502Mhz  1200Mhz  0%   auto  290.0W   53%   97%
5    40.0c           159.0W  1502Mhz  1200Mhz  0%   auto  290.0W   53%   96%
6    40.0c           152.0W  1502Mhz  1200Mhz  0%   auto  290.0W   53%   95%
7    42.0c           153.0W  1502Mhz  1200Mhz  0%   auto  290.0W   53%   88%
====================================================================================
=============================== End of ROCm SMI Log ================================

@jart
Contributor

jart commented Feb 1, 2024

Notice how it says on mine:

llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  ROCm_Host input buffer size   =     9.01 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =    80.30 MiB
llama_new_context_with_model:      ROCm1 compute buffer size =    89.10 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 5

Versus yours:

llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_new_context_with_model:  ROCm_Host input buffer size   =    17.01 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   177.10 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =   158.40 MiB
llama_new_context_with_model: graph splits (measure): 5

If all your GPUs are being utilized, then it's probably because your GPUs are capable of presenting themselves to llama.cpp as a single unified device. I mean, I wish I had your computer. But I've just got separate cards plugged into two PCIe slots on a consumer PC. I know llama.cpp is capable of manually splitting the work across multiple cards for NVIDIA; I've seen it happen. I'd just like for it to be able to do that with AMD too.

@mjkpolo

mjkpolo commented Feb 1, 2024

Ohhh I see that's really interesting and good to keep in mind, thx!

@mjkpolo

mjkpolo commented Feb 1, 2024

@jart Update: I tried with -sm layer and I see this (and the output isn't garbled):

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =    22.00 MiB
llama_kv_cache_init:      ROCm1 KV buffer size =    20.00 MiB
llama_kv_cache_init:      ROCm2 KV buffer size =    20.00 MiB
llama_kv_cache_init:      ROCm3 KV buffer size =    20.00 MiB
llama_kv_cache_init:      ROCm4 KV buffer size =    20.00 MiB
llama_kv_cache_init:      ROCm5 KV buffer size =    20.00 MiB
llama_kv_cache_init:      ROCm6 KV buffer size =    20.00 MiB
llama_kv_cache_init:      ROCm7 KV buffer size =    18.00 MiB
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_new_context_with_model:  ROCm_Host input buffer size   =    17.01 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   159.50 MiB
llama_new_context_with_model:      ROCm1 compute buffer size =   177.10 MiB
llama_new_context_with_model:      ROCm2 compute buffer size =   177.10 MiB
llama_new_context_with_model:      ROCm3 compute buffer size =   177.10 MiB
llama_new_context_with_model:      ROCm4 compute buffer size =   177.10 MiB
llama_new_context_with_model:      ROCm5 compute buffer size =   177.10 MiB
llama_new_context_with_model:      ROCm6 compute buffer size =   177.10 MiB
llama_new_context_with_model:      ROCm7 compute buffer size =   177.10 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    17.60 MiB
llama_new_context_with_model: graph splits (measure): 17

@ccbadd

ccbadd commented Feb 1, 2024

One of my setups has 2x MI100s and 2x W6800s, and all four work fine, but the MI100s are a lot slower than the W6800s. I had never seen the -sm switch; I'll give it a try in a while to see if it helps.

@ggerganov
Member

By default the -sm switch is set to "layer", which splits the model at the layer level. This is in contrast to "row", which was the only option until recently and which splits each layer across the GPUs. You generally want to use -sm layer, as it will yield better performance in most cases.

I believe multi-GPU should also work with AMD hardware, but I haven't tested it. It does work with NVIDIA GPUs. There could be some ROCm-related issues, but maybe try a few different models first to make sure it is not a model problem.
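
To make the two modes concrete (the model path and -ngl value here are just placeholders):

# default: assign whole layers to the GPUs
./main -m model.gguf -ngl 99 -sm layer -p "hello world"

# previous behaviour: split each layer's rows across the GPUs
./main -m model.gguf -ngl 99 -sm row -p "hello world"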

@morphles

morphles commented Feb 4, 2024

I'm also getting garbage when both of my dual 7900 XTX are used. I thought maybe it was the models I was trying, so I tried an old one I know worked on my 4080 machine, but it also gives gibberish when run with both 7900s. A single card (via HIP_VISIBLE_DEVICES=x) works OK. But I want to run larger models :) -sm row does not help.

One thing I must say: I'm using a desktop-class CPU and motherboard (Ryzen 5950X), which does not have enough PCIe lanes to connect both cards to the CPU. PyTorch basically refused to run at all across two cards last time I tried. But I'm not sure if that matters for llama.cpp?

@francis2tm
Contributor

@morphles Did you manage to make it work on 2x 7900 XTX? Have you tried Mixtral?

@ccbadd

ccbadd commented Mar 2, 2024

In my setup I was able to use all 4 cards for a single model, but now that the layer/row split has been implemented it no longer works properly. The two MI100s need the -sm option to be fast(er), but the W6800s will not work with that set. Shouldn't the W6800s work with either layer or row split?

@wizd

wizd commented Mar 13, 2024

Any progress? My dual 7900 XTX running Ollama just outputs '#############....'

@morphles

So I got back to checking the AI stuff, and it seems that with hipBLAS my dual 7900 XTX still produce garbage. And Vulkan for now also seems not to work across multiple GPUs?

@ccbadd

ccbadd commented Mar 20, 2024

Vulkan does work for multiple GPUs; I have tested it with dual A770s and dual W6800s. Not all quant and model types are supported yet, so make sure you are not trying to use an unsupported model (Mixtral, for instance) or an unsupported quant. Try a simple Q4_0 quant and you will see.
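
(If you don't have a Q4_0 file handy, the quantize tool that builds alongside main can produce one; the file names here are just examples:)

./quantize models/ggml-model-f16.gguf models/ggml-model-q4_0.gguf q4_0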

@morphles

@ccbadd Yeah, I managed to make it work; it even worked on the more novel Command R model. And yeah, I saw that Mixtral did not work for now. I just needed an env var to make both GPUs visible.
Super happy now :) Finally my dual AMD setup is sort of paying off.

@Speedway1

I am running MixTAO-7Bx2-MoE-v8.1 and just get a bunch of garbage (#########) coming out, per the above poster. I assume this is because it's based on Mistral too?

I've got 2x 7900 XTX.

@slaren
Member

slaren commented Apr 7, 2024

Try the flag described here if you have issues with AMD multi-GPU:
#6208

@Speedway1

Try the flag described here if you have issues with AMD multi-GPU: #6208

Thank you for the super-fast reply and also for the work that you are doing on this project. Really appreciate it.

I recompiled with the flag set, and now it "hangs" for several minutes before failing with this CUDA error (not surprising, as it's AMD, not NVIDIA):

llm_load_tensors: ROCm0 buffer size = 12784.80 MiB
llm_load_tensors: ROCm1 buffer size = 11530.72 MiB
llm_load_tensors: CPU buffer size = 250.00 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 272.00 MiB
llama_kv_cache_init: ROCm1 KV buffer size = 240.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: ROCm0 compute buffer size = 372.02 MiB
llama_new_context_with_model: ROCm1 compute buffer size = 372.02 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 40.02 MiB
llama_new_context_with_model: graph nodes = 1638
llama_new_context_with_model: graph splits = 3
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: shared object initialization failed
current device: 0, in function ggml_cuda_compute_forward at /data/home/llamacpp/llama.cpp/ggml-cuda.cu:2212
err
GGML_ASSERT: /data/home/llamacpp/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)

I tried several ways to build:
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -H. -Bbuild -DLLAMA_HIPBLAS=ON -DLLAMA_CUDA_NO_PEER_COPY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -- -j 16

make clean && LLAMA_HIPBLAS=1 LLAMA_CUDA_NO_PEER_COPY=1 make -j

Etc.

@Speedway1

Hmm... this may have been a context window thing, because after another build I now have it working. Your patch was the secret sauce. Thank you so much for doing that patch.

I dropped the context window down to 4096.

I also did a rebuild with this command line:

make clean && LLAMA_HIPBLAS=1 LLAMA_CUDA_NO_PEER_COPY=1 make -j 16

In case it helps anyone else, this is how I am running it:

./main -m ../models/MixTAO-7Bx2-MoE-v8.1.gguf -n 256 -c 4096 --interactive-first --repeat_penalty 1.0 --color -i -ngl 33

@Speedway1

Check https://rocm.docs.amd.com/projects/HIP/en/latest/user_guide/faq.html#i-have-multiple-hip-enabled-devices-and-i-am-getting-an-error-code-hiperrorsharedobjectinitfailed-with-the-message-error-shared-object-initialization-failed

Oh, that's very helpful, thank you!
Maybe it picked up the devices the second time around, but I will play around with offload_arch if I have any other build issues.
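
For the record, a build that pins the offload architecture might look something like this; the AMDGPU_TARGETS variable name and the gfx1100 target for the 7900 XTX are my assumptions rather than something I've verified:

# NOTE: the variable name and gfx target below are assumptions; check the build docs for your version
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -H. -Bbuild -DLLAMA_HIPBLAS=ON -DLLAMA_CUDA_NO_PEER_COPY=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release && cmake --build build -- -j 16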

Thank you once again. Really appreciate it.

@fraschm1998

@Speedway1 @ccbadd How's multi-card support now that ROCm 6.1 has been released? I currently have a 3090 and am considering either getting dual 7900 XTX or dual MI100s, or a single 4090. What are your speeds like? And any issues using faster-whisper or alternatives?

@github-actions github-actions bot added the stale label May 28, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
