
Commit e30a369

slaren authored and teleprint-me committed
llama : disable pipeline parallelism with nkvo (ggml-org#7265)
1 parent a94019b · commit e30a369

File tree

1 file changed (+5, -1)


llama.cpp

Lines changed: 5 additions & 1 deletion
```diff
@@ -15858,7 +15858,11 @@ struct llama_context * llama_new_context_with_model(
     ctx->buf_compute_meta.resize(ggml_tensor_overhead()*LLAMA_MAX_NODES + ggml_graph_overhead_custom(LLAMA_MAX_NODES, false));

     // enabling pipeline parallelism in the scheduler increases memory usage, so it is only done when necessary
-    bool pipeline_parallel = llama_get_device_count() > 1 && model->n_gpu_layers > (int)model->hparams.n_layer && model->split_mode == LLAMA_SPLIT_MODE_LAYER;
+    bool pipeline_parallel =
+        llama_get_device_count() > 1 &&
+        model->n_gpu_layers > (int)model->hparams.n_layer &&
+        model->split_mode == LLAMA_SPLIT_MODE_LAYER &&
+        params.offload_kqv;
 #ifndef GGML_USE_CUDA
     // pipeline parallelism requires support for async compute and events
     // currently this is only implemented in the CUDA backend
```
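The change adds `params.offload_kqv` to the conditions gating pipeline parallelism: when the KV cache is kept on the CPU (nkvo, i.e. no-KV-offload), pipeline parallelism is now left disabled even on a multi-GPU, fully offloaded setup. Below is a minimal sketch, not part of the commit, of how a caller reaches this new condition; it assumes the llama.cpp C API of this period (`llama_load_model_from_file`, `llama_new_context_with_model`), and the model path is a placeholder.

```cpp
#include "llama.h"

int main(void) {
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                      // offload all layers to the GPUs
    mparams.split_mode   = LLAMA_SPLIT_MODE_LAYER;  // split layers across devices

    // hypothetical model path, for illustration only
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.offload_kqv = false; // nkvo: keep the KV cache (and its ops) on the CPU

    // with this commit, offload_kqv == false makes llama_new_context_with_model()
    // leave pipeline_parallel false, even though the device count, n_gpu_layers,
    // and split mode checks above would all pass
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```

As the comment in the diff notes, pipeline parallelism increases the scheduler's memory usage, so it is only enabled when it can actually help; with the KV cache pinned to the CPU there is little to gain from it.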
