
rpc : use backend registry, support dl backends #13304


Merged
merged 4 commits into master from sl/rpc-dl-backend on May 4, 2025

Conversation

slaren
Member

@slaren slaren commented May 4, 2025

  • Adds support for GGML_BACKEND_DL
  • Adds -d, --device option to select the device to use with the RPC server (see the sketch below)
  • Moves CPU memory detection code to CPU backend
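
For reference, a minimal sketch of how these pieces fit together, using the public backend registry API from ggml-backend.h. The helper function, the fall-back to the first registered device, and the exact log output are illustrative assumptions, not the actual rpc-server implementation:

```cpp
// Sketch: resolve a "-d/--device" argument through the ggml backend registry.
// Registry calls (ggml_backend_load_all, ggml_backend_dev_*) are the public
// API from ggml-backend.h; everything around them is assumed for illustration.
#include "ggml-backend.h"
#include <cstdio>

static ggml_backend_t init_backend_for_device(const char * dev_name) {
    // with GGML_BACKEND_DL, backends are shipped as dynamic libraries and
    // discovered/loaded at runtime instead of being linked in
    ggml_backend_load_all();

    ggml_backend_dev_t dev = nullptr;
    if (dev_name == nullptr) {
        // no -d given: fall back to the first registered device (assumption)
        dev = ggml_backend_dev_get(0);
    } else {
        dev = ggml_backend_dev_by_name(dev_name);
        if (dev == nullptr) {
            fprintf(stderr, "unknown device '%s', available devices:\n", dev_name);
            for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
                ggml_backend_dev_t d = ggml_backend_dev_get(i);
                fprintf(stderr, "  %s: %s\n",
                        ggml_backend_dev_name(d), ggml_backend_dev_description(d));
            }
            return nullptr;
        }
    }

    // memory is reported by the selected device, so the RPC server no longer
    // needs its own CPU memory detection code
    size_t free_mem = 0, total_mem = 0;
    ggml_backend_dev_memory(dev, &free_mem, &total_mem);
    printf("using device %s (%zu MB free / %zu MB total)\n",
           ggml_backend_dev_name(dev), free_mem / (1024*1024), total_mem / (1024*1024));

    return ggml_backend_dev_init(dev, /*params =*/ nullptr);
}
```

Device names come from the registry (typically strings such as CPU or CUDA0), so on a multi-GPU host each RPC server instance can be pinned to a single device by name.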

@github-actions github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels on May 4, 2025
@slaren slaren force-pushed the sl/rpc-dl-backend branch 3 times, most recently from 314ccd7 to afa429a on May 4, 2025 at 15:02
@slaren slaren force-pushed the sl/rpc-dl-backend branch from afa429a to 07da432 on May 4, 2025 at 15:04
Collaborator

@rgerganov rgerganov left a comment

I am not able to test right now, but the changes look fine.

Note that we should move the "Starting RPC server vX.Y.Z" message into ggml_backend_rpc_start_server(), as we no longer know in main() which version we are starting. I can fix this in a follow-up patch.
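
A minimal sketch of that follow-up, under the assumption that the RPC code exposes protocol version macros (the RPC_PROTO_*_VERSION names below are placeholders for illustration): the banner is printed by the server-side code called from ggml_backend_rpc_start_server() rather than by main(), so it always matches the server implementation that is actually linked in.

```cpp
// Sketch only: move the "Starting RPC server vX.Y.Z" banner into the server code.
#include <cstddef>
#include <cstdio>

// placeholder version macros so the sketch compiles standalone; the real
// values would come from the RPC backend headers
#ifndef RPC_PROTO_MAJOR_VERSION
#define RPC_PROTO_MAJOR_VERSION 1
#define RPC_PROTO_MINOR_VERSION 0
#define RPC_PROTO_PATCH_VERSION 0
#endif

// intended to be called from inside ggml_backend_rpc_start_server(), not from
// main(), since only the server knows which protocol version it implements
static void print_server_banner(const char * endpoint, size_t free_mem, size_t total_mem) {
    printf("Starting RPC server v%d.%d.%d\n",
           RPC_PROTO_MAJOR_VERSION, RPC_PROTO_MINOR_VERSION, RPC_PROTO_PATCH_VERSION);
    printf("  endpoint     : %s\n", endpoint);
    printf("  device memory: %zu MB free, %zu MB total\n",
           free_mem / (1024*1024), total_mem / (1024*1024));
}
```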

@slaren slaren merged commit 9fdfcda into master May 4, 2025
45 checks passed
@slaren slaren deleted the sl/rpc-dl-backend branch May 4, 2025 19:25
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 6, 2025
* origin/master: (27 commits)
llama : fix build_ffn without gate (ggml-org#13336)
CUDA: fix bad asserts for partial offload (ggml-org#13337)
convert : qwen2/3moe : set yarn metadata if present (ggml-org#13331)
CUDA: fix --split-mode row for MMQ (ggml-org#13323)
gguf-py : avoid requiring pyside6 for other scripts (ggml-org#13036)
CUDA: fix logic for clearing padding with -ngl 0 (ggml-org#13320)
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (ggml-org#13264)
server : Webui - change setText command from parent window to also send the message. (ggml-org#13309)
mtmd : rename llava directory to mtmd (ggml-org#13311)
clip : fix confused naming ffn_up and ffn_down (ggml-org#13290)
convert : bailingmoe : set yarn metadata if present (ggml-org#13312)
SYCL: Disable mul_mat kernels for noncontiguous tensor b (ggml-org#13308)
mtmd : add C public API (ggml-org#13184)
rpc : use backend registry, support dl backends (ggml-org#13304)
ggml : activate s390x simd for Q3_K (ggml-org#13301)
llava/mtmd : fixes to fully support dl backends (ggml-org#13303)
llama : build windows releases with dl backends (ggml-org#13220)
CUDA: fix race condition in MMQ stream-k fixup (ggml-org#13299)
CUDA: fix race condition in MMQ ids_dst (ggml-org#13294)
vulkan: Additional type support for unary, binary, and copy (ggml-org#13266)
...
@segmond

segmond commented May 12, 2025

I'm guessing this will be a huge change, but what would it take to make -d behave like it does in llama-cli and llama-server, so that instead of running N servers for N devices on a remote host, we run one server per remote node?

Labels
examples, ggml (changes relating to the ggml tensor library for machine learning)
3 participants