Implement and use cuda graph plans #16548
Conversation
| One thing to mention here is that on HIP the generation of the graph is relatively more expensive than on CUDA, which makes the removal of disable_due_to_too_many_updates have a perhaps outsized negative effect there. I will try to find the time to investigate. | 
| @AesSedai Could you give gpt-oss-120b a try? I found that for some reason this PR applied to latest master works much slower for GLM 4.5 Air. I will need to investigate further. My testing shows gpt-oss-120b still has the same speedup. @IMbackK My wild guess is that if the graph gets reused a few times it should pay off the graph-building overhead. Without this PR it is understandable to have this  | 
| I'll have to download it and set it up, but I should be able to run it later tonight. | 
| Looks like I screwed up the previous commits as the graph update logic is completely broken. I can't even reproduce my previous results :( Anyway, the issues should now be fixed. Refactored a bit and added back the  @AesSedai @DocShotgun Thank you for your tests! Please check if this commit (3afbd9f) works. Performance comparison | 
| Interesting. Seems like it's potentially beneficial in certain cases but not in others (and potentially a regression in some cases). Sadly not the free lunch I was hoping for lol. | 
| 
 I just rebuilt fresh and did a 3-way comparison between mainline lcpp master, this PR, and ik_llama.cpp@main (including ikawrakow/ik_llama.cpp#829, which helps a lot with -ooae over there). tl;dr: this PR is possibly giving a bit over +2% TG over master with basically the same PP in quick llama-sweep-bench testing on my hybrid GPU+CPU gaming rig.

👈 Details

model=/mnt/ai/models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -fa 1 \
  -c 20480 \
  -ub 4096 -b 4096 \
  -ngl 99 \
  -ot "blk\.(0|1|2|3|4|5|6|7|8)\.ffn.*=CUDA0" \
  -ot "ffn_(up|gate|down)_exps\.weight=CPU" \
  --threads 16 \
  --no-mmap
# for ik_llama.cpp add
  -fmoe \
  -ooae \
  --warmup-batch

Builds compared:
- ik_llama.cpp/main@dbfd1515 -fmoe -ooae --warmup-batch
- llama.cpp/master@683fa6ba
- llama.cpp/wishstudio:cuda_graph_plan PR16548 3afbd9f rebased to 683fa6b + ug/port-sweep-bench
 | 
| Super interesting PR! Off-topic to this PR itself, but relevant to the discussion: What is  | 
| @DocShotgun I guess it's expected that this optimization is model- and machine-specific 😄 @DocShotgun @AesSedai @ubergarm Looks like all of you are running this on Linux and I'm running Win11. PP performance is very unstable in my tests but it looks much more stable in yours. During my tests I also see lots of random latencies and slowdowns during kernel launches. This could explain why this PR removes much more overhead on my system than on yours. But anyway, even if this PR does not help much on your systems, I didn't expect it to cause any slowdown. Theoretically the only thing this PR changes on these MoE models is making a consolidated  @aendk I spent like 10 minutes searching for it and ended up with an example in ik_llama. | 
| 
 I can confirm that kernel launch overheads are significantly more expensive and more variable on Win11 compared to Linux for NVIDIA GPUs, due to their batched dispatch via WDDM2. | 
| I've heard the same thing about kernel launch overhead being significantly higher on Windows than Linux. I notice you mention in the OP that you found building a graph and executing it only once to be always faster than calling the kernels individually - perhaps the lower kernel launch overhead on Linux makes this not necessarily true there? | 
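To make the effect being discussed here concrete, below is a hypothetical micro-benchmark sketch (not code from this PR): it launches the same trivial kernel N times directly, then captures the same N launches into a CUDA graph that is instantiated and launched exactly once. Whether the single-shot graph path wins depends mostly on per-launch overhead, which is typically higher under WDDM on Windows than on Linux; `tiny_kernel` and the launch count are arbitrary placeholders.

```cpp
// Hypothetical micro-benchmark sketch: N individual launches vs. capture + one graph launch.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void tiny_kernel(float * x) { x[0] += 1.0f; }

int main() {
    const int n_launches = 200;
    float * d = nullptr;
    cudaMalloc(&d, sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    float ms = 0.0f;

    // Path A: individual launches, paying launch overhead n_launches times.
    cudaEventRecord(beg, stream);
    for (int i = 0; i < n_launches; ++i) {
        tiny_kernel<<<1, 1, 0, stream>>>(d);
    }
    cudaEventRecord(end, stream);
    cudaEventSynchronize(end);
    cudaEventElapsedTime(&ms, beg, end);
    printf("individual launches:            %.3f ms\n", ms);

    // Path B: capture once, instantiate, launch the whole graph a single time.
    cudaGraph_t graph;
    cudaGraphExec_t instance;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < n_launches; ++i) {
        tiny_kernel<<<1, 1, 0, stream>>>(d);
    }
    cudaStreamEndCapture(stream, &graph);
    cudaEventRecord(beg, stream);
    // Legacy 5-argument form; CUDA 12 also offers a flags-only overload.
    cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
    cudaGraphLaunch(instance, stream);
    cudaEventRecord(end, stream);
    cudaEventSynchronize(end);
    cudaEventElapsedTime(&ms, beg, end);
    printf("capture + single graph launch:  %.3f ms\n", ms);

    cudaGraphExecDestroy(instance);
    cudaGraphDestroy(graph);
    cudaEventDestroy(beg);
    cudaEventDestroy(end);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

The timings are illustrative only (no warmup, includes host-side instantiation time by design), but running it on a Windows/WDDM box vs. a Linux box should show how much of the difference reported in this thread is pure launch overhead.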
| 
 It's a long story, but I believe  Some folks use it to sample PP/TG speeds across a given KV-cache depth and use vibe-coded Python scripts to plot/visualize the results across multiple runs. I'd be happy to see it added officially but haven't fixed up the arguments enough to open a proper PR. Go ask over on the https://huggingface.co/BeaverAI discord if you want more details or example usage. | 
| 
 Another reason to switch to Linux 🥲 
 Could be. But in the last commit I reintroduced the  | 
| I tested this on Windows, on GLM 4.6, on a 5090 + 9950X3D. I saw an average loss of about 8% compared to mainline, taking me from 5.2 to 4.7 tok/s using this quant. | 
While I welcome the proposal, I feel there are issues that still need to be addressed, namely:
- stricter equivalence checking when updating graph plans, and
- fixing the logic in the CUDA backend
To shed some light on the perf gains, here are some profiles (with a locally hacked fix of point 2 raised above), where I offloaded a single FFN with
sudo nsys profile --output=gp_offload --trace=cuda,nvtx,osrt --sample=cpu --gpu-metrics-devices=all --cuda-graph-trace=node ./build-x64-linux-gcc-reldbg/bin/llama-cli -m models/gemma-3n-E2B-it-Q8_0.gguf -fa 1 -p "Hello World" --top-k 1 -ot "blk\.1.ffn*=CPU" --no-conversation -n 10
 
One can see that CUDA graphs nicely speed up the GPU portion of the workload (CUDA kernels run 100% of the time, as we eliminate kernel-launch overheads). However, the CPU workload does not profit at all. Given that we are on heterogeneous systems, optimizing the GPU workload may actually induce thermal throttling on the CPU, as the CPU will be active more frequently in a given timeframe. Thus, slowdowns can occur in E2E tests.
More performant approaches for models that are too large to fit in VRAM would be ones where weights are streamed asynchronously onto the GPU (e.g. while previous work is running either on the GPU or the CPU). Such approaches are, however, also more complex.
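As an illustration of that async-streaming idea (a sketch under my own assumptions, not something proposed concretely in this PR): while the GPU computes layer i, the weights for layer i+1 are copied from pinned host memory on a separate copy stream, and an event gates the compute stream. `layer_buffers`, `run_layer` and the per-layer staging buffers are hypothetical placeholders.

```cpp
// Sketch: overlap weight uploads with compute using a copy stream and events.
#include <cuda_runtime.h>

struct layer_buffers {
    void * host_pinned; // pinned (cudaHostAlloc'd) weights for one layer
    void * device;      // per-layer device staging buffer
    size_t size;
};

void stream_layers(layer_buffers * layers, int n_layers,
                   cudaStream_t compute_stream, cudaStream_t copy_stream,
                   void (*run_layer)(int layer, void * weights, cudaStream_t stream)) {
    cudaEvent_t copied[2];
    cudaEventCreateWithFlags(&copied[0], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&copied[1], cudaEventDisableTiming);

    // Prefetch layer 0.
    cudaMemcpyAsync(layers[0].device, layers[0].host_pinned, layers[0].size,
                    cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(copied[0], copy_stream);

    for (int i = 0; i < n_layers; ++i) {
        // Kick off the copy of the next layer while the current one computes.
        if (i + 1 < n_layers) {
            cudaMemcpyAsync(layers[i + 1].device, layers[i + 1].host_pinned,
                            layers[i + 1].size, cudaMemcpyHostToDevice, copy_stream);
            cudaEventRecord(copied[(i + 1) % 2], copy_stream);
        }
        // Compute may only start once this layer's weights have arrived.
        cudaStreamWaitEvent(compute_stream, copied[i % 2], 0);
        run_layer(i, layers[i].device, compute_stream);
    }
    cudaStreamSynchronize(compute_stream);
    cudaEventDestroy(copied[0]);
    cudaEventDestroy(copied[1]);
}
```

Whether this actually beats running the offloaded layers on the CPU depends on PCIe bandwidth vs. CPU matmul throughput, which is presumably why it is flagged above as the more complex option.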
if (sched->splits[i].backend_id != sched->plans[i].backend_id) {
    create_new_plans = true;
    break;
}
IMO, we should actually check for the full equivalence of the generated splits (i.e. the ggml_cgraph objects contained in the splits & plans, instead of just their backends).
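For concreteness, a stricter check along these lines could compare the cached split's graph with the new one node by node. The following is only a sketch of what "full equivalence" might mean (the helper `cgraph_equivalent` is hypothetical, not code from this PR), using the public ggml graph accessors:

```cpp
// Sketch: compare two split graphs node-by-node (op, type, shape).
#include "ggml.h"

static bool cgraph_equivalent(struct ggml_cgraph * a, struct ggml_cgraph * b) {
    if (ggml_graph_n_nodes(a) != ggml_graph_n_nodes(b)) {
        return false;
    }
    for (int i = 0; i < ggml_graph_n_nodes(a); ++i) {
        struct ggml_tensor * ta = ggml_graph_node(a, i);
        struct ggml_tensor * tb = ggml_graph_node(b, i);
        if (ta->op != tb->op || ta->type != tb->type) {
            return false;
        }
        for (int d = 0; d < GGML_MAX_DIMS; ++d) {
            if (ta->ne[d] != tb->ne[d]) {
                return false;
            }
        }
    }
    return true;
}
```

A real check would likely also need to look at tensor sources, op params and view offsets; the sketch only shows the general shape of the comparison.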
The current check is a bare minimum for ensuring correctness. It basically says: if the number of splits or the backend of any split has changed, we rebuild everything. If only the interior of a split has changed, the change is delegated to the backend's graph update routine (the code is just in the next paragraph below this check). In the current CUDA backend, if the graph change is too large, cudaGraphExecUpdate will fail and cudaGraphInstantiate will be called to reinstantiate a new executable graph. I think this will not generate too much overhead, as the cudaGraphInstantiate is inevitable and only the cudaGraphExecUpdate call can be saved.
In order to improve this, I guess something like the enhanced API in #14514 needs to be implemented. It passes more information to help the backend make an informed decision on recreate vs. update. Anyway, I believe most of the scaffolding is missing, so this would be out of scope for this PR.
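For reference, the update-then-fall-back flow described here boils down to roughly the following sketch (assuming the CUDA 12 signature of cudaGraphExecUpdate; the pre-12 API takes an error node and a result enum instead of cudaGraphExecUpdateResultInfo; names are illustrative, not this PR's actual code):

```cpp
// Sketch: try a cheap in-place update; if that fails, reinstantiate the executable graph.
#include <cuda_runtime.h>

static void update_or_reinstantiate(cudaGraphExec_t * instance, cudaGraph_t graph) {
    cudaGraphExecUpdateResultInfo info;
    cudaError_t stat = cudaGraphExecUpdate(*instance, graph, &info);
    if (stat != cudaSuccess) {
        // The topology changed too much for an update: clear the error state,
        // drop the old executable graph, and pay the instantiation cost once.
        cudaGetLastError();
        cudaGraphExecDestroy(*instance);
        cudaGraphInstantiate(instance, graph, NULL, NULL, 0);
    }
}
```

This matches the argument above: when the split really changed, cudaGraphInstantiate is unavoidable, so the only thing a smarter scheduler-level check could save is the failed cudaGraphExecUpdate attempt.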
By the way, if the split graph is indeed identical, the CUDA backend will find out via the is_cuda_graph_update_required fast path. I don't think it's a good idea to move the checks up to the backend scheduler, as different backends could have different definitions of equivalence depending on various factors, as discussed in #14514. We may calculate these factors in the scheduler, but the ultimate decision is still better made in the backend.
| Added the  | 
| Looks like there is a nasty destruction-order bug. This PR makes the backend sched depend on the backend for destruction, but in llama_context their order is unspecified. Not sure about the best way to resolve this. | 
| 
 @ggerganov @slaren Would you mind taking a look at this matter, as well as this PR in its entirety? Based on my limited knowledge, I think the best way to fix this issue is to create a backend-agnostic interface class for graph plans and move related APIs except  | 






This PR implements the graph plan APIs for the CUDA backend, and adds code in ggml-backend.cpp to actually use the graph plan APIs when a backend supports them.
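For context, the public entry points involved are the ones declared in ggml-backend.h (graph_plan_create / graph_plan_compute / graph_plan_free). A minimal caller-side sketch, assuming the backend implements graph plans, could look like this; `run_graph_twice` is just an illustrative helper, not part of this PR:

```cpp
// Sketch: build a graph plan once, execute it repeatedly, then free it.
#include "ggml-backend.h"

static enum ggml_status run_graph_twice(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
    // Plan creation is where a CUDA-graph-based backend would roughly pay the
    // capture/instantiation cost ...
    ggml_backend_graph_plan_t plan = ggml_backend_graph_plan_create(backend, cgraph);
    // ... after which executing the plan is a single cheap launch that can be repeated.
    enum ggml_status status = ggml_backend_graph_plan_compute(backend, plan);
    if (status == GGML_STATUS_SUCCESS) {
        status = ggml_backend_graph_plan_compute(backend, plan);
    }
    ggml_backend_graph_plan_free(backend, plan);
    return status;
}
```

The scheduler-side reuse and update logic added by this PR sits behind these calls and is not shown here.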
The main functional improvement is support for CUDA graphs when the graph is split (e.g. for hybrid inference). Currently the graph update and reuse logic (ggml_backend_sched_update_plans) is a simple heuristic: only try updating previous graphs when the number of splits and their corresponding backends are the same as in the previous run. As the benchmark shows, this universally accelerates hybrid-inference TG performance by up to 30%.

The CUDA graph execution code is refactored and cleaned up. Two of the three original graph-plan fail paths are removed: disable_due_to_failed_graph_capture and disable_due_to_too_many_updates. The former is removed because I found no code setting it to true. The latter is removed because I currently have no idea about its semantics in a split-graph scenario, but removing it does not seem to degrade performance at all. Interestingly, I found that on my rig, even repeatedly building a graph and then executing it only once is always faster than calling the kernels individually. I suspect this is why performance increased in tests even for CUDA-only workloads, even though this PR's optimization does not target them. This of course needs to be verified on more hardware configurations.

Performance comparison:
RTX 5090 + 13700k, 128GB 6400 MT/s RAM