Name and Version
build: 4449 (8a1d9c2) with cc (Debian 13.3.0-11) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
./build/bin/llama-cli -m ds3-q8.gguf -t 128 --numa distribute -c 8192 -ngl 0 --interactive-first --chat-template deepseek3
Problem description & steps to reproduce
If I load a dense model, warmup works correctly and the whole model ends up in the OS cache.
However, if I load a big MoE (e.g. DeepSeek V3), warmup only loads a small portion of it (93 GB of 660 GB).
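For context, and as far as I can tell (this is a paraphrase from memory, not a verbatim copy), the stock warmup in common.cpp boils down to decoding a single BOS/EOS token once, so for an MoE only the few experts the router picks for that one token are ever read from disk:

```cpp
// Rough paraphrase of the existing warmup in common.cpp; details may differ
// between builds. A single token is pushed and decoded exactly once.
std::vector<llama_token> tmp;
if (bos != -1)   { tmp.push_back(bos); }
if (eos != -1)   { tmp.push_back(eos); }
if (tmp.empty()) { tmp.push_back(0);   }

if (llama_model_has_decoder(model)) {
    // one decode of one or two tokens -> only the top-k experts per layer are
    // touched, so only a small slice of a big MoE gets paged into OS cache
    llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch)));
}
```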
To test this, I made an inefficient brute-force patch to the warmup code in common.cpp, replacing the single llama_decode() call in the decoder branch with a loop that feeds 256 different single-token batches through the model:

    if (llama_model_has_decoder(model)) {
        printf("decoding warmup tokens.");
        for (int i = 1; i < 256; i++) {
            llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch)));
            tmp.clear();
            tmp.push_back(i);
            printf(".");
        }
    } else {
        LOG_WRN("No Decoder Present. Warmup impossible");
    }
    printf("\n");
The benefit falls off sharply with the number of llama_decode() calls: 256 calls get about 540 GB of the model loaded, 1024 calls about 620 GB (of 660 GB total).
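Given those diminishing returns, a variant I have not benchmarked would be to pack many distinct token ids into a single batch, so that one llama_decode() call makes n_batch routing decisions per layer instead of one. This is only a sketch against the same warmup code: the token ids 0..n_batch-1 are an arbitrary choice, and I am assuming llama_kv_cache_clear() is the right cleanup call on this build.

```cpp
// Untested sketch: one multi-token decode instead of many single-token ones.
// Every token in the batch gets its own routing decision, so a single call
// should touch far more experts. Assumes params.n_batch <= n_ctx and that the
// low token ids are valid for the model's vocab.
if (llama_model_has_decoder(model)) {
    const int32_t n_vocab = llama_n_vocab(model);

    std::vector<llama_token> warm;
    for (int32_t i = 0; i < (int32_t) params.n_batch && i < n_vocab; i++) {
        warm.push_back(i);
    }

    llama_decode(lctx, llama_batch_get_one(warm.data(), (int32_t) warm.size()));
    llama_kv_cache_clear(lctx); // warmup only - drop the KV entries again
}
```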
I think that, ideally, the warmup would detect the number of experts and route at least one token through each expert via the router (this may need something other than plain llama_decode() that is aware of the expert router?).
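As a rough starting point, the expert count could be read from the GGUF metadata and used to size the warmup. This is an untested sketch: get_expert_count() and n_warmup_tokens() are hypothetical helpers, the metadata keys ("general.architecture", "<arch>.expert_count") follow the usual llama.cpp GGUF naming but should be double-checked, and the 8x factor is a guess. Actually steering one token through each individual expert would still need router-aware hooks that, as far as I can tell, the public API does not expose.

```cpp
#include <cstdio>
#include <cstdlib>
#include "llama.h"

// Hypothetical helper: read the routed-expert count from the GGUF metadata.
// Returns 0 for dense models (no "<arch>.expert_count" key present).
static int32_t get_expert_count(const llama_model * model) {
    char arch[64] = {0};
    if (llama_model_meta_val_str(model, "general.architecture", arch, sizeof(arch)) < 0) {
        return 0;
    }
    char key[128];
    snprintf(key, sizeof(key), "%s.expert_count", arch);
    char val[32] = {0};
    if (llama_model_meta_val_str(model, key, val, sizeof(val)) < 0) {
        return 0;
    }
    return (int32_t) atoi(val);
}

// Hypothetical helper: scale the number of warmup decodes with the expert
// count instead of hard-coding 256 (the 8x factor is arbitrary, not tuned).
static int32_t n_warmup_tokens(const llama_model * model) {
    const int32_t n_expert = get_expert_count(model);
    return n_expert > 0 ? 8 * n_expert : 1;
}
```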
I could probably make a good PR for this with some guidance.
First Bad Commit
This has never worked, as far as I know.
Relevant log output
There is no relevant log output for this problem; it has to be observed by watching OS cache usage with an external tool.