llama : quantize up to 31% faster on Linux with mmap #3206
Conversation
How does incorporating
When the 'quantize' script reads from disk, it normally has to load a whole tensor into memory before it can start converting it to f32 and quantizing it. This change allows the input tensor to be paged in on demand in 4096-byte chunks, so it can be read and converted simultaneously.
I tested this with 7B f16 to q4_0 on Windows and got ~15% faster times with mmap when the model is cached, and no difference when it is not cached. Under WSL2, mmap is consistently ~35% faster, cached or uncached. So I think mmap can be enabled on Windows too.
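To illustrate the idea, here is a minimal, self-contained sketch (not the actual llama.cpp quantize code; the raw-fp16 file layout, the conversion helper, and the checksum are assumptions made for the demo). The file is mapped once up front and the loop simply indexes the mapping, so each 4096-byte page is read from disk only when it is first touched and conversion overlaps with I/O:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Naive fp16 -> fp32 conversion, standing in for ggml's conversion routine.
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t expo = (h >> 10) & 0x1fu;
    const uint32_t mant =  h & 0x3ffu;
    uint32_t bits;
    if (expo == 0) {
        bits = sign;                               // flush subnormals to zero
    } else if (expo == 31) {
        bits = sign | 0x7f800000u | (mant << 13);  // inf / NaN
    } else {
        bits = sign | ((expo + 112u) << 23) | (mant << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s tensor.f16\n", argv[0]);
        return 1;
    }

    const int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

    // Map the file read-only; no data is read from disk yet.
    void * map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { std::perror("mmap"); return 1; }

    const uint16_t * data = static_cast<const uint16_t *>(map);
    const size_t n = st.st_size / sizeof(uint16_t);

    // The first touch of each 4096-byte page triggers the actual read, so
    // conversion proceeds while the kernel pages in the rest, instead of
    // waiting for the whole tensor to be loaded first.
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += fp16_to_fp32(data[i]);
    }
    std::printf("converted %zu fp16 values, checksum = %f\n", n, sum);

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```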
llama.cpp
Outdated
std::unique_ptr<llama_model_loader> ml(new llama_model_loader(fname_inp, /*use_mmap*/ false));
// mmap consistently increases speed on Linux, is inconsistent on macOS
// (possibly related to free memory), and has not been tested on Windows.
#ifdef __linux__
Suggested change:
#ifdef __linux__
#if defined(__linux__) || defined(_WIN32)
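For context, a rough sketch of what the hunk could look like with that suggestion applied (names such as `fname_inp` and `llama_model_loader` come from the quoted diff; the exact placement and the `use_mmap` variable are assumptions):

```cpp
    // mmap consistently increases speed on Linux, and on Windows it is either
    // faster or makes no difference; it is inconsistent on macOS (possibly
    // related to free memory), so it stays disabled there.
#if defined(__linux__) || defined(_WIN32)
    constexpr bool use_mmap = true;
#else
    constexpr bool use_mmap = false;
#endif

    std::unique_ptr<llama_model_loader> ml(new llama_model_loader(fname_inp, use_mmap));
```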
Let me run a few tests this week and we can merge.
On M1 Pro with 32GB, quantizing 13B with mmap enabled is ~2x slower, so let's leave mmap off on Mac until we figure out something that would always improve the performance, regardless of the model size.
…example

* 'master' of github.com:ggerganov/llama.cpp:
  ggml-cuda : perform cublas mat mul of quantized types as f16 (ggml-org#3412)
  llama.cpp : add documentation about rope_freq_base and scale values (ggml-org#3401)
  train : fix KQ_pos allocation (ggml-org#3392)
  llama : quantize up to 31% faster on Linux and Windows with mmap (ggml-org#3206)
  readme : update hot topics + model links (ggml-org#3399)
  readme : add link to grammars app (ggml-org#3388)
  swift : fix build on xcode 15 (ggml-org#3387)
  build : enable more non-default compiler warnings (ggml-org#3200)
  ggml_tensor: update the structure comments. (ggml-org#3283)
  ggml : release the requested thread pool resource (ggml-org#3292)
  llama.cpp : split llama_context_params into model and context params (ggml-org#3301)
  ci : multithreaded builds (ggml-org#3311)
  train : finetune LORA (ggml-org#2632)
  gguf : basic type checking in gguf_get_* (ggml-org#3346)
  gguf : make token scores and types optional (ggml-org#3347)
  ci : disable freeBSD builds due to lack of VMs (ggml-org#3381)
  llama : custom attention mask + parallel decoding + no context swaps (ggml-org#3228)
  docs : mark code as Bash (ggml-org#3375)
  readme : add Mistral AI release 0.1 (ggml-org#3362)
  ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggml-org#3370)
…l-org#3206)

* llama : enable mmap in quantize on Linux -> 31% faster
* also enable mmap on Windows

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This is a follow-up to #3115. It enables mmap for quantize on Linux, since no one seems to have reported a performance decrease on that platform. Windows has not been tested, and macOS has seen both a speed-up and a slow-down.