Issue description
When loading an 8B model in a project scaffolded with `npm create node-llama-cpp@latest`, VRAM usage saturates at 24GB.
Expected Behavior
It should only use ~8GB of VRAM.
Actual Behavior
Shouldn't this only use about 8GB of VRAM? I am using a Q8 quantization.
GPU memory usage starts at around 3GB, then jumps to 24GB.
Steps to reproduce
1. Scaffold a new app with the latest `npm create node-llama-cpp@latest`.
2. Run `npm install`, then `npm start`.
3. Load the 8B Llama model; VRAM usage climbs to 24GB.
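
For reference, the app loads the model following the standard node-llama-cpp v3 pattern; below is a minimal sketch of it. The model path is a placeholder for the Q8 8B GGUF, and the explicit contextSize cap is something I added while testing, not part of the generated template:

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();

// Placeholder path for the Q8 8B GGUF mentioned above
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "llama-3.1-8b.Q8_0.gguf")
});

// Explicit contextSize is my own addition, to test whether the extra
// VRAM comes from the KV cache; the scaffolded template doesn't set it.
const context = await model.createContext({
    contextSize: 8192
});

const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

console.log(await session.prompt("Hi there"));
```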
My Environment
OS: Windows 10.0.26100 (x64) (reported as Windows 10, but the machine actually runs Windows 11)
Node: 22.13.0 (x64)
TypeScript: 5.7.3
node-llama-cpp: 3.6.0
CUDA: available
Vulkan: available
CUDA device: NVIDIA GeForce RTX 4090
CUDA used VRAM: 6.38% (1.53GB/23.99GB)
CUDA free VRAM: 93.61% (22.46GB/23.99GB)
Vulkan device: NVIDIA GeForce RTX 4090
Vulkan used VRAM: 6.38% (1.53GB/23.99GB)
Vulkan free VRAM: 93.61% (22.46GB/23.99GB)
Vulkan unified memory: 512MB (2.08%)
CPU model: AMD Ryzen 9 7900X 12-Core Processor
Math cores: 12
Used RAM: 50.15% (63.75GB/127.12GB)
Free RAM: 49.84% (63.37GB/127.12GB)
Used swap: 51.24% (76.41GB/149.12GB)
Max swap size: 149.12GB
mmap: supported
Additional Context
No response
Relevant Features Used
- Metal support
- CUDA support
- Vulkan support
- Grammar
- Function calling
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, and I know how to start.