Feature Description

Currently, the VRAM used by LlamaContext and LlamaChatSession is only released by the garbage collector once the variables holding them are unset. Relying on the GC to free VRAM is complicated when we want to programmatically handle multiple contexts/sessions.
For example, in my application I need to run inference against multiple chat contexts (different prompts, histories, etc.). Right now I fork a worker.js every time and rely on killing the child process to free the VRAM, but that is slow and cumbersome.
Code example filling the VRAM

The loop below fills the VRAM because the garbage collector does not have time to free it when the context/session variables are replaced. A workaround is to expose the Node GC and run it manually, but that depends on the environment (and is not possible in my case); see the sketch after the additional note below.
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

let llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});

let context;
let session;
let i = 0;

while (true) {
    i++;
    context = await model.createContext();
    session = new LlamaChatSession({contextSequence: context.getSequence()});

    const q1 = 'Hi there, how are you?';
    console.log(`${i} User: ${q1}`);
    const a1 = await session.prompt(q1);
    console.log(`${i} AI: ${a1}`);

    const q2 = 'Summarize what you said';
    console.log(`${i} User: ${q2}`);
    const a2 = await session.prompt(q2);
    console.log(`${i} AI: ${a2}`);
}
Additional note: if we sleep for about 10 seconds between iterations, the GC has time to free the VRAM, but that is not a nice solution either.
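For reference, a minimal sketch of the --expose-gc workaround mentioned above (it assumes you control the flags Node is started with, which is not always possible):

// Run with: node --expose-gc app.js
// Dropping the references and forcing a collection gives the finalizers a
// chance to release the native VRAM sooner than an automatic GC would.
context = undefined;
session = undefined;
if (typeof global.gc === "function") {
    global.gc(); // only defined when Node was started with --expose-gc
}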
The Solution
Maybe something like a LlamaContext.unload() or LlamaChatSession.unload() method, letting us free the VRAM so it can be used for another context/session?
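For illustration only, the requested method (hypothetical, it does not currently exist) could be used like this, assuming a model has already been loaded as in the example above:

const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});
const answer = await session.prompt('Hi there, how are you?');
await context.unload(); // hypothetical: release this context's VRAM immediately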
Considered Alternatives
I don't see an alternative other than a method for unloading the VRAM directly from the objects instead of relying on the Node GC.
Additional Context
I have read about some related problems on the Python wrapper side; maybe it can be helpful: abetlen/llama-cpp-python#223

Are you willing to resolve this issue by submitting a Pull Request?
No, I don't have the time, but I can support (using donations) development.
There's already a .dispose() function available on all the objects that you can use:
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();

console.log("VRAM usage", (await llama.getVramState()).used);

await context.dispose(); // dispose the context
console.log("VRAM usage", (await llama.getVramState()).used);

await model.dispose(); // dispose the model and all of its contexts
console.log("VRAM usage", (await llama.getVramState()).used);
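Applied to the loop from the feature description, a minimal sketch (assuming the model is already loaded into model as in the snippet above) could dispose each context at the end of the iteration, releasing its VRAM deterministically instead of waiting for the GC:

let i = 0;
while (true) {
    i++;
    const context = await model.createContext();
    const session = new LlamaChatSession({contextSequence: context.getSequence()});

    const a1 = await session.prompt('Hi there, how are you?');
    console.log(`${i} AI: ${a1}`);

    // free this iteration's VRAM right away instead of relying on the GC
    await context.dispose();
}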
You can also use await using to automatically dispose things when they go out of scope:
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();

{
    await using model = await llama.loadModel({
        modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
    });

    console.log("VRAM usage", (await llama.getVramState()).used);
} // the model will be automatically disposed when this line is reached

console.log("VRAM usage", (await llama.getVramState()).used);