
Free VRAM programmatically instead with GC #303


Closed
IfnotFr opened this issue Aug 31, 2024 · 2 comments
Labels: new feature, requires triage

Comments


IfnotFr commented Aug 31, 2024

Feature Description

Currently, freeing the VRAM used by LlamaContext and LlamaChatSession is left to the GC once we unset the variables holding them. But relying on the GC to free VRAM gets complicated when we want to programmatically manage multiple contexts/sessions.

For example, in my application I need to run inference against multiple chat contexts (different prompts, histories, etc.). Currently I fork a worker.js every time and rely on killing the child process to free the VRAM, but that is slow and cumbersome.

Code example filling the VRAM

The example below fills the VRAM because the garbage collector does not get a chance to free it when the context/session variables are replaced. A workaround is to expose the Node GC and run it manually, but that depends on the environment (and is not possible in my case).

import {fileURLToPath} from "url"
import path from "path"
import {getLlama, LlamaChatSession} from "node-llama-cpp"

const __dirname = path.dirname(fileURLToPath(import.meta.url))

const llama = await getLlama()
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
})

let context
let session
let i = 0

// each iteration creates a new context and session without freeing the previous ones, so VRAM keeps filling up
while (true) {
  i++
  context = await model.createContext()
  session = new LlamaChatSession({
    contextSequence: context.getSequence()
  })

  const q1 = 'Hi there, how are you?'
  console.log(`${i} User: ${q1}`)

  const a1 = await session.prompt(q1)
  console.log(`${i} AI: ${a1}`)

  const q2 = 'Summarize what you said'
  console.log(`${i} User: ${q2}`)

  const a2 = await session.prompt(q2)
  console.log(`${i} AI: ${a2}`)
}

Additional note: if we sleep for about 10 seconds between iterations, the GC has time to free the VRAM, but that is not a nice solution either.
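
For reference, here is a minimal sketch of the manual GC workaround mentioned above. It only works when Node is started with the --expose-gc flag (for example: node --expose-gc index.js), which I cannot rely on in my environment:

// global.gc is only defined when Node was started with the --expose-gc flag
function forceGc () {
  if (typeof global.gc === "function") {
    global.gc()
  }
}

// inside the loop, after dropping the old context/session references:
context = undefined
session = undefined
forceGc()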

The Solution

Maybe something like a LlamaContext.unload() or LlamaChatSession.unload() method that would let us free the VRAM for another context/session?
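
A rough sketch of what this could look like inside the loop (unload() here is purely hypothetical and does not exist in the current API):

context = await model.createContext()
session = new LlamaChatSession({
  contextSequence: context.getSequence()
})

await session.prompt('Hi there, how are you?')

// hypothetical API: immediately release the VRAM held by this context
await context.unload()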

Considered Alternatives

I don't see any alternative to having a method that frees the VRAM directly from the objects instead of relying on the Node GC.

Additional Context

I have read about some related problems on the Python wrapper side; maybe it can be helpful:

abetlen/llama-cpp-python#223

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

No, I don't have the time, but I can support development with donations.

IfnotFr added the "new feature" and "requires triage" labels on Aug 31, 2024
giladgd (Contributor) commented Aug 31, 2024

There's already a .dispose() function available on all the objects that you can use:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
console.log("VRAM usage", (await llama.getVramState()).used);

await context.dispose(); // dispose the context
console.log("VRAM usage", (await llama.getVramState()).used);

await model.dispose(); // dispose the model and all of its contexts
console.log("VRAM usage", (await llama.getVramState()).used);

You can also use await using to automatically dispose objects when they go out of scope:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
{
    await using model = await llama.loadModel({
        modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
    });
    console.log("VRAM usage", (await llama.getVramState()).used);
}

// the model will be automatically disposed when this line is reached
console.log("VRAM usage", (await llama.getVramState()).used);

giladgd closed this as completed Aug 31, 2024
IfnotFr (Author) commented Aug 31, 2024

Damn, I tried dispose with no luck; I must have done something wrong.

Thank you for the quick answer, and sorry for the dumb question.

I hope it will at least help people with the same problem as me.
