Feature Description

Currently, the VRAM used by LlamaContext and LlamaChatSession is only released by the garbage collector once the variables holding them are unset. Relying on the GC to free VRAM is complicated when we want to programmatically handle multiple contexts/sessions.
For example, in my application I need to run inference against multiple chat contexts (different prompts, histories, etc.). Right now I fork a worker.js every time and rely on killing the child process to free the VRAM, but that is slow and cumbersome.
Code example filling the VRAM

The loop below fills the VRAM because the garbage collector does not have time to free it when the context/session variables are replaced. A workaround is to expose the Node GC and run it manually, but that depends on the environment (and is not possible in my case); see the sketch after the additional note below.
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

let llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});

let context;
let session;
let i = 0;

while (true) {
    i++;
    context = await model.createContext();
    session = new LlamaChatSession({contextSequence: context.getSequence()});

    const q1 = 'Hi there, how are you?';
    console.log(`${i} User: ${q1}`);
    const a1 = await session.prompt(q1);
    console.log(`${i} AI: ${a1}`);

    const q2 = 'Summarize what you said';
    console.log(`${i} User: ${q2}`);
    const a2 = await session.prompt(q2);
    console.log(`${i} AI: ${a2}`);
}
Additional note: if we sleep for about 10 seconds between iterations, the GC has time to free the VRAM, but that is not a nice solution either.
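For reference, a minimal sketch of the --expose-gc workaround mentioned above (it assumes you control the flags Node is started with, which is not always possible):

// Run with: node --expose-gc app.js
// Dropping the references and forcing a collection gives the finalizers a
// chance to release the native VRAM sooner than an automatic GC would.
context = undefined;
session = undefined;
if (typeof global.gc === "function") {
    global.gc(); // only defined when Node was started with --expose-gc
}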
The Solution
Maybe something like a LlamaContext.unload() or LlamaChatSession.unload() method, letting us free the VRAM so it can be used for another context/session?
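For illustration only, the requested method (hypothetical, it does not currently exist) could be used like this, assuming a model has already been loaded as in the example above:

const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});
const answer = await session.prompt('Hi there, how are you?');
await context.unload(); // hypothetical: release this context's VRAM immediately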
Considered Alternatives
I don't see an alternative other than a method for unloading the VRAM directly from the objects instead of relying on the Node GC.
Additional Context
I have read about some related problems on the Python wrapper side; maybe it can be helpful: abetlen/llama-cpp-python#223

Are you willing to resolve this issue by submitting a Pull Request?
No, I don't have the time, but I can support (using donations) development.
There's already a .dispose() function available on all the objects that you can use:
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();

console.log("VRAM usage", (await llama.getVramState()).used);

await context.dispose(); // dispose the context
console.log("VRAM usage", (await llama.getVramState()).used);

await model.dispose(); // dispose the model and all of its contexts
console.log("VRAM usage", (await llama.getVramState()).used);
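Applied to the loop from the feature description, a minimal sketch (assuming the model is already loaded into model as in the snippet above) could dispose each context at the end of the iteration, releasing its VRAM deterministically instead of waiting for the GC:

let i = 0;
while (true) {
    i++;
    const context = await model.createContext();
    const session = new LlamaChatSession({contextSequence: context.getSequence()});

    const a1 = await session.prompt('Hi there, how are you?');
    console.log(`${i} AI: ${a1}`);

    // free this iteration's VRAM right away instead of relying on the GC
    await context.dispose();
}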
You can also use await using to automatically dispose things when they go out of scope:
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();

{
    await using model = await llama.loadModel({
        modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
    });

    console.log("VRAM usage", (await llama.getVramState()).used);
} // the model will be automatically disposed when this line is reached

console.log("VRAM usage", (await llama.getVramState()).used);