Memory efficient context handling #1183

zsogitbe · 2025-05-10T06:42:43Z

Martin, in the LLamaEmbedderTests in CompareEmbeddings() I had to disable a Microsoft.Extensions.AI.IEmbeddingGenerator related code segment that does not work with the new efficient context handling. Please look at that code to decide what can be done to use it or we can also remove it. I guess that if we want that kind of functionality, then we will need to create a LLamaEmbedderService that is compatible with Microsoft.Extensions.AI.IEmbeddingGenerator.

I had to disable in SafeLlamaModelHandleTests -> MetadataValByKey_ReturnsCorrectly for Mac and Linux. On Windows it is OK. Please look at this also to decide if this is important. This has nothing to do with this PR, but I guess with llama.cpp.

martindevans · 2025-05-11T19:13:32Z

LLama/LLamaEmbedder.cs

+    /// </summary>
+    /// <param name="text"></param>
+    /// <returns></returns>
+    public int CountTokens(string text)


CountTokens and GetTokens methods are duplicated on LLamaEmbedder and LLamaStatelessExecutor. Also I don't think either of them actually requires a context (which is a very expensive object to create and destroy)!

Can these methods be moved up to LLamaWeights class instead? That's a more appropriate place for methods relating to tokens/vocabulary.

CountTokens and GetTokens methods are duplicated on LLamaEmbedder and LLamaStatelessExecutor.

The reason for this is because the contexts are made with the parameters of each specific object (text generator or embedding generator).

Also I don't think either of them actually requires a context (which is a very expensive object to create and destroy)!

The code is now streamlined to not have the context around, but created when needed and then destroyed. The logic of doing this is the same as in each object itself, for example, GetEmbeddingsWithTokenCount() does the same in LLamaEmbedder, and InferAsync() does the same in StatelessExecutor. So, the code is logical in all sense now. The overhead of creating the context on the fly is very small, and when using KernelMemory with this update, compared to before, 30% less GPU memory is used.

Can these methods be moved up to LLamaWeights class instead? That's a more appropriate place for methods relating to tokens/vocabulary.

If we would move them to LLamaWeights, then we would need to change the code to keep the parameters in each object (required to make the context - different for LLamaEmbedder and for LLamaStatelessExecutor), and pass these parameters on several places in the code to these functions, etc. A lot of modifications on several places where these functions are used. I think that the simplest and cleanest would be to leave them where they are now.

As a conclusion, I would keep it how it is now. Please let me know what you think.

Martin, on a second thought, I think that you may be right (only the params need to be kept). I will look at it!

martindevans · 2025-05-11T19:13:47Z

LLama/LLamaStatelessExecutor.cs

        }
+
+        /// <inheritdoc/>
+        public int CountTokens(string text)


See other comment about these methods

zsogitbe · 2025-05-12T08:08:19Z

Moved the code to LLamaWeights. You were right Martin, it is better there.

martindevans · 2025-05-14T21:56:09Z

LLama.Unittest/LLamaEmbedderTests.cs

-        var embeddings = await generator.GenerateAsync(
-        [
-            "The cat is cute",
+        if (false)


Does this need resolving before merge?

generator.GetService<EmbeddingGeneratorMetadata>() uses the context and thus will fail because for a context efficient handling we do not keep the context. This was the main aim of this PR.

The code in the test assumes that there is a context. I think that for the test code to work we would need some extra work to create an embedding service that keeps the context (this could be done in a next PR, if anybody is interested in to do it). The aim of the embedder in our code is different. My opinion is that the test code is wrong because it assume that the embedder is a live service, and it should not be for efficiently handling of GPU memory. There are two options, delete the test code or leave it in switched off with the TODO comment I have added.

martindevans · 2025-05-14T21:59:34Z

LLama/LLamaWeights.cs

+            using var context = CreateContext(parameters);
+            var count = context.Tokenize(text, special: true).Length;
+            return count;


There's a Tokenize method on the lower level model handle, no need to use a context: https://github.com/SciSharp/LLamaSharp/blob/master/LLama/Native/SafeLlamaModelHandle.cs#L480

(i.e. just writing a wrapper over that method should suffice)

I did not realize that there is a tokenize method on parent level. I will use that then and since it does not need a context I can move all code back to KM. The main aim of this PR was to remove the context that was create on several places because it unnecessarily fills GPU memory (saves about 30% of memory!).

martindevans · 2025-05-14T22:17:00Z

LLama/LLamaWeights.cs

+        /// <remarks>
+        /// It throws if text is null and Includes empty stop token because addBos is left true to be consistent with the CountTokens implementation.</remarks>
+        /// <see cref="CountTokens(string, IContextParams)"/>
+        public IReadOnlyList<string> GetTokens(string text, IContextParams parameters)


This implementation doesn't seem correct to me (I realise you're just moving it from LlamaSharpTextGenerator to LLamaWeights, but I don't really work on the KM stuff, so I haven't closely reviewed these methods before).

In general LLamaSharp is quite careful about never treating tokens as text, it's not safe for a number of reasons - for example a token could be half of a character, in which case it can't be decoded into text. That's what the StreamingTokenDecoder is for, you could add 10 tokens and get back just one character of text. At the very least, that means that GetTokens and CountTokens would have a mismatch.

Obviously KM needs something back to satisfy the contract of ITextTokenizer etc, so I'm not really sure what the right answer is here. Maybe we should move back into KM, an an extension method on LLamaWeights? That way you can still use it as if it's here, but it's not part of the main lib. I'm open to other ideas though.

The main aim of this PR was to remove the context that was created on several places in the KM code because it unnecessarily fills GPU memory (this saves about 30% of memory!). With using the native tokenize method and moving back the code to KM, I think that we have the right solution.

…LamaSharp into UpdateContextHandling

zsogitbe · 2025-05-15T03:38:11Z

I have updated the code.

martindevans · 2025-05-16T00:58:28Z

Just gave this a quick skim, looks pretty good after those last changes 👍

I'll try to find time tomorrow to give it one alst in depth review and get it merged.

martindevans · 2025-05-17T17:25:00Z

Thanks for all the work on this!

zsogitbe · 2025-05-18T08:25:24Z

Thanks for all the work on this!

You too, Martin!

zsogitbe added 2 commits May 10, 2025 08:36

Memory efficient context handling

7e714d0

Memory efficient context handling

925ca06

martindevans requested changes May 11, 2025

View reviewed changes

zsogitbe added 3 commits May 12, 2025 09:55

Memory efficient context handling

5f35b8e

Memory efficient context handling

17cb2a0

Merge branch 'master' into UpdateContextHandling

f77f80c

zsogitbe requested a review from martindevans May 12, 2025 08:10

martindevans reviewed May 14, 2025

View reviewed changes

zsogitbe added 2 commits May 15, 2025 05:30

Memory efficient context handling

f7fdaac

Merge branch 'UpdateContextHandling' of https://github.com/zsogitbe/L…

da4e62d

…LamaSharp into UpdateContextHandling

martindevans approved these changes May 17, 2025

View reviewed changes

martindevans merged commit 0339b03 into SciSharp:master May 17, 2025
6 checks passed

bmazzarol-bunnings mentioned this pull request Sep 28, 2025

[BUG]: GenerateAsync via the IEmbeddingGenerator interface throws ObjectDisposedException on LLama.Native.SafeLLamaContextHandle #1259

Open

This was referenced Oct 20, 2025

Bump LLamaSharp from 0.24.0 to 0.25.0 MiguelDomingues/md-localize#130

Merged

Bump LLamaSharp.Backend.Cpu from 0.24.0 to 0.25.0 MiguelDomingues/md-localize#131

Merged

Bump LLamaSharp.Backend.Vulkan from 0.24.0 to 0.25.0 MiguelDomingues/md-localize#132

Merged

Uh oh!

Memory efficient context handling #1183

Memory efficient context handling #1183

Uh oh!

Conversation

zsogitbe commented May 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zsogitbe commented May 12, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zsogitbe May 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zsogitbe commented May 15, 2025

Uh oh!

martindevans commented May 16, 2025

Uh oh!

martindevans commented May 17, 2025

Uh oh!

Uh oh!

zsogitbe commented May 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zsogitbe commented May 10, 2025 •

edited

Loading

zsogitbe May 15, 2025 •

edited

Loading