Copy-in embeddings in reduced precision and handle precision conversion during inference #73
base: main
Conversation
…n in TornadoVM acceleration.
…el loaders for consistent tensor loading.
Pull request overview
This PR introduces precision-aware embedding handling by storing embeddings in FP16 format and performing on-the-fly conversion to FP32 during GPU inference, reducing memory footprint while maintaining computational accuracy.
- Embeddings are now stored in reduced precision (FP16) and converted to FP32 on GPU during inference
- Introduced new TornadoVM conversion kernels (`convertFP16toFP32` and `convertFP32toFP16`) for efficient precision conversion
- Refactored tensor loading to support both standard and TornadoVM-compatible memory layouts with proper header handling
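To make the memory trade-off concrete, here is a minimal host-side sketch of the element-wise FP16↔FP32 conversions that the PR's GPU kernels perform, using the JDK 20+ `Float.float16ToFloat`/`Float.floatToFloat16` built-ins. The class name `Fp16ConversionSketch` is hypothetical; the real kernels operate on TornadoVM `HalfFloatArray`/`FloatArray` types on device, not on plain Java arrays.

```java
// Hypothetical host-side sketch of what the PR's TornadoVM kernels do per
// element on the GPU. Requires JDK 20+ for Float.float16ToFloat/floatToFloat16.
public class Fp16ConversionSketch {

    // FP16 (stored as raw short bits) -> FP32, mirroring convertFP16toFP32.
    public static float[] fp16ToFp32(short[] half) {
        float[] out = new float[half.length];
        for (int i = 0; i < half.length; i++) {
            out[i] = Float.float16ToFloat(half[i]);
        }
        return out;
    }

    // FP32 -> FP16 bits, mirroring convertFP32toFP16 (rounds to nearest FP16).
    public static short[] fp32ToFp16(float[] full) {
        short[] out = new short[full.length];
        for (int i = 0; i < full.length; i++) {
            out[i] = Float.floatToFloat16(full[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // These values are exactly representable in FP16, so the round trip is lossless.
        float[] embedding = {1.5f, -0.25f, 3.140625f};
        short[] stored = fp32ToFp16(embedding);  // half the memory of a float[]
        float[] restored = fp16ToFp32(stored);
        for (int i = 0; i < embedding.length; i++) {
            assert restored[i] == embedding[i];
        }
        System.out.println("round trip ok");
    }
}
```

Storing embeddings as 2-byte FP16 values halves their footprint; values outside FP16's range or precision are rounded on the way down, which is why the conversion back to FP32 happens before the matmuls that need full precision.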
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 24 comments.
| File | Description |
|---|---|
| src/main/java/org/beehive/gpullama3/tornadovm/layers/Activation.java | Updated activation layer to use FP16→FP32 conversion kernel instead of empty copy task |
| src/main/java/org/beehive/gpullama3/tornadovm/kernels/TransformerComputeKernels.java | Added GPU kernels for FP16↔FP32 precision conversion |
| src/main/java/org/beehive/gpullama3/tensor/tornado/TornadoTensor.java | Removed size tracking from base class (moved to subclasses), added spacing to documentation |
| src/main/java/org/beehive/gpullama3/tensor/tornado/Q8_0TornadoTensor.java | Moved size field from parent to this class, added TODO for Q8_0 loading fix |
| src/main/java/org/beehive/gpullama3/tensor/tornado/FP32TornadoTensor.java | Renamed field to tornadoNativeArray, added factory method for memory segment loading |
| src/main/java/org/beehive/gpullama3/tensor/tornado/FP16TornadoTensor.java | Renamed field to tornadoNativeArray, added factory method for memory segment loading |
| src/main/java/org/beehive/gpullama3/tensor/GGUF.java | Split tensor loading into loadTensorsStandard and loadTensorsTornado with proper memory layout handling |
| src/main/java/org/beehive/gpullama3/model/loader/*.java | Updated model loaders to remove loadWeights parameter and use loadTornadoTensor instead of loadTornadoTensorAsFP32 |
| src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java | Updated to always load weights, replaced manual FP32 conversion with GPU-based approach |
| src/main/java/org/beehive/gpullama3/model/loader/AbstractModelLoader.java | Removed loadWeights parameter, added logic to select tensor loading method based on TornadoVM usage |
| src/main/java/org/beehive/gpullama3/model/ModelType.java | Updated all model type loaders to remove loadWeights parameter |
| src/main/java/org/beehive/gpullama3/inference/state/*.java | Added embeddingX field (HalfFloatArray) to all state classes for FP16 embedding storage |
| src/main/java/org/beehive/gpullama3/inference/InferenceCore.java | Modified embedding copy to use FP16 format (Short.BYTES) targeting state.embeddingX |
| set_paths | Commented out TornadoVM path configuration variables |
| .github/workflows/build-and-run.yml | Disabled Spotless formatting check, switched TornadoVM clone to develop branch |
```java
System.out.println("[GGUF] fileChannel = FileChannel.open(modelPath, READ, WRITE);");
fileChannel = FileChannel.open(modelPath, READ, WRITE);
```
**Copilot AI** · Nov 26, 2025
File opened with WRITE permission unnecessarily. The file is opened with READ and WRITE permissions, but based on the method name and usage, only read access should be required for loading metadata. Opening with write permissions creates an unnecessary security risk. Consider using only READ permission unless write access is actually needed.
Suggested change:

```diff
-System.out.println("[GGUF] fileChannel = FileChannel.open(modelPath, READ, WRITE);");
-fileChannel = FileChannel.open(modelPath, READ, WRITE);
+System.out.println("[GGUF] fileChannel = FileChannel.open(modelPath, READ);");
+fileChannel = FileChannel.open(modelPath, READ);
```
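As a sanity check that read-only access is sufficient for loading, here is a minimal sketch of opening a channel with `READ` only and memory-mapping it `READ_ONLY`; the temp file and class name are hypothetical stand-ins for the GGUF model path.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadOnlyMapSketch {

    // Maps the file READ_ONLY through a channel opened with READ alone:
    // no WRITE permission is needed anywhere on this path.
    static byte firstByte(Path path) throws Exception {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return buf.get(0);
        }
    }

    public static void main(String[] args) throws Exception {
        Path modelPath = Files.createTempFile("model", ".gguf"); // hypothetical stand-in
        Files.write(modelPath, new byte[]{'G', 'G', 'U', 'F'});
        System.out.println((char) firstByte(modelPath));
        Files.delete(modelPath);
    }
}
```

`FileChannel.map(READ_ONLY, …)` works on a read-only channel; only `MapMode.READ_WRITE` would require the channel to be opened with `WRITE`.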
```java
 * Loads GGUF tensor data using a TornadoVM-compatible memory layout.
 *
 * <p>This method parses the GGUF tensor list and memory-maps each tensor
 * in {@link TornadoNativeArray} layout directly from the underlying{@link FileChannel}.
```
**Copilot AI** · Nov 26, 2025
Missing space in JavaDoc. There should be a space between 'underlying' and the opening brace.
Suggested change:

```diff
- * in {@link TornadoNativeArray} layout directly from the underlying{@link FileChannel}.
+ * in {@link TornadoNativeArray} layout directly from the underlying {@link FileChannel}.
```
```java
        );

    }
    // @formatter:off
```
**Copilot AI** · Nov 26, 2025
Incorrect formatter comment. This should be @formatter:on to match the opening @formatter:off comment, not @formatter:off.
Suggested change:

```diff
-    // @formatter:off
+    // @formatter:on
```
```java
                : new ChatTokens("<|im_start|>", "<|im_end|>", "", "<|end_of_text|>", "<|endoftext|>");
        return new Qwen3(config, tokenizer, weights, ChatFormat.create(tokenizer, chatTokens));
    }
    // @formatter:off
```
**Copilot AI** · Nov 26, 2025
Incorrect formatter comment. This should be @formatter:on to match the opening @formatter:off comment, not @formatter:off.
Suggested change:

```diff
-    // @formatter:off
+    // @formatter:on
```
```java
        );
        return config;
    }
    // @formatter:off
```
**Copilot AI** · Nov 26, 2025
Incorrect formatter comment. This should be @formatter:on to match the opening @formatter:off comment, not @formatter:off.
Suggested change:

```diff
-    // @formatter:off
+    // @formatter:on
```
```diff
   run: |
     cd ${{ github.workspace }}
-    ./mvnw -T12C -Pspotless spotless:check
+    #./mvnw -T12C -Pspotless spotless:check
```
**Copilot AI** · Nov 26, 2025
Code formatting check is commented out. The Spotless code formatting check has been commented out, which means code style violations will not be caught in CI. This should be re-enabled to maintain code quality standards, or removed entirely if no longer needed.
Suggested change:

```diff
-    #./mvnw -T12C -Pspotless spotless:check
+    ./mvnw -T12C -Pspotless spotless:check
```
```java
 * Load a tensor and manually convert to FP32 (FloatArray).
 * Used for embeddings that currently are treated as FP32.
 * TODO: it is ultra-slow and should be removed
```
**Copilot AI** · Nov 26, 2025
Outdated documentation. The JavaDoc states "Used for embeddings that currently are treated as FP32" and includes a TODO about removal, but based on the PR changes, embeddings are now handled as FP16 with GPU-based conversion. The documentation should be updated to clarify the current usage or deprecation status of this method.
Suggested change:

```diff
- * Load a tensor and manually convert to FP32 (FloatArray).
- * Used for embeddings that currently are treated as FP32.
- * TODO: it is ultra-slow and should be removed
+ * [DEPRECATED] Loads a tensor and manually converts it to FP32 (FloatArray).
+ * Embeddings are now handled as FP16 with GPU-based conversion; this method remains only for legacy compatibility.
+ * This method is ultra-slow and should not be used in new code. Scheduled for removal.
```
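To illustrate why this legacy path is slow, here is a hypothetical sketch of what a manual host-side FP32 load amounts to: a scalar loop that widens every FP16 element on the CPU before upload, which the PR's GPU conversion kernel avoids. The class and method names are illustrative, not the project's actual API; `Float.float16ToFloat` requires JDK 20+.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HostFp32LoadSketch {

    // Widens raw little-endian FP16 tensor bytes to a float[] on the host,
    // one element at a time. On a multi-GB model this loop dominates load time.
    static float[] widenFp16(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[raw.length / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = Float.float16ToFloat(buf.getShort(i * 2)); // JDK 20+
        }
        return out;
    }
}
```

The GPU route instead copies the FP16 bytes as-is and runs the widening as a data-parallel kernel, so the host never touches individual elements.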
```diff
 public class FP32TornadoTensor extends TornadoTensor {
-    private final FloatArray values;
+    private final FloatArray tornadoNativeArray;
```
**Copilot AI** · Nov 26, 2025
Renamed field could be more descriptive. The field was renamed from `values` to `tornadoNativeArray`, which makes it consistent with the sibling class FP16TornadoTensor, but a more descriptive name like `data` or `array` would be clearer since the "Tornado" context is already evident from the class name.
```java
        fields.wrapK = new FloatArray(nEmbdKGqa);
        fields.wrapV = new FloatArray(nEmbdKGqa);

        fields.embeddingX = new HalfFloatArray(config.dim());
```
**Copilot AI** · Nov 26, 2025
Duplicate field initialization. The field fields.embeddingX is initialized twice - once at line 70 and again at line 80 with the same value. The second initialization should be removed.
Suggested change:

```diff
-        fields.embeddingX = new HalfFloatArray(config.dim());
+        // Removed duplicate initialization of fields.embeddingX
```
```java
        wrapX.set(i, x.get(i).getFloat32());
    }

    public static void convertFP32toFP16(KernelContext context,  FloatArray wrapX, HalfFloatArray x) {
```
**Copilot AI** · Nov 26, 2025
Extra whitespace after parameter. There are two spaces between the comma and FloatArray - should be one space.
Suggested change:

```diff
-    public static void convertFP32toFP16(KernelContext context,  FloatArray wrapX, HalfFloatArray x) {
+    public static void convertFP32toFP16(KernelContext context, FloatArray wrapX, HalfFloatArray x) {
```
Force-pushed from 20b0d4c to 937408f
No description provided.