SOLVED! Read the thread for the investigation details and the solution.
In rwkv.cpp, I'm updating ggml from commit a1d0ea7 to the most recent commit 8ca2c19.
After the update, FP32, FP16 and quantized inference on CPU work, and FP32 and FP16 inference on GPU (CUDA) also work.
However, quantized inference on GPU (CUDA) does not work: it silently leaves the result tensors filled with zeros. I'm using the same offloading method that worked fine before: set the tensor's backend, then call ggml_cuda_transform_tensor.
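Concretely, for some tensor t (an illustrative name; this is the same two-step pattern used in the repro below):

    t->backend = GGML_BACKEND_GPU;
    ggml_cuda_transform_tensor(t->data, t);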
Here's a minimal program that reproduces the behavior:
#include <ggml.h>
#include <ggml-cuda.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define SET_ELEMENT_F32(tensor, i, value) ((float *) tensor->data)[i] = value

void run_test(bool offload) {
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };

    struct ggml_context * ctx = ggml_init(params);

    // FP32 source tensor: x[i] = i.
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 32, 1);

    for (int i = 0; i < 32; i++) {
        SET_ELEMENT_F32(x, i, 1.0F * i);
    }

    // Q4_0 copy of x.
    struct ggml_tensor * x_quantized = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 32, 1);

    int64_t hist[16];
    ggml_quantize_chunk(x_quantized->type, (const float *) x->data, x_quantized->data, 0, 32, hist);

    // Optionally offload both x and x_quantized to the GPU.
    if (offload) {
        x->backend = GGML_BACKEND_GPU;
        ggml_cuda_transform_tensor(x->data, x);

        x_quantized->backend = GGML_BACKEND_GPU;
        ggml_cuda_transform_tensor(x_quantized->data, x_quantized);
    }

    // FP32 vector to multiply by: y[i] = i.
    struct ggml_tensor * y = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 32);

    for (int i = 0; i < 32; i++) {
        SET_ELEMENT_F32(y, i, 1.0F * i);
    }

    // Same matmul, once against the FP32 tensor and once against the Q4_0 tensor.
    struct ggml_tensor * mul0 = ggml_mul_mat(ctx, x, y);
    struct ggml_tensor * mul1 = ggml_mul_mat(ctx, x_quantized, y);

    struct ggml_cgraph graph = ggml_build_forward(mul0);
    ggml_build_forward_expand(&graph, mul1);

    struct ggml_cplan plan = ggml_graph_plan(&graph, 2);
    uint8_t * work_data = (uint8_t *) malloc(plan.work_size);
    plan.work_data = work_data;

    ggml_graph_compute(&graph, &plan);

    free(work_data);

    fprintf(stderr, "---\n");
    fprintf(stderr, "offload = %d\n", offload);
    fprintf(stderr, "FP32 result = %f\n", ((float *) mul0->data)[0]);
    fprintf(stderr, "Q4_0 result = %f\n", ((float *) mul1->data)[0]);

    ggml_free(ctx);
}

int main(void) {
#ifdef GGML_USE_CUBLAS
    run_test(false);
    run_test(true);
#endif
    return 0;
}
On my Windows 10 machine it prints:
---
offload = 0
FP32 result = 10416.000000
Q4_0 result = 10361.083984
---
offload = 1
FP32 result = 10416.000000
Q4_0 result = 0.000000
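(As a sanity check, the FP32 value is exactly what it should be: the dot product is the sum of i·i for i = 0…31, which is 31·32·63/6 = 10416. The CPU Q4_0 value of 10361.08 is close to it, which is the expected 4-bit quantization error.)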
I expect the Q4_0 result when offloading to be equal to the corresponding result when no offloading is performed.
I'm 90% sure that this is not a bug in ggml, but rather me doing something wrong. How can the code above be fixed?