
Quantized matmul with CUDA sets the result to zero instead of properly computing it #529

@saharNooby

Description

SOLVED! Read the thread for the investigation details and the solution.


In rwkv.cpp, I'm updating ggml from commit a1d0ea7 to the most recent commit 8ca2c19.

After the update, FP32, FP16, and quantized inference on the CPU all work. FP32 and FP16 inference on the GPU (CUDA) also work.

However, quantized inference on the GPU (CUDA) does not work: it silently leaves the result tensors filled with zeros. I'm using the same offloading method that worked fine before the update: set the tensor's backend to GGML_BACKEND_GPU and call ggml_cuda_transform_tensor.

Here's a minimal program that reproduces the behavior:

#include <ggml.h>

#ifdef GGML_USE_CUBLAS
#include <ggml-cuda.h>
#endif

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define SET_ELEMENT_F32(tensor, i, value) ((float *) tensor->data)[i] = value

void run_test(bool offload) {
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };

    struct ggml_context * ctx = ggml_init(params);

    // Source FP32 matrix: a single row of the 32 values 0..31.
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 32, 1);

    for (int i = 0; i < 32; i++) {
        SET_ELEMENT_F32(x, i, 1.0F * i);
    }

    // Q4_0 copy of x, quantized on the CPU.
    struct ggml_tensor * x_quantized = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 32, 1);

    int64_t hist[16];
    ggml_quantize_chunk(x_quantized->type, (const float *) x->data, x_quantized->data, 0, 32, hist);

#ifdef GGML_USE_CUBLAS
    if (offload) {
        // The offloading method that worked before the update:
        // set the backend, then move the tensor data to VRAM.
        x->backend = GGML_BACKEND_GPU;
        ggml_cuda_transform_tensor(x->data, x);

        x_quantized->backend = GGML_BACKEND_GPU;
        ggml_cuda_transform_tensor(x_quantized->data, x_quantized);
    }
#endif

    // Second matmul operand, kept on the CPU.
    struct ggml_tensor * y = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 32);

    for (int i = 0; i < 32; i++) {
        SET_ELEMENT_F32(y, i, 1.0F * i);
    }

    // The same matmul, once against the FP32 matrix and once against the Q4_0 matrix.
    struct ggml_tensor * mul0 = ggml_mul_mat(ctx, x, y);
    struct ggml_tensor * mul1 = ggml_mul_mat(ctx, x_quantized, y);

    struct ggml_cgraph graph = ggml_build_forward(mul0);

    ggml_build_forward_expand(&graph, mul1);

    struct ggml_cplan plan = ggml_graph_plan(&graph, 2);

    uint8_t * work_data = (uint8_t *) malloc(plan.work_size);
    plan.work_data = work_data;

    ggml_graph_compute(&graph, &plan);

    free(work_data);

    fprintf(stderr, "---\n");
    fprintf(stderr, "offload = %d\n", offload);
    fprintf(stderr, "FP32 result = %f\n", ((float *) mul0->data)[0]);
    fprintf(stderr, "Q4_0 result = %f\n", ((float *) mul1->data)[0]);

    ggml_free(ctx);
}

int main(void) {
#ifdef GGML_USE_CUBLAS
    run_test(false);
    run_test(true);
#endif

    return 0;
}

On my Windows 10 machine it prints:

---
offload = 0
FP32 result = 10416.000000
Q4_0 result = 10361.083984
---
offload = 1
FP32 result = 10416.000000
Q4_0 result = 0.000000

I expect the Q4_0 result with offloading to be equal to the corresponding result without offloading.
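
For what it's worth, the non-offloaded numbers are sane: both matmuls compute the dot product of x and y, which with x[i] = y[i] = i is the sum of i^2 for i = 0..31, i.e. 10416; the Q4_0 value then differs only by quantization error. A standalone check (plain C, independent of ggml) confirming the reference value:

#include <stdio.h>

int main(void) {
    // Reference value for both matmuls above: dot(x, y) with x[i] = y[i] = i.
    float acc = 0.0F;
    for (int i = 0; i < 32; i++) {
        acc += (float) i * (float) i;
    }
    // Prints 10416.000000, matching the FP32 result; the Q4_0 result
    // should land near this value, not at zero.
    printf("expected dot product = %f\n", acc);
    return 0;
}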

I'm 90% sure that this is not a bug in ggml and that I am doing something wrong. How can the code above be fixed?
