SOLVED! Read the thread for the investigation details and the solution.
In rwkv.cpp, I'm updating ggml from commit a1d0ea7 to the most recent commit 8ca2c19.
After the update, FP32, FP16 and quantized inference on CPU work, and FP32 and FP16 inference on GPU (CUDA) also work.
However, quantized inference on GPU (CUDA) does not work: it silently leaves the result tensors filled with zeros. I'm using the same offloading method that worked fine before: set the tensor's backend, then call ggml_cuda_transform_tensor.
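Concretely, for some tensor t (an illustrative name; this is the same two-step pattern used in the repro below):

    t->backend = GGML_BACKEND_GPU;
    ggml_cuda_transform_tensor(t->data, t);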
Here's a minimal program that reproduces the behavior:
#include <ggml.h>
#include <ggml-cuda.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define SET_ELEMENT_F32(tensor, i, value) ((float *) tensor->data)[i] = value

void run_test(bool offload) {
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };

    struct ggml_context * ctx = ggml_init(params);

    // FP32 source tensor: x[i] = i.
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 32, 1);

    for (int i = 0; i < 32; i++) {
        SET_ELEMENT_F32(x, i, 1.0F * i);
    }

    // Q4_0 copy of x.
    struct ggml_tensor * x_quantized = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 32, 1);

    int64_t hist[16];
    ggml_quantize_chunk(x_quantized->type, (const float *) x->data, x_quantized->data, 0, 32, hist);

    // Optionally offload both x and x_quantized to the GPU.
    if (offload) {
        x->backend = GGML_BACKEND_GPU;
        ggml_cuda_transform_tensor(x->data, x);

        x_quantized->backend = GGML_BACKEND_GPU;
        ggml_cuda_transform_tensor(x_quantized->data, x_quantized);
    }

    // FP32 vector to multiply by: y[i] = i.
    struct ggml_tensor * y = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 32);

    for (int i = 0; i < 32; i++) {
        SET_ELEMENT_F32(y, i, 1.0F * i);
    }

    // Same matmul, once against the FP32 tensor and once against the Q4_0 tensor.
    struct ggml_tensor * mul0 = ggml_mul_mat(ctx, x, y);
    struct ggml_tensor * mul1 = ggml_mul_mat(ctx, x_quantized, y);

    struct ggml_cgraph graph = ggml_build_forward(mul0);
    ggml_build_forward_expand(&graph, mul1);

    struct ggml_cplan plan = ggml_graph_plan(&graph, 2);
    uint8_t * work_data = (uint8_t *) malloc(plan.work_size);
    plan.work_data = work_data;

    ggml_graph_compute(&graph, &plan);

    free(work_data);

    fprintf(stderr, "---\n");
    fprintf(stderr, "offload = %d\n", offload);
    fprintf(stderr, "FP32 result = %f\n", ((float *) mul0->data)[0]);
    fprintf(stderr, "Q4_0 result = %f\n", ((float *) mul1->data)[0]);

    ggml_free(ctx);
}

int main(void) {
#ifdef GGML_USE_CUBLAS
    run_test(false);
    run_test(true);
#endif
    return 0;
}
On my Windows 10 machine it prints:
---
offload = 0
FP32 result = 10416.000000
Q4_0 result = 10361.083984
---
offload = 1
FP32 result = 10416.000000
Q4_0 result = 0.000000
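(As a sanity check, the FP32 value is exactly what it should be: the dot product is the sum of i·i for i = 0…31, which is 31·32·63/6 = 10416. The CPU Q4_0 value of 10361.08 is close to it, which is the expected 4-bit quantization error.)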
I expect the Q4_0 result when offloading to be equal to the corresponding result when no offloading is performed.
I'm 90% sure that this is not a bug in ggml, but rather me doing something wrong. How can the code above be fixed?