[ET-VK][Ops] dequantization op shaders and impl

morelos · morelos · commit 086594e4390d · 2025-06-13T11:03:04.000-07:00
Pull Request resolved: #11483

# Operator Description

The dequantization operator converts lower-precision integer tensors (uint8/int8/int32) back to floating-point formats (fp16/fp32) using affine dequantization. This operator supports two dequantization modes:

- **Per-tensor dequantization**: Uses a single scale and zero_point for the entire tensor
- **Per-token dequantization**: Uses different scale and zero_point values for each "token" (typically rows or channels)

The dequantization formula is:

`dequantized_value = (quantized_value - zero_point) * scale`

**Example**: For a quantized uint8 value `153` with `scale=0.1`, `zero_point=128`:
- `(153 - 128) * 0.1 = 25 * 0.1 = 2.5` (float output)

The dequantization parameters serve these purposes:
- **scale**: Controls the granularity of reconstruction (the same scale that was used during quantization)
- **zero_point**: The integer value that represents floating-point zero; subtracting it maps that value back to 0.0
- **quant_min/quant_max**: Define the valid range that was used during the original quantization (for validation)

# Shader Algorithm Overview

## Texture Storage Implementation (`dequantize_texture.glsl`)

The texture-based implementation operates on 3D textures where data is stored in RGBA texel format (4 components per texel):

**Per-tensor Mode**: Each compute thread processes one texel position. It loads a 4-component integer texel from the input texture and applies dequantization to each of the 4 components using the shared scale/zero_point parameters. It then writes the dequantized 4-component floating-point result to the output texture. All components are processed uniformly with the same dequantization parameters.

**Per-token Mode**: The token index is calculated from the spatial position, and the calculation differs between the 3D and 2D cases: `token_idx = z * dims.y + y` for 3D, or just `token_idx = y` for 2D. The per-token scale/zero_point are then retrieved from texture storage according to `token_idx`.
The scale/zero_point textures hold 4 values per texel, so the lookup needs the texel index `texel_idx = token_idx / 4` along with the component index `comp_idx = token_idx % 4` to get the necessary scale/zero_point values. Dequantization is then applied with the corresponding token-specific parameters to the 4 components of the current texel, converting each integer component to its floating-point representation.

## Buffer Storage Implementation (`dequantize_buffer.glsl`)

The buffer-based implementation operates on linear memory buffers with stride-based indexing:

**Per-tensor Mode**: Each compute thread processes one element at its global position. It converts the 3D position to linear buffer indices using the stride calculation `tidx_to_bufi(pos, strides)`. It then loads a single quantized integer value from the input buffer and applies dequantization using the shared scale/zero_point parameters. The dequantized floating-point result is stored to the output buffer at the corresponding index.

**Per-token Mode**: The logical tensor position is first calculated from the linear buffer index through dimension unwrapping. The token index is then determined based on the tensor dimensionality, where lowercase letters are position components and `Y`, `Z` are the sizes of the y and z dimensions:
- 4D: `token_idx = w * (Z * Y) + z * Y + y`
- 3D: `token_idx = z * Y + y`
- 2D: `token_idx = y`

The scale/zero_point buffers are then indexed directly with `token_idx`, and dequantization is applied with the token-specific parameters, converting the quantized integer value back to a floating-point value.

# Performance Considerations / Future Improvements

The current implementation uses default workgroup sizing. The buffer implementation processes one element per thread; it could be optimized to process multiple elements per thread for better throughput.

NOTE: Currently the only input types supported are **byte** (uint8), **char** (int8), **int** (int32). The only output types supported are **half** (fp16) and **float** (fp32).
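The per-token index arithmetic described for both storage types can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the shader code itself: `TexelLookup`, `token_to_texel`, `unwrap_index`, and `token_index` are hypothetical names, sizes and positions are stored in `(x, y, z, w)` order, and the unwrapping assumes a contiguous layout with `x` varying fastest.

```cpp
#include <array>

// Hypothetical helper type: location of a token's scale/zero_point when
// 4 values are packed into each texel.
struct TexelLookup {
  int texel_idx; // which texel holds the value
  int comp_idx;  // which of the 4 components inside that texel
};

inline TexelLookup token_to_texel(int token_idx) {
  return {token_idx / 4, token_idx % 4};
}

// Sizes and positions in (x, y, z, w) order.
using Vec4i = std::array<int, 4>;

// Unwrap a linear index into a logical (x, y, z, w) position.
// Assumes a contiguous tensor where x is the fastest-moving dimension.
inline Vec4i unwrap_index(int idx, const Vec4i& sizes) {
  Vec4i pos{};
  pos[0] = idx % sizes[0];
  idx /= sizes[0];
  pos[1] = idx % sizes[1];
  idx /= sizes[1];
  pos[2] = idx % sizes[2];
  idx /= sizes[2];
  pos[3] = idx;
  return pos;
}

// Token index for the 4D, 3D, 2D, and 1D cases.
inline int token_index(const Vec4i& pos, const Vec4i& sizes) {
  if (sizes[3] > 1) {
    return pos[3] * (sizes[2] * sizes[1]) + pos[2] * sizes[1] + pos[1];
  }
  if (sizes[2] > 1) {
    return pos[2] * sizes[1] + pos[1];
  }
  if (sizes[1] > 1) {
    return pos[1];
  }
  return 0; // 1D tensor: a single token
}
```

Under these assumptions the token index equals the linear index divided by the innermost size, which is a convenient sanity check when testing against a reference implementation.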
A future diff plans to implement **double** (fp64) output dtype support.

ghstack-source-id: 290294978
@exported-using-ghexport

Differential Revision: [D76267107](https://our.internmc.facebook.com/intern/diff/D76267107/)
diff --git a/backends/vulkan/runtime/graph/ops/glsl/dequantize.glslh b/backends/vulkan/runtime/graph/ops/glsl/dequantize.glslh
@@ -0,0 +1,16 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#ifndef DEQUANTIZE_GLSLH
+#define DEQUANTIZE_GLSLH
+
+OUT_T dequantize_val(IN_T qvalue, float scale_val, int zero_point_val) {
+  return OUT_T(float(int(qvalue) - zero_point_val) * scale_val);
+}
+
+#endif // DEQUANTIZE_GLSLH
diff --git a/backends/vulkan/runtime/graph/ops/glsl/dequantize_buffer.glsl b/backends/vulkan/runtime/graph/ops/glsl/dequantize_buffer.glsl
@@ -0,0 +1,125 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#version 450 core
+
+#define PRECISION ${PRECISION}
+
+#define IN_T ${buffer_scalar_type(IN_DTYPE)}
+#define OUT_T ${buffer_scalar_type(OUT_DTYPE)}
+
+${define_active_storage_type("buffer")}
+${define_required_extensions(IN_DTYPE)}
+${define_required_extensions(OUT_DTYPE)}
+
+layout(std430) buffer;
+
+${layout_declare_tensor(B, "r", "t_in", IN_DTYPE, "buffer")}
+${layout_declare_tensor(B, "w", "t_out", OUT_DTYPE, "buffer")}
+
+$if MODE == "per_tensor":
+  layout(push_constant) uniform restrict Block {
+    float scale;
+    int zero_point;
+    int quant_min;
+    int quant_max;
+  };
+$else:
+  ${layout_declare_tensor(B, "r", "t_scale", "float", "buffer")}
+  ${layout_declare_tensor(B, "r", "t_zero_point", "int", "buffer")}
+
+  layout(push_constant) uniform restrict Block {
+    int num_tokens;
+    int quant_min;
+    int quant_max;
+  };
+
+${layout_declare_ubo(B, "ivec4", "t_in_sizes")}
+${layout_declare_ubo(B, "ivec4", "t_in_strides")}
+${layout_declare_ubo(B, "ivec4", "t_out_sizes")}
+${layout_declare_ubo(B, "ivec4", "t_out_strides")}
+
+#include "indexing_utils.h"
+#include "dequantize.glslh"
+
+layout(local_size_x_id = 0, local_size_y_id = 1, local_size_z_id = 2) in;
+
+void main() {
+$if MODE == "per_tensor":
+  const ivec4 pos = ivec4(
+      gl_GlobalInvocationID.x,
+      gl_GlobalInvocationID.y,
+      gl_GlobalInvocationID.z,
+      0);
+
+  const int t_in_idx = tidx_to_bufi(pos, t_in_strides);
+  const int t_out_idx = tidx_to_bufi(pos, t_out_strides);
+
+  IN_T qvalue = t_in[t_in_idx];
+  OUT_T value;
+
+  value = dequantize_val(qvalue, scale, zero_point);
+
+  t_out[t_out_idx] = value;
+
+$if MODE == "per_token":
+  const ivec4 pos = ivec4(
+      gl_GlobalInvocationID.x,
+      gl_GlobalInvocationID.y,
+      gl_GlobalInvocationID.z,
+      0);
+
+  const int t_in_idx = tidx_to_bufi(pos, t_in_strides);
+  const int t_out_idx = tidx_to_bufi(pos, t_out_strides);
+
+  // Skip if out of bounds
+  if (t_in_idx >= t_in_sizes.x * t_in_sizes.y * t_in_sizes.z * t_in_sizes.w) {
+    return;
+  }
+
+  IN_T qvalue = t_in[t_in_idx];
+  OUT_T value;
+
+  // Calculate logical position from linear index and sizes (assumes contiguous layout)
+  ivec4 logical_pos;
+  int remaining = t_in_idx;
+
+  logical_pos.x = remaining % t_in_sizes.x;
+  remaining /= t_in_sizes.x;
+
+  logical_pos.y = remaining % t_in_sizes.y;
+  remaining /= t_in_sizes.y;
+
+  logical_pos.z = remaining % t_in_sizes.z;
+  remaining /= t_in_sizes.z;
+
+  logical_pos.w = remaining;
+
+  // Calculate token index based on logical position
+  int token_idx = 0;
+
+  // Check dimensions to determine how to calculate token_idx
+  if (t_in_sizes.w > 1) {
+    // 4D tensor
+    token_idx = logical_pos.w * (t_in_sizes.z * t_in_sizes.y) + logical_pos.z * t_in_sizes.y + logical_pos.y;
+  } else if (t_in_sizes.z > 1) {
+    // 3D tensor
+    token_idx = logical_pos.z * t_in_sizes.y + logical_pos.y;
+  } else if (t_in_sizes.y > 1) {
+    // 2D tensor
+    token_idx = logical_pos.y;
+  }
+  // For 1D tensor, token_idx remains 0
+
+  // Make sure token_idx is within bounds
+  token_idx = min(token_idx, num_tokens - 1);
+
+  value = dequantize_val(qvalue, t_scale[token_idx], t_zero_point[token_idx]);
+
+  t_out[t_out_idx] = value;
+}
diff --git a/backends/vulkan/runtime/graph/ops/glsl/dequantize_buffer.yaml b/backends/vulkan/runtime/graph/ops/glsl/dequantize_buffer.yaml
@@ -0,0 +1,18 @@
+dequantize_buffer:
+  parameter_names_with_default_values:
+    IN_DTYPE: int32
+    OUT_DTYPE: float
+    MODE: per_tensor
+  generate_variant_forall:
+    IN_DTYPE:
+      - VALUE: uint8
+      - VALUE: int8
+      - VALUE: int32
+    OUT_DTYPE:
+      - VALUE: half
+      - VALUE: float
+  shader_variants:
+    - NAME: dequantize_per_tensor_buffer
+      MODE: per_tensor
+    - NAME: dequantize_per_token_buffer
+      MODE: per_token
diff --git a/backends/vulkan/runtime/graph/ops/glsl/dequantize_texture.glsl b/backends/vulkan/runtime/graph/ops/glsl/dequantize_texture.glsl
@@ -0,0 +1,117 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#version 450 core
+
+#define PRECISION ${PRECISION}
+
+#define IN_T ${buffer_scalar_type(IN_DTYPE)}
+#define IVEC4_T ${texel_load_type(IN_DTYPE, "texture3d")}
+
+#define OUT_T ${buffer_scalar_type(OUT_DTYPE)}
+#define FVEC4_T ${texel_load_type(OUT_DTYPE, "texture3d")}
+
+${define_active_storage_type("texture3d")}
+${define_required_extensions(IN_DTYPE)}
+${define_required_extensions(OUT_DTYPE)}
+
+#extension GL_EXT_control_flow_attributes : require
+
+layout(std430) buffer;
+
+${layout_declare_tensor(B, "r", "t_in", IN_DTYPE, "texture3d")}
+${layout_declare_tensor(B, "w", "t_out", OUT_DTYPE, "texture3d")}
+
+$if MODE == "per_tensor":
+  layout(push_constant) uniform restrict Block {
+    float scale;
+    int zero_point;
+    int quant_min;
+    int quant_max;
+  };
+$else:
+  ${layout_declare_tensor(B, "r", "t_scale", "float", "texture3d")}
+  ${layout_declare_tensor(B, "r", "t_zero_point", "int", "texture3d")}
+
+  layout(push_constant) uniform restrict Block {
+    int num_tokens;
+    int quant_min;
+    int quant_max;
+  };
+
+${layout_declare_ubo(B, "ivec3", "t_in_limits")}
+${layout_declare_ubo(B, "ivec3", "t_out_limits")}
+
+#include "indexing_utils.h"
+#include "dequantize.glslh"
+
+layout(local_size_x_id = 0, local_size_y_id = 1, local_size_z_id = 2) in;
+
+void main() {
+$if MODE == "per_tensor":
+  const ivec3 pos = ivec3(gl_GlobalInvocationID);
+
+  // Skip if out of bounds
+  if (any(greaterThanEqual(pos, t_in_limits))) {
+    return;
+  }
+
+  IVEC4_T intex = load_texel(t_in, pos);
+  FVEC4_T outtex;
+
+  [[unroll]] for (int i = 0; i < 4; ++i) {
+    IN_T qvalue = IN_T(intex[i]);
+    OUT_T value = dequantize_val(qvalue, scale, zero_point);
+    outtex[i] = value;
+  }
+  write_texel(t_out, pos, outtex);
+
+$if MODE == "per_token":
+  const ivec3 pos = ivec3(gl_GlobalInvocationID);
+
+  // Skip if out of bounds
+  if (any(greaterThanEqual(pos, t_in_limits))) {
+    return;
+  }
+
+  IVEC4_T intex = load_texel(t_in, pos);
+
+  int token_idx = 0;
+  ivec3 dims = t_in_limits;
+
+  if (dims.z > 1) {
+    // 3D tensor
+    token_idx = pos.z * dims.y + pos.y;
+  } else if (dims.y > 1) {
+    // 2D tensor
+    token_idx = pos.y;
+  }
+  // For 1D tensor, token_idx remains 0
+
+  // Make sure token_idx is within bounds
+  token_idx = min(token_idx, num_tokens - 1);
+
+  // For texture storage, we need to calculate the texel position and component index
+  int texel_idx = token_idx / 4;
+  int comp_idx = token_idx % 4;
+
+  vec4 scale_vals = load_texel(t_scale, ivec3(texel_idx, 0, 0));
+  ivec4 zp_vals = load_texel(t_zero_point, ivec3(texel_idx, 0, 0));
+
+  float scale_val = scale_vals[comp_idx];
+  int zero_point_val = zp_vals[comp_idx];
+
+  FVEC4_T outtex;
+  [[unroll]] for (int i = 0; i < 4; ++i) {
+    IN_T qvalue = IN_T(intex[i]);
+    OUT_T value = dequantize_val(qvalue, scale_val, zero_point_val);
+    outtex[i] = value;
+  }
+
+  write_texel(t_out, pos, outtex);
+}
diff --git a/backends/vulkan/runtime/graph/ops/glsl/dequantize_texture.yaml b/backends/vulkan/runtime/graph/ops/glsl/dequantize_texture.yaml
@@ -0,0 +1,18 @@
+dequantize_texture:
+  parameter_names_with_default_values:
+    IN_DTYPE: int32
+    OUT_DTYPE: float
+    MODE: per_tensor
+  generate_variant_forall:
+    IN_DTYPE:
+      - VALUE: uint8
+      - VALUE: int8
+      - VALUE: int32
+    OUT_DTYPE:
+      - VALUE: half
+      - VALUE: float
+  shader_variants:
+    - NAME: dequantize_per_tensor_texture3d
+      MODE: per_tensor
+    - NAME: dequantize_per_token_texture3d
+      MODE: per_token
diff --git a/backends/vulkan/runtime/graph/ops/impl/Dequantize.cpp b/backends/vulkan/runtime/graph/ops/impl/Dequantize.cpp
diff --git a/backends/vulkan/test/op_tests/dequantize_test.cpp b/backends/vulkan/test/op_tests/dequantize_test.cpp