Description
Checklist
- 1. I have searched the existing issues but could not find the help I expected
- 2. The bug has not been fixed in the latest version
- 3. Note: if a bug report lacks the corresponding environment info and a minimal reproducible example, it will be hard to reproduce and locate the problem, which lowers the chance of getting feedback
- 4. If this is a question rather than a bug, please start a discussion at https://github.com/kvcache-ai/ktransformers/discussions; otherwise the issue will be closed
- 5. To make community discussion easier, I will use Chinese/English, or attach a Chinese/English translation if using another language. Non-Chinese/English content without a translation may be closed
Problem description

Serving Qwen3-235B-A22B with the balance_serve backend starts up normally, but the engine process crashes with `RuntimeError: CUDA error: an illegal memory access was encountered` in the sampler right after handling the first `/v1/chat/completions` request. Full log:
```
kv_cache loaded successfully.
capturing cuda graph 1 1
2025-05-20 02:02:19,016 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-20 02:02:19,038 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-20 02:02:20,527 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-05-20 02:02:20,550 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-05-20 02:02:20,635 - INFO - flashinfer.jit: Loading JIT ops: page
2025-05-20 02:02:20,656 - INFO - flashinfer.jit: Finished loading JIT ops: page
cuda_graph: 1/7, warmup finished.
capturing cuda graph 2 2
cuda_graph: 2/7, warmup finished.
capturing cuda graph 3 3
cuda_graph: 3/7, warmup finished.
capturing cuda graph 4 4
cuda_graph: 4/7, warmup finished.
capturing cuda graph 4 64
cuda_graph: 5/7, warmup finished.
capturing cuda graph 4 256
cuda_graph: 6/7, warmup finished.
capturing cuda graph 4 512
cuda_graph: 7/7, warmup finished.
2025-05-20 02:04:48,342 DEBUG /root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/context_manager.py[23]: Creating Context Manager
2025-05-20 02:04:48,342 INFO /root/tst/env/lib/python3.12/site-packages/ktransformers/server/main.py[27]: Creating SQL tables
2025-05-20 02:04:48,345 INFO /root/tst/env/lib/python3.12/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
INFO:     Started server process [3790068]
INFO:     Waiting for application startup.
Queue Proxy Started
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:10002 (Press CTRL+C to quit)
/root/tst/env/lib/python3.12/site-packages/pydantic/main.py:519: UserWarning: Pydantic serializer warnings:
  PydanticSerializationUnexpectedValue(Expected `list[dict[str, any]]` - serialized value may not be as expected [input_value={}, input_type=dict])
  return self.__pydantic_serializer__.to_json(
INFO:     172.29.9.7:47806 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2025-05-20 02:05:43,390 DEBUG /root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py[418]: get input ids of shape torch.Size([1, 38])
add query id: 1, batch.query_lengths: 38, batch_query_tokens: torch.Size([4134]), batch.block_indexes: tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16], dtype=torch.int32)
prefill_batch_i: 38, padded_batch_size 57
capture_padded_batch_size 57
Model execution time (GPU): 559.083 ms, 1.789 tokens/s
2025-05-20 02:05:43,962 - INFO - flashinfer.jit: Loading JIT ops: sampling
2025-05-20 02:05:43,985 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/root/.local/share/uv/python/cpython-3.12.10-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/.local/share/uv/python/cpython-3.12.10-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 282, in run_engine
    engine.loop()
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 260, in loop
    generated_tokens, probs = self.sampling( self.model_runner.output)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 224, in sampling
    generated_tokens, probs=self.sampler(logit, sample_options)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py", line 97, in forward
    temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xee2784eda9e4 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0xee2784e8d384 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x2e8 (0xee2784f72b58 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x422120 (0xee2784f82120 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x422418 (0xee2784f82418 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x9fbcec (0xee27c573bcec in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x71a0e4 (0xee27c545a0e4 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x42a100 (0xee2784e8a100 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x14 (0xee2784e8a1f4 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0xaa87fc (0xee27c57e87fc in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x268 (0xee27c53a3a68 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
```
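The error is reported at `temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]` in sampler.py (line 97). Below is a minimal sketch of that call in isolation; the tensor construction is my assumption (in the server, `temperatures` comes from the batch's sampling options; 57 is the padded_batch_size from the log). On a healthy setup this snippet runs cleanly, which suggests the illegal access happens earlier (e.g. during the CUDA-graph model run) and only surfaces at this call, consistent with the async-reporting warning in the log; running with CUDA_LAUNCH_BLOCKING=1 should localize the actual faulting kernel.

```python
import torch

# Isolated sketch of the failing call in
# ktransformers/server/balance_serve/inference/sampling/sampler.py:97.
# The tensor below is an assumption: in the server, `temperatures` is built
# from the batch's per-query sampling options; 57 is the padded_batch_size
# reported in the log above.
if torch.cuda.is_available():
    temperatures = torch.zeros(57, device="cuda")

    # The comparison + index extraction where the error is reported:
    temperature_0_idx = torch.where(temperatures == 0)[0]

    # Force any pending asynchronous CUDA error to surface here.
    torch.cuda.synchronize()
    print(temperature_0_idx)
```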
Steps to reproduce

Run the command:

```
ktransformers --model_path /root/Qwen3-235B-A22B --gguf_path /root/Qwen3-235B-A22B/Q8_0 --architectures Qwen3MoeForCausalLM --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml
```

This uses the custom_flashinfer/tree/GQA_var_batch branch. The crash occurs on the first chat completion request after startup (see the sketch below).
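For reference, a hypothetical client call matching the `POST /v1/chat/completions` entry in the log above; the model name and message content are placeholders, since any chat completion request triggers the crash in my run:

```python
import requests

# Placeholder request; the exact payload does not matter in my run.
resp = requests.post(
    "http://127.0.0.1:10002/v1/chat/completions",  # port from the server log
    json={
        "model": "Qwen3-235B-A22B",
        "messages": [{"role": "user", "content": "hello"}],
    },
    timeout=600,
)
print(resp.status_code, resp.text[:200])
```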
Environment

- ktransformers: 0.3.1
- GPU: 4090D
- OS: Ubuntu 24.04