
[Bug] Error when chatting with the Qwen3-235B-A22B/Q8_0 model #1324

Open
@jiyif11

Description


Checklist

  • 1. I have searched for related issues but did not find the help I needed
  • 2. The bug has not been fixed in the latest version
  • 3. I understand that if a bug report is missing the environment information and a minimal reproducible example, it will be hard to reproduce and locate the problem, which lowers the chance of getting feedback
  • 4. If this is a question rather than a bug, I should open a discussion at https://github.com/kvcache-ai/ktransformers/discussions instead; otherwise the issue will be closed
  • 5. To keep the discussion accessible to the community, I will write in Chinese/English or attach a Chinese/English translation (if using another language); non-Chinese/English content without a translation may be closed

Problem description

```
kv_cache loaded successfully.
capturing cuda graph 1 1
2025-05-20 02:02:19,016 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-20 02:02:19,038 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-20 02:02:20,527 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-05-20 02:02:20,550 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-05-20 02:02:20,635 - INFO - flashinfer.jit: Loading JIT ops: page
2025-05-20 02:02:20,656 - INFO - flashinfer.jit: Finished loading JIT ops: page
cuda_graph: 1/7, warmup finished.
capturing cuda graph 2 2
cuda_graph: 2/7, warmup finished.
capturing cuda graph 3 3
cuda_graph: 3/7, warmup finished.
capturing cuda graph 4 4
cuda_graph: 4/7, warmup finished.
capturing cuda graph 4 64
cuda_graph: 5/7, warmup finished.
capturing cuda graph 4 256
cuda_graph: 6/7, warmup finished.
capturing cuda graph 4 512
cuda_graph: 7/7, warmup finished.
2025-05-20 02:04:48,342 DEBUG /root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/context_manager.py[23]: Creating Context Manager
2025-05-20 02:04:48,342 INFO /root/tst/env/lib/python3.12/site-packages/ktransformers/server/main.py[27]: Creating SQL tables
2025-05-20 02:04:48,345 INFO /root/tst/env/lib/python3.12/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
INFO:     Started server process [3790068]
INFO:     Waiting for application startup.
Queue Proxy Started
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:10002 (Press CTRL+C to quit)
/root/tst/env/lib/python3.12/site-packages/pydantic/main.py:519: UserWarning: Pydantic serializer warnings:
  PydanticSerializationUnexpectedValue(Expected `list[dict[str, any]]` - serialized value may not be as expected [input_value={}, input_type=dict])
  return self.__pydantic_serializer__.to_json(
INFO:     172.29.9.7:47806 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2025-05-20 02:05:43,390 DEBUG /root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py[418]: get input ids of shape torch.Size([1, 38])
add query id: 1, batch.query_lengths: 38, batch_query_tokens: torch.Size([4134]), batch.block_indexes: tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16], dtype=torch.int32)
prefill_batch_i: 38, padded_batch_size 57
capture_padded_batch_size 57
Model execution time (GPU): 559.083 ms, 1.789 tokens/s
2025-05-20 02:05:43,962 - INFO - flashinfer.jit: Loading JIT ops: sampling
2025-05-20 02:05:43,985 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/root/.local/share/uv/python/cpython-3.12.10-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/.local/share/uv/python/cpython-3.12.10-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 282, in run_engine
    engine.loop()
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 260, in loop
    generated_tokens, probs = self.sampling(self.model_runner.output)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 224, in sampling
    generated_tokens, probs = self.sampler(logit, sample_options)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py", line 97, in forward
    temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xee2784eda9e4 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0xee2784e8d384 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x2e8 (0xee2784f72b58 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x422120 (0xee2784f82120 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x422418 (0xee2784f82418 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x9fbcec (0xee27c573bcec in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x71a0e4 (0xee27c545a0e4 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x42a100 (0xee2784e8a100 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x14 (0xee2784e8a1f4 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #9: + 0xaa87fc (0xee27c57e87fc in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x268 (0xee27c53a3a68 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
```
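As the error text itself notes, the illegal memory access is reported asynchronously, so the Python stack trace above (which ends in `sampler.py` at the `torch.where(sampling_config.temperatures == 0)` call) may not point at the kernel that actually faulted. A debugging sketch, assuming the same launch command as in the reproduction steps below: setting `CUDA_LAUNCH_BLOCKING=1` (the standard switch suggested in the error message) makes kernel launches synchronous so the traceback lands on the real faulting operation.

```bash
# Debugging sketch (not yet verified): synchronous CUDA launches make the
# traceback point at the kernel that actually triggered the illegal memory
# access. Expect inference to be noticeably slower with this set.
CUDA_LAUNCH_BLOCKING=1 ktransformers \
  --model_path /root/Qwen3-235B-A22B \
  --gguf_path /root/Qwen3-235B-A22B/Q8_0 \
  --architectures Qwen3MoeForCausalLM \
  --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml
```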

Reproduction steps

Run the command:
ktransformers --model_path /root/Qwen3-235B-A22B --gguf_path /root/Qwen3-235B-A22B/Q8_0 --architectures Qwen3MoeForCausalLM --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml

Using the custom_flashinfer/tree/GQA_var_batch branch.
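The crash occurs on the first chat request after the server finishes capturing the CUDA graphs (the `POST /v1/chat/completions` seen in the log above). A minimal request sketch that should exercise the same path, assuming the server is listening on port 10002 as in the Uvicorn log line; the model name and message content below are placeholders, not the exact request that was sent.

```bash
# Illustrative request against the OpenAI-compatible endpoint from the log.
# Port 10002 is taken from the Uvicorn line above; the model name is a placeholder.
curl http://127.0.0.1:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-235B-A22B",
        "messages": [{"role": "user", "content": "hello"}],
        "stream": false
      }'
```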

Environment

ktransformers 0.3.1
GPU: 4090D
Ubuntu 24.04
