Description
Checklist
- 1. I have searched the existing issues but could not find the help I expected
- 2. The bug has not been fixed in the latest version
- 3. Note: if a bug report lacks the corresponding environment info and a minimal reproducible example, it will be hard to reproduce and locate the problem, which lowers the chance of getting feedback
- 4. If this is a question rather than a bug, please start a discussion at https://github.com/kvcache-ai/ktransformers/discussions; otherwise the issue will be closed
- 5. To make community discussion easier, I will use Chinese/English, or attach a Chinese/English translation if using another language. Non-Chinese/English content without a translation may be closed
Problem description

Serving Qwen3-235B-A22B with the balance_serve backend starts up normally, but the engine process crashes with `RuntimeError: CUDA error: an illegal memory access was encountered` in the sampler right after handling the first `/v1/chat/completions` request. Full log:
```
kv_cache loaded successfully.
capturing cuda graph 1 1
2025-05-20 02:02:19,016 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-20 02:02:19,038 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-20 02:02:20,527 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-05-20 02:02:20,550 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-05-20 02:02:20,635 - INFO - flashinfer.jit: Loading JIT ops: page
2025-05-20 02:02:20,656 - INFO - flashinfer.jit: Finished loading JIT ops: page
cuda_graph: 1/7, warmup finished.
capturing cuda graph 2 2
cuda_graph: 2/7, warmup finished.
capturing cuda graph 3 3
cuda_graph: 3/7, warmup finished.
capturing cuda graph 4 4
cuda_graph: 4/7, warmup finished.
capturing cuda graph 4 64
cuda_graph: 5/7, warmup finished.
capturing cuda graph 4 256
cuda_graph: 6/7, warmup finished.
capturing cuda graph 4 512
cuda_graph: 7/7, warmup finished.
2025-05-20 02:04:48,342 DEBUG /root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/context_manager.py[23]: Creating Context Manager
2025-05-20 02:04:48,342 INFO /root/tst/env/lib/python3.12/site-packages/ktransformers/server/main.py[27]: Creating SQL tables
2025-05-20 02:04:48,345 INFO /root/tst/env/lib/python3.12/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
INFO:     Started server process [3790068]
INFO:     Waiting for application startup.
Queue Proxy Started
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:10002 (Press CTRL+C to quit)
/root/tst/env/lib/python3.12/site-packages/pydantic/main.py:519: UserWarning: Pydantic serializer warnings:
  PydanticSerializationUnexpectedValue(Expected `list[dict[str, any]]` - serialized value may not be as expected [input_value={}, input_type=dict])
  return self.__pydantic_serializer__.to_json(
INFO:     172.29.9.7:47806 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2025-05-20 02:05:43,390 DEBUG /root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py[418]: get input ids of shape torch.Size([1, 38])
add query id: 1, batch.query_lengths: 38, batch_query_tokens: torch.Size([4134]), batch.block_indexes: tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16], dtype=torch.int32)
prefill_batch_i: 38, padded_batch_size 57
capture_padded_batch_size 57
Model execution time (GPU): 559.083 ms, 1.789 tokens/s
2025-05-20 02:05:43,962 - INFO - flashinfer.jit: Loading JIT ops: sampling
2025-05-20 02:05:43,985 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/root/.local/share/uv/python/cpython-3.12.10-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/.local/share/uv/python/cpython-3.12.10-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 282, in run_engine
    engine.loop()
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 260, in loop
    generated_tokens, probs = self.sampling( self.model_runner.output)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 224, in sampling
    generated_tokens, probs=self.sampler(logit, sample_options)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/tst/env/lib/python3.12/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py", line 97, in forward
    temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xee2784eda9e4 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0xee2784e8d384 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x2e8 (0xee2784f72b58 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x422120 (0xee2784f82120 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x422418 (0xee2784f82418 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x9fbcec (0xee27c573bcec in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x71a0e4 (0xee27c545a0e4 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x42a100 (0xee2784e8a100 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x14 (0xee2784e8a1f4 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0xaa87fc (0xee27c57e87fc in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x268 (0xee27c53a3a68 in /root/tst/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
```
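The error is reported at `temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]` in sampler.py (line 97). Below is a minimal sketch of that call in isolation; the tensor construction is my assumption (in the server, `temperatures` comes from the batch's sampling options; 57 is the padded_batch_size from the log). On a healthy setup this snippet runs cleanly, which suggests the illegal access happens earlier (e.g. during the CUDA-graph model run) and only surfaces at this call, consistent with the async-reporting warning in the log; running with CUDA_LAUNCH_BLOCKING=1 should localize the actual faulting kernel.

```python
import torch

# Isolated sketch of the failing call in
# ktransformers/server/balance_serve/inference/sampling/sampler.py:97.
# The tensor below is an assumption: in the server, `temperatures` is built
# from the batch's per-query sampling options; 57 is the padded_batch_size
# reported in the log above.
if torch.cuda.is_available():
    temperatures = torch.zeros(57, device="cuda")

    # The comparison + index extraction where the error is reported:
    temperature_0_idx = torch.where(temperatures == 0)[0]

    # Force any pending asynchronous CUDA error to surface here.
    torch.cuda.synchronize()
    print(temperature_0_idx)
```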
Steps to reproduce

Run the command:

```
ktransformers --model_path /root/Qwen3-235B-A22B --gguf_path /root/Qwen3-235B-A22B/Q8_0 --architectures Qwen3MoeForCausalLM --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml
```

This uses the custom_flashinfer/tree/GQA_var_batch branch. The crash occurs on the first chat completion request after startup (see the sketch below).
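For reference, a hypothetical client call matching the `POST /v1/chat/completions` entry in the log above; the model name and message content are placeholders, since any chat completion request triggers the crash in my run:

```python
import requests

# Placeholder request; the exact payload does not matter in my run.
resp = requests.post(
    "http://127.0.0.1:10002/v1/chat/completions",  # port from the server log
    json={
        "model": "Qwen3-235B-A22B",
        "messages": [{"role": "user", "content": "hello"}],
    },
    timeout=600,
)
print(resp.status_code, resp.text[:200])
```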
Environment

- ktransformers: 0.3.1
- GPU: 4090D
- OS: Ubuntu 24.04