Enable FlashInfer V1 FP8 kv cache #17005
Conversation
Signed-off-by: mgoin <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
I tried this PR out on my 3090 Ti on latest main and get an illegal memory access (Invocation and Log details collapsed). This is when trying to run ...
I don't get the invalid memory access with ... Really looking forward to fp8 on V1 for non-Hopper devices.
With the changes in this PR to force-enable FlashInfer on V1 with FP8 kv cache enabled, I'm seeing the error below in MoE FP8 models; for me, Llama 3 FP8 works fine. Script to reproduce it:
This PR tries to fix an issue that occurred while enabling fp8 kv-cache support for vLLM (vllm-project/vllm#17005). The issue was that in a generated inc file (e.g., in my case flashinfer/100/generated/batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False/batch_decode_config.inc) we declared DTypeKV to be uint8_t, as shown below:

```
using DTypeKV = uint8_t;
...
struct Params {
  ...
  using DTypeKV = DTypeKV;
  ...
};
```

Consequently, when we instantiate the vec_t from cast_load_impl defined in vec_dtypes.cuh, i.e.

```
vec_t<src_float_t, vec_size> tmp;
```

src_float_t is instantiated to uint8_t through the template parameter T = Params::DTypeKV, where Params::DTypeKV is uint8_t. Because vec_t doesn't have any specialization for uint8_t, we ended up with the following ptxas error:

```
ptxas fatal   : Unresolved extern function '_ZN10flashinfer5vec_tIhLm16EE4loadEPKh'
```

The fix is to add a specialization for uint8_t. However, this may not be the right fix, because the root cause might be that we shouldn't generate `using DTypeKV = uint8_t;` in the first place.
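For illustration, here is a minimal, self-contained sketch of the failure mode and the shape of the fix described above. It is not FlashInfer's actual vec_t (the real template lives in vec_dtypes.cuh and targets device code); the names and the memcpy-based load are assumptions made just for this example.

```
// Minimal sketch (not FlashInfer's actual vec_t): the primary template only
// declares load(), so instantiating it for a type with no specialization
// leaves an unresolved symbol -- analogous to the ptxas error above.
#include <cstddef>
#include <cstdint>
#include <cstring>

template <typename T, std::size_t N>
struct vec_t {
  T data[N];
  void load(const T* ptr);  // declared, but only specialized types define it
};

// Existing specializations cover float-like element types...
template <std::size_t N>
struct vec_t<float, N> {
  float data[N];
  void load(const float* ptr) { std::memcpy(data, ptr, sizeof(data)); }
};

// ...so a fix in the spirit of this PR adds a uint8_t specialization,
// giving vec_t<uint8_t, 16>::load() a definition.
template <std::size_t N>
struct vec_t<std::uint8_t, N> {
  std::uint8_t data[N];
  void load(const std::uint8_t* ptr) { std::memcpy(data, ptr, sizeof(data)); }
};

int main() {
  std::uint8_t raw[16] = {};
  vec_t<std::uint8_t, 16> v;
  v.load(raw);  // resolves only because the uint8_t specialization exists
  return 0;
}
```

Removing the uint8_t specialization from this sketch reproduces the same class of link-time failure that ptxas reports for the generated kernel.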
The ptxas error was fixed in flashinfer-ai/flashinfer#1234. However, the lm_eval result with gsm8k still looks very off. Looking into this.
This PR fixed fp8 kv-cache issues for the FlashInfer attn backend. Along with vllm-project#17005, got reasonable eval results on B200:

```
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7779|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7582|±  |0.0118|
```

compared with bf16 kv-cache:

```
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7756|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7498|±  |0.0119|
```

Signed-off-by: Yang Chen <[email protected]>
@Daisy-Ma-coder I am able to repro the failure on H100. On B200, it looks like it works with the fix (#20746) now. I got the output below using your example:
Got it, thanks! I'm on H200s, so I'll likely still run into the same error with your fix, but I can try it out.
Yeah, it's very likely you will still see the same issue on H200. I will investigate it in a couple of days.
This is for resolving an issue encountered while enabling fp8 kv-cache support in the FlashInfer backend: vllm-project/vllm#17005 (comment). The root cause seems to be that we do not have native fp8 kv-cache support for prefill. The failure that we hit, i.e.

```
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
```

simply reflects that our prefill kernels do not instantiate cute::GMMA::rs_op_selector with the correct layout for fp8, which requires k-major for the B matrix: https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/attention/hopper/kernel_traits.cuh#L78

Note that we cannot simply assign k-major when DTypeKV is fp8; there is more to fix to correctly support an fp8 kv-cache in the kernel. Hence the workaround in this PR, where we convert k and v to q_data_type if they are fp8 but q is not. We could do this from vLLM, but it seems better to put it in FlashInfer, because then no changes to caller code are required once fp8 kv-cache for prefill is supported in a better way.

Also, please note that I am not 100% sure this is an appropriate fix, particularly since I am not familiar with FlashInfer's code base. Originally, I was a bit worried about the impact on other kv-cache related things such as _paged_kv_indptr and _kv_indptr_buf, but it seems fine to me after reading through the relevant code in prefill.py and hopper/prefill_sm90.cuh.

Last note: eventually, I think we need to support fp8 kv-cache for prefill more properly.
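To make the workaround concrete, here is a minimal CUDA sketch of the casting idea described above. It is not FlashInfer's actual code (the real conversion is applied on the wrapper side before the prefill kernel runs); the kernel name, the e4m3 format, bf16 as the query dtype, and the per-tensor kv_scale are all assumptions made for this example.

```
// Hypothetical sketch: dequantize an fp8 (e4m3) K/V buffer into Q's dtype
// (bf16 here) with a per-tensor scale, so an existing bf16 prefill path can
// consume it when native fp8 prefill support is unavailable.
#include <cuda_bf16.h>
#include <cuda_fp8.h>
#include <cuda_runtime.h>

__global__ void cast_kv_to_q_dtype(const __nv_fp8_e4m3* kv_fp8,
                                   __nv_bfloat16* kv_out,
                                   float kv_scale, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // fp8 -> float -> bf16, applying the per-tensor KV scale
    kv_out[i] = __float2bfloat16(static_cast<float>(kv_fp8[i]) * kv_scale);
  }
}

int main() {
  const int n = 1024;
  __nv_fp8_e4m3* kv_fp8 = nullptr;
  __nv_bfloat16* kv_bf16 = nullptr;
  cudaMalloc(&kv_fp8, n * sizeof(__nv_fp8_e4m3));
  cudaMalloc(&kv_bf16, n * sizeof(__nv_bfloat16));
  cast_kv_to_q_dtype<<<(n + 255) / 256, 256>>>(kv_fp8, kv_bf16, /*kv_scale=*/1.0f, n);
  cudaDeviceSynchronize();
  cudaFree(kv_fp8);
  cudaFree(kv_bf16);
  return 0;
}
```

The trade-off of this approach is the extra memory traffic and the temporary buffer in Q's dtype, which is why native fp8 prefill support would still be preferable.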
This pull request has merge conflicts that must be resolved before it can be merged.
Unfortunately this seems to fail on B200