Conversation

@mgoin (Member) commented Apr 22, 2025

Unfortunately, this seems to fail on B200:

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

Traceback (most recent call last):
  File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 209, in execute_model
    return self.model_executor.execute_model(scheduler_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/v1/executor/abstract.py", line 86, in execute_model
    output = self.collective_rpc("execute_model",
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/utils.py", line 2648, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_worker.py", line 276, in execute_model
    output = self.model_runner.execute_model(scheduler_output,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1156, in execute_model
    self._prepare_inputs(scheduler_output))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 658, in _prepare_inputs
    self.attn_metadata_builders[kv_cache_group_id].build(
  File "/home/mgoin/code/vllm/vllm/v1/attention/backends/flashinfer.py", line 483, in build
    self._plan(attn_metadata)
  File "/home/mgoin/code/vllm/vllm/v1/attention/backends/flashinfer.py", line 383, in _plan
    attn_metadata.decode_wrapper.plan(
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/decode.py", line 959, in plan
    self._cached_module = get_batch_decode_module(
                          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/decode.py", line 220, in get_batch_decode_module
    mod = gen_batch_decode_module(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/jit/attention/pytorch.py", line 699, in gen_batch_decode_module
    return gen_customize_batch_decode_module(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/jit/attention/pytorch.py", line 1152, in gen_customize_batch_decode_module
    return load_cuda_ops(
           ^^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/jit/core.py", line 137, in load_cuda_ops
    torch_cpp_ext.load(
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1623, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2076, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2222, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2522, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False': [1/4] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_decode_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/include -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/csrc -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/cutlass/include -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=1 --expt-relaxed-constexpr -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_100,code=sm_100 --compiler-options '-fPIC' -O3 -std=c++17 --threads 4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/mgoin/.cache/flashinfer/100/generated/batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False/batch_decode_kernel.cu -o batch_decode_kernel.cuda.o 
FAILED: batch_decode_kernel.cuda.o 
/usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_decode_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/include -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/csrc -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/cutlass/include -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=1 --expt-relaxed-constexpr -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_100,code=sm_100 --compiler-options '-fPIC' -O3 -std=c++17 --threads 4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /home/mgoin/.cache/flashinfer/100/generated/batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False/batch_decode_kernel.cu -o batch_decode_kernel.cuda.o 
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/include/flashinfer/attention/../vec_dtypes.cuh(390): warning #114-D: function "flashinfer::vec_t<float_t, vec_size>::load [with float_t=Params::DTypeKV, vec_size=16UL]" was referenced but not defined
                     void load(const float_t* ptr);
                          ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/include/flashinfer/attention/../vec_dtypes.cuh(399): warning #114-D: function "flashinfer::vec_t<float_t, vec_size>::ptr [with float_t=Params::DTypeKV, vec_size=16UL]" was referenced but not defined
                     float_t* ptr();
                              ^

ptxas fatal   : Unresolved extern function '_ZN10flashinfer5vec_tIhLm16EE4loadEPKh'
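
For reference, here is a minimal offline repro sketch that exercises the same FlashInfer batch-decode plan()/JIT path without going through lm_eval. It mirrors the command above (same model, fp8 KV cache, FlashInfer backend on V1); treat it as an illustration rather than part of this PR.

# Repro sketch (assumptions: settings mirror the lm_eval command above).
# The env vars must be set before vLLM is imported.
import os

os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_dtype="fp8",  # routes FlashInfer to the u8 KV-dtype JIT kernels seen in the error above
    trust_remote_code=True,
)

# A single short greedy generation is enough to reach the decode-wrapper
# plan() call whose JIT compilation fails on B200 (sm_100) above.
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)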

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mgoin marked this pull request as draft on April 22, 2025 20:48
@JaheimLee

I installed flashinfer-python==0.2.5 from PyPI and got no error, but the output is nonsense. BTW, my GPU is a 3090.

矍annisitrustvolt Ngo(ListNodeSENT jes  mysqli炆 powerhouseสามารสามารПодроб@GeneratedValueПодробПодроб琇สามารПодробПодробสามารأوضПодробПодробannisॐ Dexter矍与时俱ALARПодробПодроб就够annisПодробПодробПодробПодробПодробПодробПодроб
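
To check whether the garbage is specific to the FP8 KV-cache path, a quick A/B sketch like the one below can help (run once per dtype in a fresh process; the model name here is a placeholder, not the one from this report):

# A/B sanity-check sketch, e.g. `python check_kv.py auto` vs `python check_kv.py fp8`.
# Assumptions: FlashInfer backend on V1; placeholder model name.
import os
import sys

os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

kv_dtype = sys.argv[1] if len(sys.argv) > 1 else "fp8"
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; substitute the model that produced the garbage
    kv_cache_dtype=kv_dtype,
)
out = llm.generate(
    ["The quick brown fox"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(f"kv_cache_dtype={kv_dtype}: {out[0].outputs[0].text!r}")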

@mergify (bot) commented Apr 25, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@sjuxax (Contributor) commented May 28, 2025

I tried this PR out on my 3090 Ti on the latest main and got an illegal memory access:

Invocation
        HF_HUB_OFFLINE=0 \
        VLLM_LOGGING_LEVEL=INFO \
        VLLM_TRACE_FUNCTION=0 \
        VLLM_ENGINE_ITERATION_TIMEOUT_S=300 \
        PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
        VLLM_WORKER_MULTIPROC_METHOD=spawn \
        VLLM_USE_V1=1 \
        TORCH_CUDA_ARCH_LIST=8.6 \
        LD_PRELOAD=/opt/cuda/nsight_compute/target/linux-desktop-glibc_2_11_3-x64/libnvperf_host.so \
        VLLM_ATTENTION_BACKEND=FLASHINFER \
        python -m vllm.entrypoints.openai.api_server \
        --max-model-len 32768 \
        --gpu-memory-utilization 0.91 \
        --served-model-name Qwen3-30B-A3B-GPTQ-Int4 \
                            Qwen3-30B-A3B \
        --port 2244 \
        --kv-cache-dtype fp8 \
        --trust-remote-code \
        --max-num-seqs 24 \
        --guided-decoding-backend auto \
        --enable-chunked-prefill \
        --enable-prefix-caching \
        --enable-auto-tool-choice --tool-call-parser hermes \
        --reasoning-parser qwen3 \
        --model /intnvme/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4
Log
--- Logging error ---
Traceback (most recent call last):
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 207, in execute_model
    return self.model_executor.execute_model(scheduler_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 86, in execute_model
    output = self.collective_rpc("execute_model",
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 276, in execute_model
    output = self.model_runner.execute_model(scheduler_output,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1310, in execute_model
    valid_sampled_token_ids = sampled_token_ids.tolist()
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.12/logging/__init__.py", line 1160, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/logging/__init__.py", line 999, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/logging_utils/formatter.py", line 13, in format
    msg = logging.Formatter.format(self, record)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/logging/__init__.py", line 703, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/logging/__init__.py", line 392, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/config.py", line 4520, in __str__
    f"compilation_config={self.compilation_config!r}")
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/config.py", line 3897, in __repr__
    for k, v in asdict(self).items():
                ^^^^^^^^^^^^
  File "/usr/lib/python3.12/dataclasses.py", line 1329, in asdict
    return _asdict_inner(obj, dict_factory)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/dataclasses.py", line 1339, in _asdict_inner
    f.name: _asdict_inner(getattr(obj, f.name), dict)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/dataclasses.py", line 1382, in _asdict_inner
    return type(obj)((_asdict_inner(k, dict_factory),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/dataclasses.py", line 1383, in <genexpr>
    _asdict_inner(v, dict_factory))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/dataclasses.py", line 1386, in _asdict_inner
    return copy.deepcopy(obj)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
    y = _reconstruct(x, memo, *rv)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
    state = deepcopy(state, memo)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 143, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/_tensor.py", line 172, in __deepcopy__
    new_storage = self._typed_storage()._deepcopy(memo)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/storage.py", line 1134, in _deepcopy
    return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 143, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/storage.py", line 239, in __deepcopy__
    new_storage = self.clone()
                  ^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/storage.py", line 253, in clone
    return type(self)(self.nbytes(), device=self.device).copy_(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Call stack:
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 135, in _main
    return self._bootstrap(parent_sentinel)
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 493, in run_engine_core
    engine_core.run_busy_loop()
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 520, in run_busy_loop
    self._process_engine_step()
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 545, in _process_engine_step
    outputs = self.step_fn()
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 226, in step
    model_output = self.execute_model(scheduler_output)
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 210, in execute_model
    dump_engine_exception(self.vllm_config, scheduler_output,
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/logging_utils/dump_input.py", line 62, in dump_engine_exception
    _dump_engine_exception(config, scheduler_output, scheduler_stats)
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/logging_utils/dump_input.py", line 70, in _dump_engine_exception
    logger.error(
Unable to print the message and arguments - possible formatting error.
Use the traceback above to help find the error.
ERROR 05-28 00:11:04 [dump_input.py:78] Dumping scheduler output for model execution:
ERROR 05-28 00:11:04 [dump_input.py:79] SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-dec56c0aca9f4ed4ac53dda3fe2fa3c1,prompt_token_ids_len=7379,mm_inputs=[],mm_hashes=[],mm_positions=[],sampling_params=SamplingParams(n=1, presence_penalty=0.05, frequency_penalty=0.05, repetition_penalty=1.1, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]],num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=[], num_scheduled_tokens={chatcmpl-dec56c0aca9f4ed4ac53dda3fe2fa3c1: 2048}, total_num_scheduled_tokens=2048, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[128], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
ERROR 05-28 00:11:04 [dump_input.py:81] SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, gpu_cache_usage=0.03725093849263639, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=7379, hits=0), spec_decoding_stats=None)
ERROR 05-28 00:11:04 [core.py:502] EngineCore encountered a fatal error.
ERROR 05-28 00:11:04 [core.py:502] Traceback (most recent call last):
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 493, in run_engine_core
ERROR 05-28 00:11:04 [core.py:502]     engine_core.run_busy_loop()
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 520, in run_busy_loop
ERROR 05-28 00:11:04 [core.py:502]     self._process_engine_step()
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 545, in _process_engine_step
ERROR 05-28 00:11:04 [core.py:502]     outputs = self.step_fn()
ERROR 05-28 00:11:04 [core.py:502]               ^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 226, in step
ERROR 05-28 00:11:04 [core.py:502]     model_output = self.execute_model(scheduler_output)
ERROR 05-28 00:11:04 [core.py:502]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 213, in execute_model
ERROR 05-28 00:11:04 [core.py:502]     raise err
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 207, in execute_model
ERROR 05-28 00:11:04 [core.py:502]     return self.model_executor.execute_model(scheduler_output)
ERROR 05-28 00:11:04 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 86, in execute_model
ERROR 05-28 00:11:04 [core.py:502]     output = self.collective_rpc("execute_model",
ERROR 05-28 00:11:04 [core.py:502]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 05-28 00:11:04 [core.py:502]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-28 00:11:04 [core.py:502]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
ERROR 05-28 00:11:04 [core.py:502]     return func(*args, **kwargs)
ERROR 05-28 00:11:04 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-28 00:11:04 [core.py:502]     return func(*args, **kwargs)
ERROR 05-28 00:11:04 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 276, in execute_model
ERROR 05-28 00:11:04 [core.py:502]     output = self.model_runner.execute_model(scheduler_output,
ERROR 05-28 00:11:04 [core.py:502]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-28 00:11:04 [core.py:502]     return func(*args, **kwargs)
ERROR 05-28 00:11:04 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1310, in execute_model
ERROR 05-28 00:11:04 [core.py:502]     valid_sampled_token_ids = sampled_token_ids.tolist()
ERROR 05-28 00:11:04 [core.py:502]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 05-28 00:11:04 [core.py:502] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 05-28 00:11:04 [core.py:502] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 05-28 00:11:04 [core.py:502] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 05-28 00:11:04 [core.py:502]
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 504, in run_engine_core
    raise e
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 493, in run_engine_core
    engine_core.run_busy_loop()
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 520, in run_busy_loop
    self._process_engine_step()
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 545, in _process_engine_step
    outputs = self.step_fn()
              ^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 226, in step
    model_output = self.execute_model(scheduler_output)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 213, in execute_model
    raise err
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 207, in execute_model
    return self.model_executor.execute_model(scheduler_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 86, in execute_model
    output = self.collective_rpc("execute_model",
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 276, in execute_model
    output = self.model_runner.execute_model(scheduler_output,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1310, in execute_model
    valid_sampled_token_ids = sampled_token_ids.tolist()
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

ERROR 05-28 00:11:04 [async_llm.py:408] AsyncLLM output_handler failed.
ERROR 05-28 00:11:04 [async_llm.py:408] Traceback (most recent call last):
ERROR 05-28 00:11:04 [async_llm.py:408]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 366, in output_handler
ERROR 05-28 00:11:04 [async_llm.py:408]     outputs = await engine_core.get_output_async()
ERROR 05-28 00:11:04 [async_llm.py:408]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [async_llm.py:408]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 806, in get_output_async
ERROR 05-28 00:11:04 [async_llm.py:408]     raise self._format_exception(outputs) from None
ERROR 05-28 00:11:04 [async_llm.py:408] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO 05-28 00:11:04 [async_llm.py:333] Request chatcmpl-dec56c0aca9f4ed4ac53dda3fe2fa3c1 failed (engine dead).
ERROR 05-28 00:11:04 [serving_chat.py:884] Error in chat completion stream generator.
ERROR 05-28 00:11:04 [serving_chat.py:884] Traceback (most recent call last):
ERROR 05-28 00:11:04 [serving_chat.py:884]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 476, in chat_completion_stream_generator
ERROR 05-28 00:11:04 [serving_chat.py:884]     async for res in result_generator:
ERROR 05-28 00:11:04 [serving_chat.py:884]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 315, in generate
ERROR 05-28 00:11:04 [serving_chat.py:884]     out = q.get_nowait() or await q.get()
ERROR 05-28 00:11:04 [serving_chat.py:884]                             ^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [serving_chat.py:884]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 51, in get
ERROR 05-28 00:11:04 [serving_chat.py:884]     raise output
ERROR 05-28 00:11:04 [serving_chat.py:884]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 366, in output_handler
ERROR 05-28 00:11:04 [serving_chat.py:884]     outputs = await engine_core.get_output_async()
ERROR 05-28 00:11:04 [serving_chat.py:884]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [serving_chat.py:884]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 806, in get_output_async
ERROR 05-28 00:11:04 [serving_chat.py:884]     raise self._format_exception(outputs) from None
ERROR 05-28 00:11:04 [serving_chat.py:884] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
[rank0]:[W528 00:11:05.603345929 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1197491]

This happens when trying to run Qwen3-30B-A3B-GPTQ-Int4; it works great on V0.

@sjuxax (Contributor) commented May 28, 2025

I don't get the illegal memory access with Qwen3-32B-AWQ, but I do get junk output, as @JaheimLee indicated. A snippet:

驸-wage往事apyrus汇聚金陵铼好象.SIG往事anoia往事步入aroagar вли驸azen往事殊驸irectory-wageارد驸 Lionelkus вли兼驸ieeeuntoapyrusalan骚扰莹好象驸步入不锈往事 Benson驸itur金陵絮汇驸apyrus金陵 вли金陵

Really looking forward to FP8 KV cache on V1 for non-Hopper devices.

@Daisy-Ma-coder (Contributor) commented Jun 23, 2025

With the changes in this PR to force-enable FlashInfer on V1 with the FP8 KV cache, I'm seeing the error below on MoE FP8 models; Llama 3 FP8 works fine for me.

(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] WorkerProc hit an exception.
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] Traceback (most recent call last):
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2209, in _run_ninja_build
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     subprocess.run(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/lib/python3.12/subprocess.py", line 571, in run
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     raise CalledProcessError(retcode, process.args,
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] The above exception was the direct cause of the following exception:
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] Traceback (most recent call last):
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 465, in worker_busy_loop
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     output = func(*args, **kwargs)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]              ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     return func(*args, **kwargs)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 268, in execute_model
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     return func(*args, **kwargs)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1020, in execute_model
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     self._prepare_inputs(scheduler_output))
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 596, in _prepare_inputs
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     attn_metadata = self.attn_metadata_builder.build(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 482, in build
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     self._plan(attn_metadata)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 361, in _plan
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     attn_metadata.prefill_wrapper.plan(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 1421, in plan
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     self._cached_module = get_batch_prefill_module(self._backend)(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 197, in backend_module
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     module = gen_batch_prefill_module(backend, *args)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/attention/pytorch.py", line 563, in gen_batch_prefill_module
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     return gen_customize_batch_prefill_module(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/attention/pytorch.py", line 1078, in gen_customize_batch_prefill_module
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     return load_cuda_ops(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/core.py", line 123, in load_cuda_ops
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     torch_cpp_ext.load(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 1380, in load
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     return _jit_compile(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]            ^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 1798, in _jit_compile
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     _write_ninja_file_and_build_library(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 1926, in _write_ninja_file_and_build_library
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     _run_ninja_build(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2225, in _run_ninja_build
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     raise RuntimeError(message) from e
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] RuntimeError: Error building extension 'batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90': [1/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DTORCH_EXTENSION_NAME=batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/include -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.12/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 --compiler-options '-fPIC' -O3 -std=c++17 --threads 4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -gencode=arch=compute_90a,code=sm_90a -c /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o batch_prefill_ragged_sm90_kernel_mask_0.cuda.o 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] FAILED: batch_prefill_ragged_sm90_kernel_mask_0.cuda.o 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DTORCH_EXTENSION_NAME=batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/include -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.12/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 --compiler-options '-fPIC' -O3 -std=c++17 --threads 4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -gencode=arch=compute_90a,code=sm_90a -c /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o batch_prefill_ragged_sm90_kernel_mask_0.cuda.o 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(1339): error: static assertion failed with "No eligible GMMA operator for request configuration."
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]         static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]         ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]           detected during:
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 75 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 369 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 491 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(74): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     using TiledMmaQK = decltype(cute::make_tiled_mma(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]                                 ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(590): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   make_tiled_mma(MMA_Op const&,
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(573): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]           detected during:
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 369 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 491 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(3986): error: static assertion failed with "No eligible GMMA operator for request configuration."
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]         static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]         ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]           detected during:
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 78 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 369 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 491 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(76): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     using TiledMmaPV = decltype(cute::make_tiled_mma(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]                                 ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(590): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   make_tiled_mma(MMA_Op const&,
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(573): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]           detected during:
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 369 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 491 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh(190): error: no instance of function template "cute::partition_fragment_C" matches the argument list

Script to reproduce it:

```
# Repro for the FlashInfer fp8 kv-cache failure above (fp8 MoE model).
from vllm import LLM, SamplingParams

if __name__ == '__main__':
    prompts = ["I believe the meaning of life is"]
    sampling_params = SamplingParams(temperature=0.0, max_tokens=1024,
                                     stop_token_ids=[199999, 200002])
    llm = LLM(
        # model="amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV",
        model="RedHatAI/Mixtral-8x7B-Instruct-v0.1-FP8",
        tensor_parallel_size=8,
        max_model_len=5120,
        max_num_seqs=1,
        enable_prefix_caching=False,
        # quantization="modelopt",
        # quantization="compressed-tensors",
        kv_cache_dtype="fp8",  # fp8 kv cache is what triggers the failure above
    )
    outputs = llm.generate(prompts=prompts, sampling_params=sampling_params)

    for prompt, output in zip(prompts, outputs):
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}")
        print(f"Generated: {generated_text!r}\n")
```

chenyang78 pushed a commit to chenyang78/flashinfer that referenced this pull request Jul 8, 2025
This PR tries to fix an issue that occurred while enabling
fp8 kv-cache support for vllm (vllm-project/vllm#17005).

The issue was that in a generated inc file (e.g. in my case

flashinfer/100/generated/batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False/batch_decode_config.inc

)

we declared DTypeKV to be uint8_t, as shown below:

```
using DTypeKV = uint8_t;
...
struct Params {
  ...
  using DTypeKV = DTypeKV;
  ...
};
```

Consequently, when we instantiate the vec_t temporary from cast_load_impl
defined in vec_dtypes.cuh, i.e.

```
vec_t<src_float_t, vec_size> tmp;
```

src_float_t is instantiated to uint8_t through template parameter
T=Params::DTypeKV, where Params::DTypeKV is uint8_t.

Because vec_t doesn't have any specialization for uint8_t, we ended
up with the following ptxas error:

```
ptxas fatal   : Unresolved extern function '_ZN10flashinfer5vec_tIhLm16EE4loadEPKh'
```

The fix is to add a specialization for uint8_t. However, this may not
be the right fix, because the root cause might be that we shouldn't
generate `using DTypeKV = uint8_t;` in the first place.
yzh119 pushed a commit to flashinfer-ai/flashinfer that referenced this pull request Jul 8, 2025
@chenyang78
Copy link
Contributor

ptxas fatal : Unresolved extern function '_ZN10flashinfer5vec_tIhLm16EE4loadEPKh'

The ptxas error was fixed in flashinfer-ai/flashinfer#1234

However, the lm_eval result with gsm8k still looks very off:

(pretrained=meta-llama/Llama-3.1-8B-Instruct,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |    0|±  |     0|
|     |       |strict-match    |     5|exact_match|↑  |    0|±  |     0|

Looking into this.
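
(Side note for anyone debugging a similar accuracy collapse: before re-running a full lm_eval pass, a greedy spot check is often enough to tell whether the fp8 kv-cache path is producing garbage. The snippet below is only a sketch that reuses the vLLM API already shown in this thread; the model name and prompt are placeholders, and it should be run with VLLM_ATTENTION_BACKEND=FLASHINFER, once per kv_cache_dtype, in separate processes.)

```
# Sketch: compare greedy output across kv cache dtypes (run once per dtype).
import sys

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    kv_dtype = sys.argv[1] if len(sys.argv) > 1 else "fp8"  # e.g. "auto" or "fp8"
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_dtype=kv_dtype)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate(prompts=["I believe the meaning of life is"],
                           sampling_params=params)
    print(f"kv_cache_dtype={kv_dtype}: {outputs[0].outputs[0].text!r}")
```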

chenyang78 pushed a commit to chenyang78/vllm that referenced this pull request Jul 10, 2025
This PR fixes fp8 kv-cache issues for the FlashInfer attention
backend.

Along with vllm-project#17005, it gives
reasonable eval results on B200:

```
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7779|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7582|±  |0.0118|
```

compared with bf16 kv-cache

```
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7756|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7498|±  |0.0119|
```

Tags:
Signed-off-by: Yang Chen <[email protected]>
@chenyang78
Copy link
Contributor

> With the changes in this PR to force enable FlashInfer v1 with FP8 kv cache enabled, I'm seeing the error below in moe fp8 models; for me llama3 fp8 works fine.

@Daisy-Ma-coder I am able to repro the failure on H100. On B200, looks like it works with the fix (#20746) now. I got the output below using your example:

Prompt: 'I believe the meaning of life is'
Generated: ' to find your gift. The purpose of life is to give it away.\n\nPicasso said, “The purpose of art is washing the dust of daily life off our souls.”\n\nI believe the purpose of life is to find your gift, your passion, and share it with the world.\n\nI believe the purpose of life is to find your gift, your passion, and share it with the world.\n\nI ..."

@Daisy-Ma-coder
Copy link
Contributor

> > With the changes in this PR to force enable FlashInfer v1 with FP8 kv cache enabled, I'm seeing the error below in moe fp8 models; for me llama3 fp8 works fine.
>
> @Daisy-Ma-coder I am able to repro the failure on H100. On B200, looks like it works with the fix (#20746) now. I got the output below using your example:
>
> Prompt: 'I believe the meaning of life is'
> Generated: ' to find your gift. The purpose of life is to give it away.\n\nPicasso said, “The purpose of art is washing the dust of daily life off our souls.”\n\nI believe the purpose of life is to find your gift, your passion, and share it with the world.\n\nI believe the purpose of life is to find your gift, your passion, and share it with the world.\n\nI ..."

got it, thanks! I'm on H200s, so likely I'll still run into the same error with your fix, but I can try it out.

@chenyang78
Copy link
Contributor

> > > With the changes in this PR to force enable FlashInfer v1 with FP8 kv cache enabled, I'm seeing the error below in moe fp8 models; for me llama3 fp8 works fine.
> >
> > @Daisy-Ma-coder I am able to repro the failure on H100. On B200, looks like it works with the fix (#20746) now. I got the output below using your example:
> >
> > Prompt: 'I believe the meaning of life is'
> > Generated: ' to find your gift. The purpose of life is to give it away.\n\nPicasso said, “The purpose of art is washing the dust of daily life off our souls.”\n\nI believe the purpose of life is to find your gift, your passion, and share it with the world.\n\nI believe the purpose of life is to find your gift, your passion, and share it with the world.\n\nI ..."
>
> got it, thanks! I'm on H200s, so likely I'll still run into the same error with your fix, but I can try it out.

Yeah, it's very likely you will still see the same issue on H200. I will investigate it in a couple of days.

chenyang78 pushed a commit to chenyang78/flashinfer that referenced this pull request Jul 22, 2025
This is for resolving an issue encountered while enabling fp8 kv-cache
support in the flashinfer backend:

vllm-project/vllm#17005 (comment)

The root cause seems to be that we do not have native fp8 kv-cache
support for prefill. The failure that we hit, i.e.

```
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
```

simply reflects that our prefill kernels do not instantiate cute::GMMA::rs_op_selector
with the correct layout for fp8, which requires k-major for the B matrix:

https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/attention/hopper/kernel_traits.cuh#L78

Note that we cannot simply switch to k-major when DTypeKV is fp8;
there is more to fix to correctly support fp8 kv-cache in the kernel.

Hence the workaround in this PR: convert k and v to q_data_type when
they are fp8 but q is not. We could do this from vllm, but it seems
better to put it in flashinfer, because then no changes to caller code
are needed once fp8 kv-cache for prefill is supported in a better way.

Also, please note that I am not 100% sure this is an appropriate fix,
particularly since I am not familiar with flashinfer's code base.
Originally I was a bit worried about the impact on other kv-cache
related things such as _paged_kv_indptr and _kv_indptr_buf, but it
seems fine after reading through the relevant code in prefill.py and
hopper/prefill_sm90.cuh.

Last note: eventually, I think we will need proper fp8 kv-cache
support for prefill.
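
(To make the described conversion concrete, here is a rough caller-side sketch of what "convert k and v to q_data_type when they are fp8 but q is not" could look like. This is an assumption for illustration, not the actual flashinfer change: the function name and tensor arguments are invented, and any fp8 kv dequantization scales are ignored.)

```
import torch

# torch's fp8 storage dtypes (assuming a torch build that provides them)
FP8_DTYPES = (torch.float8_e4m3fn, torch.float8_e5m2)

def maybe_upcast_kv(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """If k/v are stored as fp8 but q is not, upcast k/v to q's dtype so the
    SM90 prefill kernels only ever see a supported (non-fp8) GMMA layout."""
    if k.dtype in FP8_DTYPES and q.dtype not in FP8_DTYPES:
        # NOTE: a real fix would also fold in any kv scale factors here.
        k = k.to(q.dtype)
        v = v.to(q.dtype)
    return k, v
```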
@mergify
Copy link

mergify bot commented Jul 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 25, 2025
@mgoin mgoin closed this Jul 25, 2025