
[Bug] Broken for Intel Macs since v0.15 (or earlier) #3078

@zxcat

Description


🐛 Bug

On macOS Ventura, mlc_llm fails with:

InternalError: Check failed: (state != nil) is false: cannot get state: for function fused_dequantize1_NT_matmul5_kernel_2SC compilation failure
There is a call to an undefined label

in both chat and REST (serve) mode.

I've tried every accessible mlc_ai/mlc_llm wheel pair: the current nightly, v0.18.1, v0.17.2, and v0.17.1 with the _cpu suffix, plus the 0.15 nightly without the suffix, but the error is the same. I've also tried different models; sometimes the failing function is fused_dequantize1_NT_matmul1_… instead of fused_dequantize1_NT_matmul5_…, but the error persists.

There was another error on Catalina: something about an unsupported Metal version 2.3.

To Reproduce

Steps to reproduce the behavior:

  1. Follow the install guide for macOS using Option 1 (Prebuilt Package)
  2. Verify installation
    • Note: it shows strange warnings:
    python -c "import mlc_llm; print(mlc_llm)"
    
    [22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
    [22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
    [22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
    <module 'mlc_llm' from '/Volumes/Seagate/proj/mlc12/lib/python3.12/site-packages/mlc_llm/__init__.py'>
    
  3. Download the model (HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC)
  4. Run mlc_llm chat --overrides "prefill_chunk_size=4096" ./
    • Note: I had to lower prefill_chunk_size, because with the default value there is a "not enough GPU memory" error (a condensed command summary follows these steps).
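
For reference, a condensed sketch of the commands I ran. The package names and the wheel index follow the prebuilt-package instructions as I understand them, and the environment name and version pins are illustrative, so double-check against the current docs before copying:

# 1. Create an environment and install the prebuilt packages (Option 1 in the guide).
#    Assumption: the -cpu nightlies are the Intel-Mac wheels the guide points to.
python -m venv env-mlc && source env-mlc/bin/activate
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly-cpu mlc-llm-nightly-cpu

# 2. Verify the installation (this is where the LLVM -mcpu warnings above appear).
python -c "import mlc_llm; print(mlc_llm)"

# 3 + 4. Download the model and start the chat with a smaller prefill chunk.
#        Passing the HF:// URL makes mlc_llm fetch the weights into its cache;
#        step 4 in my report ran the same command from inside the model directory with "./".
mlc_llm chat --overrides "prefill_chunk_size=4096" HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC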

The model compiles, but fails when the chat starts:

Details

mlc_llm chat --overrides "prefill_chunk_size=4096" ./

[2024-12-31 16:56:49] INFO auto_device.py:88: Not found device: cuda:0
[2024-12-31 16:56:50] INFO auto_device.py:88: Not found device: rocm:0
[2024-12-31 16:56:51] INFO auto_device.py:79: Found device: metal:0
[2024-12-31 16:56:52] INFO auto_device.py:88: Not found device: vulkan:0
[2024-12-31 16:56:54] INFO auto_device.py:88: Not found device: opencl:0
[2024-12-31 16:56:54] INFO auto_device.py:35: Using device: metal:0
[2024-12-31 16:56:54] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-12-31 16:56:54] INFO jit.py:118: Compiling using commands below:
[2024-12-31 16:56:54] INFO jit.py:119: /Volumes/Seagate/proj/env-mlc11/bin/python -m mlc_llm compile . --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides prefill_chunk_size=4096 --device metal:0 --output /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmphxc7xpnk/lib.dylib
[2024-12-31 16:56:55] INFO auto_config.py:70: Found model configuration: mlc-chat-config.json
[2024-12-31 16:56:55] INFO auto_target.py:91: Detecting target device: metal:0
[2024-12-31 16:56:55] INFO auto_target.py:93: Found target: {"thread_warp_size": runtime.BoxInt(1), "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
[2024-12-31 16:56:55] INFO auto_target.py:110: Found host LLVM triple: x86_64-apple-darwin22.6.0
[2024-12-31 16:56:55] INFO auto_target.py:111: Found host LLVM CPU: skylake
[2024-12-31 16:56:55] INFO auto_config.py:154: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
  --config          LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0)
  --model-type      llama
  --target          {"thread_warp_size": runtime.BoxInt(1), "host": {"mtriple": "x86_64-apple-darwin22.6.0", "tag": "", "kind": "llvm", "mcpu": "skylake", "keys": ["cpu"]}, "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmphxc7xpnk/lib.dylib
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=4096;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None
[2024-12-31 16:56:55] INFO config.py:107: Overriding prefill_chunk_size from 8192 to 4096
[2024-12-31 16:56:55] INFO compile.py:140: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
[2024-12-31 16:56:55] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-12-31 16:56:58] INFO compile.py:164: Running optimizations using TVM Unity
[2024-12-31 16:56:58] INFO compile.py:185: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 4096, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 128}
[2024-12-31 16:56:59] INFO pipeline.py:54: Running TVM Relax graph-level optimizations
[2024-12-31 16:57:03] INFO pipeline.py:54: Lowering to TVM TIR kernels
[2024-12-31 16:57:09] INFO pipeline.py:54: Running TVM TIR-level optimizations
[2024-12-31 16:57:29] INFO pipeline.py:54: Running TVM Dlight low-level optimizations
[2024-12-31 16:57:38] INFO pipeline.py:54: Lowering to VM bytecode
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 32.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 18.50 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode_to_last_hidden_states`: 19.50 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 593.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 624.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_select_last_hidden_states`: 1.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 592.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify_to_last_hidden_states`: 624.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.14 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode_to_last_hidden_states`: 0.15 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 32.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `gather_hidden_states`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `get_logits`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 592.01 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill_to_last_hidden_states`: 624.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `scatter_hidden_states`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-12-31 16:57:44] INFO pipeline.py:54: Compiling external modules
[2024-12-31 16:57:44] INFO pipeline.py:54: Compilation complete! Exporting to disk
[2024-12-31 16:57:50] INFO model_metadata.py:95: Total memory usage without KV cache:: 4932.13 MB (Parameters: 4308.13 MB. Temporary buffer: 624.00 MB)
[2024-12-31 16:57:50] INFO model_metadata.py:103: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-12-31 16:57:50] INFO compile.py:207: Generated: /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmphxc7xpnk/lib.dylib
[2024-12-31 16:57:50] INFO jit.py:126: Using compiled model lib: /Users/zxk/.cache/mlc_llm/model_lib/ef8e85d8f28ab72418b1cbacbca56dc8.dylib
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size 4096 is specified by user.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 8192, prefill chunk size 4096 is specified by user.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 128, max KV cache token capacity will be set to 7575, prefill chunk size 4096 is specified by user.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 8192, prefill chunk size is 4096.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 6775.170 MB (Parameters: 4308.133 MB. KVCache: 1153.041 MB. Temporary buffer: 1313.996 MB). The actual usage might be slightly larger than the estimated number.
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out stats of last request (token/sec)
  /metrics            print out full engine metrics
  /reset              restart a fresh chat
  /set [overrides]    override settings in the generation config. For example,
                      `/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
                      Note: Separate stop words in the `stop` option with commas (,).
  Multi-line input: Use escape+enter to start a new line.

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Volumes/Seagate/proj/env-mlc11/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/Volumes/Seagate/proj/env-mlc11/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
  File "/Volumes/Seagate/proj/env-mlc11/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  File "/Users/runner/work/package/package/tvm/src/runtime/metal/metal_module.mm", line 130
InternalError: Check failed: (state != nil) is false: cannot get state: for function fused_dequantize1_NT_matmul5_kernel_2SC compilation failure
There is a call to an undefined label

The same error occurs in serve mode and with other models.
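
For completeness, a sketch of the serve-mode invocation that fails the same way (same override; the request shape below is illustrative, and the model name should be whatever the server reports):

mlc_llm serve ./ --overrides "prefill_chunk_size=4096"
# Any completion request against the OpenAI-compatible endpoint then hits the same
# Metal compile failure, e.g.:
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "./", "messages": [{"role": "user", "content": "hi"}]}'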

Expected behavior

Chat works.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): Metal
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): macOS Ventura 13.7.2
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): iMac 2020 + AMD Radeon Pro 5500 XT 8 Gb
  • How you installed MLC-LLM (conda, source): pip (tried every stable+nightly version)
  • How you installed TVM-Unity (pip, source): pip (tried every stable+nightly version)
  • Python version (e.g. 3.10): 3.11
  • GPU driver version (if applicable): -
  • CUDA/cuDNN version (if applicable): -
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): -
  • Any other relevant information:

Additional context

It seems that the problem is not new. The oldest mlc-ai/mlc-llm pair I was able to test is from September:

pip list | grep mlc

mlc-ai-nightly                           0.15.dev570
mlc-llm-nightly                          0.1.dev1524

The stable mlc_ai_cpu-0.15.1-cp311-cp311-macosx_10_15_x86_64 wheel (from August) has no matching mlc_llm_… wheel, so I cannot test it; it does not work with mlc_llm 0.17+.
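
For anyone bisecting, this is roughly how the oldest pair I could test was installed (package names and versions taken from the pip list output above; whether these exact pins are still available on the nightly index is not guaranteed):

pip install -f https://mlc.ai/wheels "mlc-ai-nightly==0.15.dev570" "mlc-llm-nightly==0.1.dev1524"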
