Description
🐛 Bug
On macOS Ventura, mlc_llm fails with:
InternalError: Check failed: (state != nil) is false: cannot get state: for function fused_dequantize1_NT_matmul5_kernel_2SC compilation failure
There is a call to an undefined label
in chat/rest mode.
I've tried every available mlc_ai/mlc_llm wheel pair: current nightly, v0.18.1, v0.17.2, v0.17.1 with the _cpu suffix, and nightly 0.15 without the suffix, but the error is the same. I've tried different models; sometimes the failing function is fused_dequantize1_NT_matmul1_… instead of fused_dequantize1_NT_matmul5_…, but the error persists.
On Catalina there was a different error: something about an unsupported Metal version 2.3.
To Reproduce
Steps to reproduce the behavior:
- Follow the install guide for macOS using "Option 1. Prebuilt Package"
- Verify installation
  - note: it shows strange warnings:
    python -c "import mlc_llm; print(mlc_llm)"
    [22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
    [22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
    [22:03:55] /Users/runner/work/package/package/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.5 with `-mcpu=apple-latest` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
    <module 'mlc_llm' from '/Volumes/Seagate/proj/mlc12/lib/python3.12/site-packages/mlc_llm/__init__.py'>
- Download model (HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC)
- Run
  mlc_llm chat --overrides "prefill_chunk_size=4096" ./
  - note: I had to change prefill_chunk_size, because with the default value there is a "not enough GPU memory" error.
The model compiles, but fails when chat starts:
mlc_llm chat --overrides "prefill_chunk_size=4096" ./
[2024-12-31 16:56:49] INFO auto_device.py:88: Not found device: cuda:0
[2024-12-31 16:56:50] INFO auto_device.py:88: Not found device: rocm:0
[2024-12-31 16:56:51] INFO auto_device.py:79: Found device: metal:0
[2024-12-31 16:56:52] INFO auto_device.py:88: Not found device: vulkan:0
[2024-12-31 16:56:54] INFO auto_device.py:88: Not found device: opencl:0
[2024-12-31 16:56:54] INFO auto_device.py:35: Using device: metal:0
[2024-12-31 16:56:54] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-12-31 16:56:54] INFO jit.py:118: Compiling using commands below:
[2024-12-31 16:56:54] INFO jit.py:119: /Volumes/Seagate/proj/env-mlc11/bin/python -m mlc_llm compile . --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE' --overrides prefill_chunk_size=4096 --device metal:0 --output /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmphxc7xpnk/lib.dylib
[2024-12-31 16:56:55] INFO auto_config.py:70: Found model configuration: mlc-chat-config.json
[2024-12-31 16:56:55] INFO auto_target.py:91: Detecting target device: metal:0
[2024-12-31 16:56:55] INFO auto_target.py:93: Found target: {"thread_warp_size": runtime.BoxInt(1), "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
[2024-12-31 16:56:55] INFO auto_target.py:110: Found host LLVM triple: x86_64-apple-darwin22.6.0
[2024-12-31 16:56:55] INFO auto_target.py:111: Found host LLVM CPU: skylake
[2024-12-31 16:56:55] INFO auto_config.py:154: Found model type: llama. Use `--model-type` to override.
Compiling with arguments:
--config LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
--quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0)
--model-type llama
--target {"thread_warp_size": runtime.BoxInt(1), "host": {"mtriple": "x86_64-apple-darwin22.6.0", "tag": "", "kind": "llvm", "mcpu": "skylake", "keys": ["cpu"]}, "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
--opt flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
--system-lib-prefix ""
--output /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmphxc7xpnk/lib.dylib
--overrides context_window_size=None;sliding_window_size=None;prefill_chunk_size=4096;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None
[2024-12-31 16:56:55] INFO config.py:107: Overriding prefill_chunk_size from 8192 to 4096
[2024-12-31 16:56:55] INFO compile.py:140: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=14336, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-05, vocab_size=128256, tie_word_embeddings=False, position_embedding_base=500000.0, rope_scaling=None, context_window_size=8192, prefill_chunk_size=8192, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, pipeline_parallel_stages=1, max_batch_size=128, kwargs={})
[2024-12-31 16:56:55] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-12-31 16:56:58] INFO compile.py:164: Running optimizations using TVM Unity
[2024-12-31 16:56:58] INFO compile.py:185: Registering metadata: {'model_type': 'llama', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 4096, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 128}
[2024-12-31 16:56:59] INFO pipeline.py:54: Running TVM Relax graph-level optimizations
[2024-12-31 16:57:03] INFO pipeline.py:54: Lowering to TVM TIR kernels
[2024-12-31 16:57:09] INFO pipeline.py:54: Running TVM TIR-level optimizations
[2024-12-31 16:57:29] INFO pipeline.py:54: Running TVM Dlight low-level optimizations
[2024-12-31 16:57:38] INFO pipeline.py:54: Lowering to VM bytecode
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 32.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 18.50 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode_to_last_hidden_states`: 19.50 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 593.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 624.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_select_last_hidden_states`: 1.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 592.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify_to_last_hidden_states`: 624.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.14 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode_to_last_hidden_states`: 0.15 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 32.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `gather_hidden_states`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `get_logits`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 592.01 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill_to_last_hidden_states`: 624.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `scatter_hidden_states`: 0.00 MB
[2024-12-31 16:57:42] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-12-31 16:57:44] INFO pipeline.py:54: Compiling external modules
[2024-12-31 16:57:44] INFO pipeline.py:54: Compilation complete! Exporting to disk
[2024-12-31 16:57:50] INFO model_metadata.py:95: Total memory usage without KV cache:: 4932.13 MB (Parameters: 4308.13 MB. Temporary buffer: 624.00 MB)
[2024-12-31 16:57:50] INFO model_metadata.py:103: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-12-31 16:57:50] INFO compile.py:207: Generated: /var/folders/0t/qmgg_m_d0wg8lc5_c2npdt400000gn/T/tmphxc7xpnk/lib.dylib
[2024-12-31 16:57:50] INFO jit.py:126: Using compiled model lib: /Users/zxk/.cache/mlc_llm/model_lib/ef8e85d8f28ab72418b1cbacbca56dc8.dylib
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size 4096 is specified by user.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 8192, prefill chunk size 4096 is specified by user.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 128, max KV cache token capacity will be set to 7575, prefill chunk size 4096 is specified by user.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 8192, prefill chunk size is 4096.
[16:57:51] /Users/runner/work/package/package/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 6775.170 MB (Parameters: 4308.133 MB. KVCache: 1153.041 MB. Temporary buffer: 1313.996 MB). The actual usage might be slightly larger than the estimated number.
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out stats of last request (token/sec)
/metrics print out full engine metrics
/reset restart a fresh chat
/set [overrides] override settings in the generation config. For example,
`/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/Volumes/Seagate/proj/env-mlc11/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/Volumes/Seagate/proj/env-mlc11/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
File "/Volumes/Seagate/proj/env-mlc11/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm.error.InternalError: Traceback (most recent call last):
File "/Users/runner/work/package/package/tvm/src/runtime/metal/metal_module.mm", line 130
InternalError: Check failed: (state != nil) is false: cannot get state: for function fused_dequantize1_NT_matmul5_kernel_2SC compilation failure
There is a call to an undefined label
The same happens in serve mode and with other models.
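The memory figures in the log above can be cross-checked with simple arithmetic. This is only a sketch: the numbers are copied from the `model_metadata.py` and `config.cc` log lines, while the helper name `total_mb` and the 8192 MB GPU figure (the "8 GB" card from the Environment section, assuming 1 GB = 1024 MB) are illustrative assumptions.

```python
def total_mb(*parts_mb):
    """Sum memory components given in MB, rounded to 3 decimals like the log."""
    return round(sum(parts_mb), 3)


# Compile-time estimate (model_metadata.py): parameters + temporary buffer.
compile_time_total = total_mb(4308.13, 624.00)

# Runtime estimate (config.cc): parameters + KV cache + temporary buffer.
runtime_total = total_mb(4308.133, 1153.041, 1313.996)

# Assumption: the 8 GB card exposes 8 * 1024 MB; the real usable amount may be lower.
gpu_mb = 8 * 1024
headroom_mb = round(gpu_mb - runtime_total, 3)

if __name__ == "__main__":
    print(f"compile-time total: {compile_time_total} MB")
    print(f"runtime total:      {runtime_total} MB")
    print(f"nominal headroom:   {headroom_mb} MB")
```

Both sums agree with the totals printed in the log (4932.13 MB and 6775.170 MB). With prefill_chunk_size=4096 the estimate stays below the nominal 8 GB, although the log itself notes that actual usage might be slightly larger.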
Expected behavior
Chat works.
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): Metal
- Operating system (e.g. Ubuntu/Windows/MacOS/...): macOS Ventura 13.7.2
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): iMac 2020 + AMD Radeon Pro 5500 XT 8 GB
- How you installed MLC-LLM (conda, source): pip (tried every stable+nightly version)
- How you installed TVM-Unity (pip, source): pip (tried every stable+nightly version)
- Python version (e.g. 3.10): 3.11
- GPU driver version (if applicable): -
- CUDA/cuDNN version (if applicable): -
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): -
- Any other relevant information:
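The environment fields above can also be collected with a short script. This is a minimal sketch, not something shipped with MLC-LLM: the function name `collect_environment` is hypothetical, only the standard library is required, and the TVM version is reported only if `tvm` imports and exposes `__version__`. `platform.machine()` is included because the import warning mentions `-mtriple=arm64-apple-macos` while the compile log reports the host triple `x86_64-apple-darwin22.6.0`.

```python
import platform


def collect_environment():
    """Gather basic platform details for a bug report (hypothetical helper)."""
    info = {
        "os": platform.platform(),
        # mac_ver() returns an empty release string on non-macOS systems.
        "macos_release": platform.mac_ver()[0],
        "machine": platform.machine(),  # e.g. x86_64 or arm64
        "python": platform.python_version(),
    }
    try:
        import tvm  # optional: only present if the TVM wheel is installed

        info["tvm_version"] = getattr(tvm, "__version__", "unknown")
    except Exception as exc:  # importing a broken wheel can fail in several ways
        info["tvm_version"] = f"unavailable ({type(exc).__name__})"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Running it inside the same virtual environment as mlc_llm keeps the reported Python and TVM versions consistent with the failing setup.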
Additional context
It seems that the problem is not new. The oldest mlc-ai/mlc-llm pair I was able to test is from September:
pip list | grep mlc
mlc-ai-nightly 0.15.dev570
mlc-llm-nightly 0.1.dev1524
The stable mlc_ai_cpu-0.15.1-cp311-cp311-macosx_10_15_x86_64 wheel (from August) has no matching mlc_llm_… wheel, and it does not work with mlc_llm 0.17+, so I could not test it.