
Commit 4e5595c

Authored by masahi, davidpissarra, shenberg, Lunderberg, and cyx-6
Upstream merge Nov20 (includes ft_group quantization support) (#71)
* [API] Add GenerationConfig (mlc-ai#1024) * Fix two bugs in kv-cache backtrack loop (mlc-ai#856) Fix two bugs in kv-cache pop loop Bug 1: old code would stop early because output_ids was shortened in-place during the loop Bug 2: off-by-one in backoff size due to break * [Build] Added --pdb flag to build.py, drop into pdb on error (mlc-ai#1017) This commit adds an optional `--pdb` flag to the `build.py` script. If passed, any exception raised that would otherwise terminate the script will first enter a pdb post-mortem, allowing the error to be inspected. * [Android] Use `AlertDialog` instead of `Toast` (mlc-ai#1039) * Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs (mlc-ai#1040) Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs, build model * [Android] Add Llama2 q4f16_0 (mlc-ai#1041) llama2 q4f160 * [Docs] Model prebuilts tracking page revamp (mlc-ai#1000) * Update compile_models.rst (mlc-ai#1038) fix permission issue * Support for the Stable LM 3B model (mlc-ai#1008) Support for the stablelm-3b-4e1t model * [Docs] Iterate model prebuilts docs (mlc-ai#1043) * Iterate model prebuilts docs * small fix * Update README.md * [CPP] Separate common utils out from llm_chat.cc (mlc-ai#1044) This PR separates out the tokenizer creation function, the random number generator out from `llm_chat.cc` as a preparation step for batching inference support, since these functions/modules are also used in the same way in batching inference. * Update README.md (mlc-ai#1045) Update README.md * add verbose stats to mlc-chat REST API (mlc-ai#1049) * add verbose stats to mlc-chat REST API * update docs * [Transform] Apply split_rotary optimization on prefill (mlc-ai#1033) * [Transform] Apply split_rotary optimization on prefill Prior to this commit, the `transform.fuse_split_rotary_embedding` function was only applicable to the `decode` function of a Llama-type model. This was due to the sequence length being restricted to one, both in the pattern-match rule and in the `split_rotary` function, and the function being restricted to operate only on the `decode` function. This commit updates the `transform.fuse_split_rotary_embedding` pass to be a `tvm.ir.transform.Pass`, operating on all applicable matched in the `IRModule`. The `split_rotary` function is now produced as a fully-generic function, with static parameters substituted in afterwards. At this stage, the sequence length is retained as a dynamic parameter, such that it can be used by the `prefill` function. * Avoid multiple kernel launches for split_rotary * [Docs] Add `mlc.ai/package` to `DEPENDENCY INSTALLATION` group (mlc-ai#1055) Co-authored-by: Junru Shao <[email protected]> * Revert "[Transform] Apply split_rotary optimization on prefill (mlc-ai#1033)" (mlc-ai#1058) This reverts commit b9179cf as elaborated here mlc-ai#1033 (comment) * [BugFix] Set the right `max_sequence_length` for both Llama-1 and Llama-2 families (mlc-ai#1032) * fix * reflect feedback --------- Co-authored-by: “Sunghyun <[email protected]> * [Doc] Use -U instead of --force-reinstall (mlc-ai#1062) `--force-reinstall` will reinstall all dependencies to a python package, which is unnecessary. `-U` is a better choice in this case. * [Model] Initial batching support for Llama (mlc-ai#1048) This PR introduces the initial batched input support for llama models. To make the code managable, we keep both the single-sequence handling flow and the batching handling flow in the Llama modeling. 
Now, with `--enable-batching` as a build argument, we build Llama for the batched version. NOTE: The paged attention kernel/TIR func are not included in this PR, so currently the built library with batching enabled is not runnable. We will follow up with the attention kernel in the future. This PR guarantees that the existing single-sequence inference (Python API, CLI, etc.) is not broken. P.S.. The batching flow is subject to bug fixes as we integrate with the attention function and run the e2e flow in the future. * Fix Stable LM 3B build (mlc-ai#1061) * [stablelm 3b] Rename dynamic vocab size from "v" to "vocab_size" * Add get_num_key_value_heads method to StableLM3bConfig * [Core] Remove duplication in MODEL.get_model calls (mlc-ai#1054) This commit removes the `if`/`elif` chain in `core.py`, where the body of each conditional assigns the same `mod, param_manager, params, model_config`, and is identical except for the choice of model being built. * [ParamManager] Cleanup creation of quantization IRModule (mlc-ai#1053) This commit replaces the single-parameter `relax_model.param_manager.create_quantize_func` function with a method on the `ParamManager`, `create_parameter_transformation`. This avoids potential typos between `param_manager` as the imported Python module `mlc_llm.relax_model.param_manager` and an instance of the `ParamManager` class named `param_manager`, and makes the functionality easier to find. This function also takes an optional `optimize_parameter_order` flag, defaulting to `True`, which applies the `ReorderTransformFunc` pass. Since the `ReorderTransformFunc` is intended to be used with several configuration objects owned by `ParamManager`, this simplifies the common path of producing an optimally-ordered parameter transformation module. * Minor typo fix (mlc-ai#1064) * Add links to Python API Reference (mlc-ai#1068) * [Fix] ChatModule incorrect temperature buffer shape (mlc-ai#1070) PR mlc-ai#1048 updated the signature of softmax in the built model library and changed the temperature buffer shape in ChatModule. This causes some existing demo unable to run since we did not do a round of model library update. This PR reverts the ChatModule change, and adds back the softmax function in non-batching case. With this PR, the regression should be fixed. * [ParamManager] Added progress bar for get_item/set_item (mlc-ai#1063) * [Python] Extract common device str parse function in ChatModule (mlc-ai#1074) This PR lifts the device string parsing (just a few of lines) to a standalone function, so that on the serving side the serving can make use of this function as well. Tested Python API and it does not seem to incur regression. * [Bugfix] Compilation Error in q4f32_1 (mlc-ai#1078) The pass `fuse-split-rotary` assumes the compute dtype is fp16, which usually is, but in certain cases, e.g. `q0f32` and `q4f32_1`, the compute is based on fp32 instead. This PR strengthens the check guard. * Establish `mlc_chat.compiler` (mlc-ai#1082) This PR establishes the compiler components in MLC-Chat Python API, which currently includes two primary components: models and parameters. The models are `nn.Module`-based definition of an LLM, which, as the very first stab, contains only `LlamaForCasualLM`. 
It is decomposed into three files: - `llama_config.py`: common configurations for Llama, where we define relevant configurations of its architecture, as well as include standard config file for Llama2-7B/13B/70B for convenient testing; - `llama.py`: the model architecture of Llama, based on the PyTorch-like `nn.Module` API; - `llama_parameter.py`: defines the mapping between MLC parameters and pytorch parameters. The parameters contains the basic functionality of parameter mapping, and the loaders that effectively convert parameters from PyTorch to MLC according to the mapping specified. Currently, only `HFTorchLoader` is implemented, but loaders like SafeTensor, GPTQ or AWQ should be quite straightforward according to the existing design. On top of this PR, on-the-fly quantization could be defined as a loading time transformation on MLC parameters, while pre-quantized parameter loading is effectively parameter loading after MLC's `nn.Module` is quantized. Two unittests examplify how the infrastructure works: - `./tests/python/model/test_llama.py` shows how to create an `nn.Module` using the new infra, and then convert it to TVM IRModule; - `./tests/python/parameter/hf_torch_loader.py` shows how to load parameters from HuggingFace PyTorch format. Besides, `mlc_chat.support` is established for utility functions, which now contains two utils: - `config.py` which supports reading configurations into dataclasses from JSON file or Python dict. On top of Python dataclass, it throws irrelevant fields into `cls.kwargs`, which is helpful when loading HuggingFace configuration file; - `tqdm.py` which contains tqdm-related utilities, primarily redirecting logging and printing to work nicely with tqdm. * Update README.md for Multi-GPU (mlc-ai#1090) * Support lib_path override in C++. Improvements on docs and error messages (mlc-ai#1086) * Support lib_path option in C++ CLI. Disable ChatConfig.model_lib override in Python API. Improvements on helper messages and error messages * Update docs * Rename lib_path -> model_lib_path * StreamIterator (mlc-ai#1057) Co-authored-by: Varshith <[email protected]> * Update `benchmark.py` according to mlc-ai#1086 (mlc-ai#1091) Update `benchmark.py` * Disable Disco for q4f16_ft and q8f16_ft quantization (mlc-ai#1094) * [Format] Apply isort and black for `python/` (mlc-ai#1097) [Format] Apply isort and black on `python/` The commands I am using are: ``` isort --profile black python/ black python/ ``` It is always recommended to format the code before submission, given we don't have a linter CI yet. * More formatting (mlc-ai#1099) * Enable Python Linter (mlc-ai#1098) This PR enables two Python formatters "black" and "isort" on the following directory: - `./python/` - `./tests/python/` Enabling pylint and mypy is left for future work * Add Basic Pylint and Mypy Tooling (mlc-ai#1100) Add pylint/mypy tooling into pyproject.toml This PR establishes the initial Python tooling infra with Pylint and Mypy. Currently only the newest modules, i.e. `mlc_chat.support` and `mlc_chat.compiler` are covered, and we expect to cover the entire package, as being tracked in mlc-ai#1101. * [CI] Add clang-format (mlc-ai#1103) * [Slim-LM] Smart path finding for config and weight (mlc-ai#1088) * [Transform] Provide IRModule transform for rewrite_attention (mlc-ai#1052) Prior to this commit, `mlc_llm.transform.rewrite_attention` updated a single function. This commit modifies it to instead be a transform operating on any pattern matches within an `IRModule`. 
* [ParamManager] Use BundleModelParams for transform_dequantize (mlc-ai#1056) * [ParamManager] Use BundleModelParams for transform_quantize Prior to this commit, `ParamManager.transform_quantize` function took as input functions with separate parameters for each weight tensor, and produced output functions with a tuple parameter for all weights. Because `LiftTransformParams` had the same convention, neither could be applied as part of the same build flow. This commit updates `ParamManager.transform_quantize` pass to produce outputs with separate tensor parameters, using the `BundleModelParams` transform to later combine them into a single tuple parameter. The analogous change was also performed for `LiftTransformParams` as part of apache/tvm#15657. In addition, prior to this commit, the `ParamManager.transform_dequantize` function operated directly on a `IRModule` object. As a result, any debug instrumentation (e.g. before/after printouts for each pass, before/after verification with `relax.analysis.well_formed`, etc.) did not apply to this `transform_dequantize`. This commit updates `ParamManager.transform_dequantize` to return a `ir.transform.Pass`. * Correct type annotation * [Slim-LM] Introduce HFLoad for loading Pytorch and SafeTensor weights (mlc-ai#1113) * [WINDOWS] reduce noise in windows build (mlc-ai#1115) * Add CLI commands for compilation (mlc-ai#1109) * Auto updated submodule references * fix mismatched argument name (mlc-ai#1117) fix error introduced by recent code changes fixes mlc-ai#1116 * [Docs] Add doc for max and mean gen len, shift factor; and buildArgs (mlc-ai#1119) * Add doc for max and mean gen len, shift factor * Update python docs for BuildArgs * Revert "[ParamManager] Use BundleModelParams for transform_dequantize" (mlc-ai#1120) Revert "[ParamManager] Use BundleModelParams for transform_dequantize (mlc-ai#1056)" This reverts commit e5927ce. This causes a regression impacting all MLC LLM nightlies as it violates the existing calling convention in MLC Chat runtime. An example: mlc-ai#1060 (comment) * Remove inaccurate warning message (mlc-ai#1121) This PR removes an inaccurate warning from mlc-ai#1086, which warns about `model_lib` overriding regardless of whether or not it's actually overridden. With this commit, we only warn if its value is not None. * [REST] OpenAI compatible Rest API (mlc-ai#1107) * add presence and frequency penalty * Added support for passing conversation history in /v1/chat/completions endpoint * Added support for RestAPI parameters max_gen_len, n, and stop_str * * add presence and frequency penalty to generation config * refactor generation config * Added documentation for parameters * replace lib_path with model_lib_path in rest.py * fixed black isort issues * fix lib_path * Add --opt flag parsing to CLI (mlc-ai#1123) * [ParamManager][Redo] Use BundleModelParams for transform_dequantize (mlc-ai#1127) Prior to this commit, `ParamManager.transform_quantize` function took as input functions with separate parameters for each weight tensor, and produced output functions with a tuple parameter for all weights. Because `LiftTransformParams` had the same convention, neither could be applied as part of the same build flow. This commit updates `ParamManager.transform_quantize` pass to produce outputs with separate tensor parameters, using the `BundleModelParams` transform to later combine them into a single tuple parameter. The analogous change was also performed for `LiftTransformParams` as part of apache/tvm#15657. 
In addition, prior to this commit, the `ParamManager.transform_dequantize` function operated directly on a `IRModule` object. As a result, any debug instrumentation (e.g. before/after printouts for each pass, before/after verification with `relax.analysis.well_formed`, etc.) did not apply to this `transform_dequantize`. This commit updates `ParamManager.transform_dequantize` to return a `ir.transform.Pass`. This commit is a repeat of the reverted PR mlc-ai#1056. This PR resolves the bug in the earlier implementation by removing the call to `.without_attr("num_input")` in `ParamReplacer.rewrite_func`. This follows an analogous update in `LiftTransformParams`, preserving the `"num_input"` attribute for use in `BundleModelParams`. * added details to windows installation (mlc-ai#1133) 32bit version of the zstd.dll library was causing issues, so updated the doc to be more specific and download the 64bit version. * Grammatical and Typographical improvements (mlc-ai#1139) * Update faq.rst * Update guideline.rst * Update compile_models.rst * Update distribute_compiled_models.rst * Update get-vicuna-weight.rst * Update python.rst * Update android.rst * Update cli.rst * Update ios.rst * Update javascript.rst * Update python.rst * Update rest.rst * Minor enhancements to `ChatModule` (mlc-ai#1132) Some minor enhancements to `ChatModule`, mainly handle the device parsing solely in `_parse_device_str` instead of handling it both in the member function and the `__init__` function to avoid redundancy; and some type annotation fix. * Updating tvm install docs (mlc-ai#1143) Updating the tvm install docs to assist a user in finding and copying zstd.dll to the correct folder. * Make the help info consistent with program name (mlc-ai#1137) When user use command `mlc_chat_cli --help`, the output will be something like Usage: mlc_chat [--help] ... That's because the program name specified in `cli_main.cc` is "mlc_chat". It will be less confusing if the output of help info shows Usage: mlc_chat_cli [--help] ... * Support parameter packing (mlc-ai#1146) * [Slim-LM] Enable Group Quant (mlc-ai#1129) * Enable group quant via new interface. * Minor fix. * Linting. * Fix isort. * Fix mypy. * TE compute working. * Skip embed. * Support cpu+gpu quantization. * Add target option to tests. * Linting. * Enable Mypy and Pylint in mlc_chat Python Package (mlc-ai#1149) * Migrate Compiler Passes (mlc-ai#1150) * Compile Model Preset without External `config.json` (mlc-ai#1151) This PR adds support for compiling a preset of models without having to provide a `config.json` on disk using the commands below: ```diff python -m mlc_chat.cli.compile \ --quantization q4f16_1 -o /tmp/1.so \ - --config /models/Llama-2-7b-chat-hf + --config llama2_7b ``` This allows easier testing and binary distribution without having to depend on external model directory. * Update attention layer (mlc-ai#1153) Existing dlight optimization only works for NT matmul, but not NN. As a result, the new `nn.Module`-based implementation, which uses NN matmul, fails compilation at HEAD for now. This PR fixes this issue by tweaking `k` to the preferred layout. 
The following commands now work with the new compilation pipeline: ```bash python -m mlc_chat.cli.compile --config llama2_7b --quantization q4f16_1 -o /tmp/1.so python -m mlc_chat.cli.compile --config llama2_13b --quantization q4f16_1 -o /tmp/1.so python -m mlc_chat.cli.compile --config llama2_70b --quantization q4f16_1 -o /tmp/1.so ``` Note that the quantization algorithm per se, `q4f16_1`, has not been implemented yet, meaning this code path is not yet ready for use so far. * Add batched Llama model definition using vLLM paged attention (mlc-ai#1134) * Add batched Llama model with vllm paged attention * update core.py * doc * minor * add e2e test * mv file * clean * Check if TVM has been built with USE_VLLM * update BuildArgs docstring * [Transform][Redo] Apply split_rotary optimization on prefill (mlc-ai#1125) Prior to this commit, the `transform.fuse_split_rotary_embedding` function was only applicable to the `decode` function of a Llama-type model. This was due to the sequence length being restricted to one, both in the pattern-match rule and in the `split_rotary` function, and the function being restricted to operate only on the `decode` function. This commit updates the `transform.fuse_split_rotary_embedding` pass to be a `tvm.ir.transform.Pass`, operating on all applicable matched in the `IRModule`. The `split_rotary` function is now produced as a fully-generic function, with static parameters substituted in afterwards. At this stage, the sequence length is retained as a dynamic parameter, such that it can be used by the `prefill` function. This commit reapplies the reverted commit mlc-ai#1033. The error in the previous implementation was in the definition of `rotary_embedding_offset`, which provided the `query_sequence_length` instead of `kv_sequence_length`. This was able to pass the validity tests described [here](mlc-ai#1058 (comment)), as these two sequence lengths are identical for the first call. * Apply rewrite for normal attention and MQA (mlc-ai#1138) Fixes a bug introduced in mlc-ai#1052, where use of the `--use-flash-attn-mqa` flag on a model that doesn't use MQA would prevent the use of CUTLASS attention at all. * [Rest] Fix emoji handling in Rest API. (mlc-ai#1142) * [Utility] Check for isinstance(exc, Exception) before entering pdb (mlc-ai#1095) This is a follow-up to mlc-ai#1017, which added a `--pdb` flag to enter a debugger on exit. This commit checks the type of the raised exception, and only enters the debugger if it is a subclass of `Exception`. This ensures that implementation-details, such as a thrown `SystemExit` or `KeyboardInterrupt`, do not cause an erroneous entry to pdb. * [Utils] Remove conversion to numpy array in utils.save_params (mlc-ai#1083) Prior to this commit, each parameter was converted to a numpy-owned array as part of a total size computation. This commit computes the size directly, removing the conversion. * [Fix][REST] Use lowered-cased "app" (mlc-ai#1159) * [Rest] Document emoji handling (mlc-ai#1160) Followup PR of mlc-ai#1142 to document the emoji handling. * Enable group quant transform with nn.Module (mlc-ai#1154) * Enable group quant transform with nn.Module This PR completes the group quantization support for `nn.Module` based model. 
* remove deprecated tests * Update * wip * remove deprecated test * fix lint * fix lint * fix lint --------- Co-authored-by: Junru Shao <[email protected]> * Misc Cleanups of Compilation Pipeline (mlc-ai#1165) * Support CUDA Multi-Arch Compilation (mlc-ai#1166) * [Bugfix] Cannot find global function `mlc.llm_chat_create` (mlc-ai#1167) * Fix RWKV Support (mlc-ai#1136) I successfully ran the rwkv-world-3b fp16 model on my Xiaomi phone. This PR is to fix a bug on the main branch where the rwkv model outputs only one word and then stop. ![image](https://github.com/mlc-ai/mlc-llm/assets/35585791/6514d6ef-c93c-4ad2-8e76-8ffa0663080f) * Auto updated submodule references * Fix Android app Permission denied error on Android 10 (mlc-ai#1175) Use scoped storage instead of Downloads directory Co-authored-by: Animesh Bohara <[email protected]> * [SLM] Fix group quantization (mlc-ai#1172) This PR fixes the group quantization and add related unit tests. * [Fix] TIR block name of dequantization (mlc-ai#1177) * [SLM][AutoLLM] Enable Command Line Weight Conversion (mlc-ai#1170) This PR enables weight conversion in command line. Sample command: `python3 -m mlc_chat.cli.convert_weight --config dist/models/llama-2-13b-chat-hf/ --quantization "q4f16_1" --output dist/test/` * [Fix][SLM] Update q4f16 quantization with the new mutator name rule (mlc-ai#1178) [Fix] Update q4f16 quantization with the new mutator name rule * [Model Support][SWA] Add support for sliding window attention for Mistral (mlc-ai#1087) * mistral base * Add sliding window mask making and its tests * Small changes for sliding window mask * Clean up mask making * Remove kv_seq_len * Add prefill chunking, handle max window size in SWA * Add interleave kv * Temporary fix for kv seq len * Pass in more shapes to SWA prefill and decode in runtime * mistral var fix * Small changes regarding shape passing * Small fix on chunk size * Add build args, fix mlc chat config dump * mistral system prompt --------- Co-authored-by: David Pissarra <[email protected]> Co-authored-by: David Pissarra <[email protected]> * Add Python API for Weight Conversion (mlc-ai#1182) This PR primarily does a major refactoring to introduce Python API that is consistent with the CLI API. Besides, it includes the following fixes and enhancements: - More info provided to `isort` for better formatting in `pyproject.toml`; - Print out the default value of all arguments in argparse command line; - Ensure `--device` is always available locally when doing weight conversion; - Add argument echoing in weight conversion to be consistent with its counterpart in compilation; - Add a consistency checker to make sure the shapes/dtypes of all tensors from weight conversion is consistent with compilation; - Echo the total size of parameters; - Better logging of each parameter's shape and dtype, and either or not its quantized; - More structure robustification, renaming `parameter/` to `loader/` to be more explicit about its intention; - Inline and remove `ParamQuantizer` into the loader to improve logging and the logic flow; - Always add instructions "Use `--xxx` to override" for any options that are auto detected to be more informative to end users; - Fix wrong shape calculation when quantizing `nn.Embedding`; - Fix wrong dtype calculation in group quantization when the input dtype is different from model dtype (e.g. 
"float32" in torch, but the model dtype in quantization is fp16 in `q4f16_1`); - Fix inconsistent param names in layers such as `GroupQuantizeLinear`; - Fix dtype inconsistency when a parameter is not quantized; - Fix existing unittests. * Merge `llama_config.CONFIG` into `MODEL_PRESETS` (mlc-ai#1188) * Merge llama_config.py into llama_model.py (mlc-ai#1189) * Add CodeLlama as part of model presets (mlc-ai#1190) * [Docs] Clarify zstd installation on Windows (mlc-ai#1191) * [Docs] Clarify zstd installation on Windows (mlc-ai#1196) Update zstd installation * Support overriding `--max-sequence-length` in command line (mlc-ai#1197) * [RestAPI] Added docs (mlc-ai#1193) Add docs for RestAPI Co-authored-by: Animesh Bohara <[email protected]> * [API] ```llm-vscode``` extension support (mlc-ai#1198) This PR enables ```llm-vscode``` extension API support for copilot-like code completion, following [HF's LSP](https://github.com/huggingface/llm-ls). Fully compatible with ```CodeLlama``` and ```starcoder``` on mlc-llm. - huggingface/llm-vscode#103 enhances extension user experience when used with mlc-llm rest api. Thanks @ pacman100, who came up with this on his latest blogpost: https://huggingface.co/blog/personal-copilot * [Fix] Use `fabs` as floating point abs function in C++ (mlc-ai#1202) * Integrating MLC runtime with the new compilation workflow (mlc-ai#1203) * [Fix] Remove Redundant Warnings (mlc-ai#1204) PR mlc-ai#1203 introduces some unnecessary and redundant logging messages. This PR gets them removed. * Try fix macOS build with picojson (mlc-ai#1206) The error message below ``` /usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h: In member function 'std::string picojson::value::to_str() const': /usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h:494:37: error: expected ')' before 'PRId64' 494 | SNPRINTF(buf, sizeof(buf), "%" PRId64, u_.int64_); | ~ ^~~~~~~ | ) /usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h:81:1: note: 'PRId64' is defined in header '<cinttypes>'; did you forget to '#include <cinttypes>'? 80 | #include <errno.h> +++ |+#include <cinttypes> 81 | #include <inttypes.h> ``` indicates that the `__STDC_FORMAT_MACROS` flag is not turned on for some reason. * Try fix macOS build with picojson again (mlc-ai#1207) Try fix macOS build with picojson * Auto updated submodule references * [Fix] Keep update-to-date with upstream API change (mlc-ai#1209) * Detect `mtriple` via LLVM (mlc-ai#1211) * Fix Python3.8 compatibility breakage (mlc-ai#1210) The breakage was resulting from newer syntax being used for type annotations, as part of mlc-ai#592. So long as `mlc_chat.interface.openai_api` wasn't imported, the breaking changes were not encountered. In mlc-ai#1107, the addition of `from .interface.openai_api import ChatMessage` caused this module to be imported, breaking compatibility of `mlc_chat.ChatModule` with Python3.8. This commit updates the type annotations to the supported syntax. * [Slim-LM] Enable loading from AWQ pre-quantized weight. (mlc-ai#1114) * [SLM] Enable loading from AWQ pre-quantized weight. 
* remove awq_loader.py * Update to the latest commit * Delete llama_parameter.py * update unittest * fix lint * upd * add Llama-2-7B-AWQ * [Bugfix] Fix Cannot import name '_LIB' from 'mlc_chat.base' (mlc-ai#1214) Fix Python API doc * [SLM] Support `q3f16_1` and `q4f32_1` (mlc-ai#1215) This PR supports the int3 and float32 group quantization, and fixes some minor issue in quantization impl and tests. * Make the Compilation Working E2E (mlc-ai#1218) * [Mistral][SWA] Add sliding window to metadata (mlc-ai#1217) Add sliding window to metadata, make smalle changes to invariants in runtime * Support for `chatml` format conversation (for TinyLlama-1.1B-Chat-v0.2) (mlc-ai#956) * added support for chatml format conversation * added template to factory * Add Rust Support for MLC-LLM (mlc-ai#1213) This PR introduces Rust language support for the MLC-LLM project, specifically targeting supporting the `ChatModule` interface. It utilizes the existing C++ implementation of MLC-LLM and leverages both TVM's C API and its Rust bindings. The `rust/examples/mlc_chat.rs` gives an example of how to create a `chat_module` and serve user prompts in Rust. The primary goal of this PR is to enrich the MLC-LLM ecosystem by offering a Rust interface that aligns with the current Python API. This enhancement will empower Rust developers to integrate MLC-LLM into their codebase and applications. **Followup PRs**: - Extend the feature set to achieve parity with the C++/Python interface. - Refine the Rust API, ensuring robustness. - Set up Rust CI if needed. * [Bugfix] Remove dependency on openai_api in chat module (mlc-ai#1222) Remove dependency on openai_api * Bake in RAM Usage in the Generated DSO (mlc-ai#1224) With this PR, the metadata in a DSO file using `vm["_metadata"]()` now have information about the upper bound RAM estimate on each function. As an example, the JSON string now is: ```json { "quantization": "q4f16_1", "model_type": "llama", "memory_usage": { "_initialize_effect": 0, "prefill": 136192, "softmax_with_temperature": 0, "decode": 218624 }, "params": [ {"name": "model.embed_tokens.q_weight", "shape": [32000, 512], "dtype": "uint32"}, {"name": "model.embed_tokens.q_scale", "shape": [32000, 128], "dtype": "float16"}, ... ] } ``` This helps the MLC runtime to better determine if a method is going to OOM and plan ahead, e.g. plan pre-allocated KVCache, accordingly. The idea originates from Ruihang's ancient PR that prints memory usage estimate as debugging information for demo purposes, and this PR further enhances it to IRModule-level attribute that can be used by the runtime. * [Fix] ChatModule python messages and offset types (mlc-ai#1220) small fixes * [Fix] Variable Upperbound Should be Injected before Build Pipeline (mlc-ai#1225) Now it shows a more reasonable upper bound for sequence length = 4096. ```json { "_initialize_effect": 0, "prefill": 3479311360, "softmax_with_temperature": 0, "decode": 34531840 } ``` Thanks Ruihang for helping with the fix! * [MultiGPU] Support pre-sharded model weights (mlc-ai#1096) * [Bugfix] Correct input shape for shard info function Prior to this commit, the sharding functions sharded axis converted from `orig_size * num_shards` to `orig_size // num_shards`. This commit updates the sharding functions to instead convert from `orig_size` to `orig_size // num_shards`. 
* [Bugfix] Include LegalizeOps in utils.convert_weights Prior to this commit, `utils.convert_weights` assumes that the parameter transformation module is already legalized, and uses no relax operations that require legalization. This commit adds a call to `relax.transform.LegalizeOps` to remove this assumption. * [MultiGPU] Cleanup create_shard_info_func - De-duplicate the `if param.shard_strategy == foo` if/else chain - Return a `tvm.IRModule` instead of modifying an existing module * Extract a ParamManager.optimize_transform_param_order method * Extract ParamManager.create_parameter_transformation call from convert_weights * Support writing of pre-sharded weights * Support execution using pre-sharded weights * Updating for review comments * fix typo * [AWQ] e2e awq-quantized model (mlc-ai#1229) * [SLM] Support `q0f16` and `q0f32` (mlc-ai#1228) This PR adds the support of `q0f16` and `q0f32`, and change `RMSNorm` to `nn.RMSNorm`. * [Core][Llama] Argument `max_vocab_size` and `max_batch_size` (mlc-ai#1076) This PR introduces the `max_vocab_size` and `max_batch_size` as two new compile arguments. The purpose is for better memory planning. Besides, this PR updates llama to make use of the two arguments. Other models are not changed yet. The default value for `max_vocab_size` is set to 40000, which I think is larger than the values of most models. The default value for `max_batch_size` is currently set as 256. It is possible that we update this value in the future to have a good default number. * [Llama] Support batched prefill (mlc-ai#1233) This PR supports the Llama modeling with batched prefill, which can bring higher throughput for the overall prefill process in serving. Besides, the PR splits the attention function used in batching settings into two separate ones, so that we do not dispatch to the prefill/decode attention functions at runtime. * [Core] Skip PrimExpr index int32 downcasting for batching (mlc-ai#1234) This PR makes the ForceNarrowIndexToInt32 to skip application when batching is enabled. The reason is because the flattened index of the KV cache append function may exceed the range of int32 when the cache is large. For example, in Llama-7b, when a KV cache supports more than 8192 tokens, the total cache size will be at least ``` 8192 * 2 (K/V) * 32 (layers) * 4096 = 2147483648, ``` which reaches the maximum int32 value. * Auto updated submodule references * Update index.rst (mlc-ai#1236) Fixed typo on tab:: Android * Update android.rst (mlc-ai#1237) On linux, TVM_NDK_CC environment variable should contain linux-x86_64 * Correct typo in cuda device name for rust chat model (mlc-ai#1241) * Generating mlc-chat-config.json (mlc-ai#1238) This PR finishes the last piece of new compilation pipeline, i.e. generation of `mlc-chat-config.json` and other configuration files. * Rename `--config` to `--model` and Consolidate CLI Messages (mlc-ai#1244) * Specify argument "dest" in argparse (mlc-ai#1245) * Add more stats during quantization (mlc-ai#1246) * ensure that max_gen_len is set properly in mlc_chat_config (mlc-ai#1249) Currently, `max_gen_len` defaults to 512 in `dump_mlc_chat_config`. However, the instantiations of `dump_mlc_chat_config` within `mlc-llm.build` currently omit the `max_gen_len` argument (even when it's specified in the HF config), so the default of 512 gets set for every `mlc-chat-config.json` that is created by `mlc-llm.build`. This PR fixes the issue. 
* [Fix] Memory usage statistics (mlc-ai#1252) * Introduce mlc_chat subcommands (mlc-ai#1251) This PR makes it possible to use subcommands of `mlc_chat` package to control quantization and compilation. Example: ```bash python -m mlc_chat convert_weight \ --model /models/Llama-2/hf/Llama-2-7b-chat-hf \ --quantization q4f16_1 \ -o ./dist/new-llama/ python -m mlc_chat gen_mlc_chat_config \ --model ./dist/models/Llama-2-7b-hf \ --quantization q4f16_1 \ --max-sequence-length 4096 \ --conv-template LM \ -o ./dist/new-llama \ python -m mlc_chat compile \ --model ./dist/models/Llama-2-7b-hf \ --quantization q4f16_1 \ --max-sequence-length 4096 \ -o ./dist/new-llama/llama.so ``` It slightly simplifies the workflow. * Update mlc-chat-config.json (mlc-ai#1254) This PR updates two fields: * `tokenizer_files`, which now non-existent files are removed from this list; * `model_preset_tag` added to `model_config`, which helps the system to conveniently identify if a model configuration is already part of the system's built-in model preset. * [Rust] Support multiple prompts (mlc-ai#1253) This PR introduces `Prompt` and `ChatMessage` structures, and enhances the `ChatModule` to generate tokens using either a single string (via `Prompt::String`) or a vector of `ChatMessage` (via `Prompt::MessageList`). An example is provided in [rust/examples/mlc_chat.rs](https://github.com/mlc-ai/mlc-llm/compare/main...YuchenJin:mlc-llm:multi-prompts?expand=1#diff-4ffa9349207c1df6ceeebe06a9afc8f2015000e031b39d677bbbe7e85ae2819b). Here is a snippet demonstrating the interface: ```rust let message1 = ChatMessage { role: "user".to_owned(), content: "suppose we already have projects llama, alpaca and vicuna, what do you think would be a great name for the next project?".to_string(), }; let message2 = ChatMessage { role: "assistant".to_owned(), content: "based on the previous projects, a possible name for the next project could be \"cervidae\" which is the scientific name for deer family. this name reflects the collaboration and teamwork involved in the development of the project, and also nods to the previous projects that have been developed by the team.".to_string(), }; let message3 = ChatMessage { role: "user".to_owned(), content: "Summarize our conversations".to_string(), }; let messages = vec![message1, message2, message3]; let prompt = Prompt::MessageList(messages); let output = cm.generate(&prompt, None).unwrap(); ``` * [UI] Correct "convert_weight_only" to "convert_weights_only" (mlc-ai#1227) * [UI] Correct "convert_weight_only" to "convert_weights_only" This is a frequent typo among multiple developers, as "weights" is typically plural. This commit updates the command-line-argument from `--convert-weight-only` to `--convert-weights-only`. For backwards compatibility, the original spelling is kept as an equivalent usage. * Update all use of "convert_weight_only" to "convert_weights_only" * Add a downloader from HuggingFace (mlc-ai#1258) This PR allows programmably downloading from HuggingFace to MLC's cache directory, which locates in `$HOME/.cache/mlc_chat/model_weights/` by default. This PR relies on Git to clone the metadata, and Python's requests library to fetch concrete weights as large files instead of the less reliable Git LFS. 
The example demonstrates downloading the 4-bit quantized Llama2-7B model: ```python from mlc_chat.support.download import download_mlc_weights download_mlc_weights("HF://mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1") ``` Screenshot: <img width="1913" alt="image" src="https://github.com/mlc-ai/mlc-llm/assets/22515877/3ac50594-4971-4216-bb17-47710b4af1dd"> * [Fix] Add prefix_tokens to `ConvConfig` in Python to match C++ implementation (mlc-ai#1256) During my Rust implementation of the project, I noticed an inconsistency between the Python and C++ implementations of `ConvConfig`. Specifically, the Python version lacks the `prefix_tokens` field, which is present in the C++ version.: https://github.com/mlc-ai/mlc-llm/blob/5e02cacd8ebba2e0206d5a447225a137de2dac0d/cpp/conversation.h#L69-L70. This can cause the [`_load_json_override`](https://github.com/mlc-ai/mlc-llm/blob/5e02cacd8ebba2e0206d5a447225a137de2dac0d/python/mlc_chat/chat_module.py#L1062C26-L1062C26) fails to work in the `_prefill` function. I think a simple unit test would help, I'd like to add a regression test if the CI has been set up. * [nn.Module] Mistral implementation (mlc-ai#1230) * Add mistral high level structure * Small config change * Now work with compile, mistral inference logic left * Add masking, cache_len, kv_seq_len; only attention forward left * fix mistral override naming * `interleave_kv` implementation * lint fix * move rolling buffer cache impl to mlc-llm * isort fix * nn.module implementation - reorganize structure * Update python/mlc_chat/cli/compile.py Co-authored-by: Junru Shao <[email protected]> * lint fix --------- Co-authored-by: Charlie Ruan <[email protected]> Co-authored-by: Junru Shao <[email protected]> * Add `mlc_chat.__main__` as command line entrypoint (mlc-ai#1263) This PR makes it possible to invoke mlc_chat subcommands directly. Previously one has to use `python -m` as the prefix to invoke `mlc_chat`: ```bash python -m mlc_chat compile \ --model /models/Llama-2-7b-chat-hf \ --quantization q4f16_1 \ --max-sequence-length 4096 \ -o ./llama.so ``` This PR makes is possible to use it without the `python -m` prefix: ```bash mlc_chat compile \ --model /models/Llama-2-7b-chat-hf \ --quantization q4f16_1 \ --max-sequence-length 4096 \ -o ./llama.so ``` * [Rust] Improve ergonomics of `generate` function in `ChatModule` (mlc-ai#1262) Following PR mlc-ai#1253, I think ergonomics of the `generate` function of `ChatModule` can be improved (given it's an important public-facing API). This PR simplifies the function's usage by implementing the `From` trait for the `Prompt` enum. Also updated the example code. Now the interface changes to: ```rust /// Single prompt case: cm.generate("what is the meaning of life?", None) /// Multiple prompt case: let messages: Vec<ChatMessage> = vec![message1, message2, message3]; let output = cm.generate(messages, None).unwrap(); ``` * [Fix] mistral `max_gen_len` (mlc-ai#1264) * Rename `max-sequence-length` to `context-window-size` (mlc-ai#1265) "Context window" is a terminology better aligned with LLM world. Whenever a new model is trained, it is one of the most important metrics that people care about. Therefore, I'd love to switch it over sooner than later, before "mlc_chat compile" becomes mature and documented. * Auto updated submodule references * Fix group quantization shape infer (mlc-ai#1273) This PR fixes the shape infer for group quantization. 
* Continuous Model Delivery (mlc-ai#1272) This PR provides a script that automatically quantizes models from HuggingFace using various quantization formats as specified. Example: When being provided the following JSON file: ```json { "destination": "{username}/{model_id}-{quantization}", # Name of HF repo "default_quantization": ["q0f16", "q0f32", "q3f16_1", "q4f16_1", "q4f32_1"], "tasks": [ { "model_id": "Llama-2-7b-hf", "model": "/models/Llama-2-7b-hf", # Can be HF URL or a local path "context_window_size": 4096, "conv_template": "LM", "quantization": [ { "format": "q4f16_awq", "model": "https://huggingface.co/TheBloke/Llama-2-7B-AWQ", # Overriding default `source` "source_format": "awq" } ] } ] } ``` The script will automatically run quantization and upload them to the following repos: - https://huggingface.co/junrushao/Llama-2-7b-hf-q0f16 - https://huggingface.co/junrushao/Llama-2-7b-hf-q0f32 - https://huggingface.co/junrushao/Llama-2-7b-hf-q3f16_1 - https://huggingface.co/junrushao/Llama-2-7b-hf-q4f16_1 - https://huggingface.co/junrushao/Llama-2-7b-hf-q4f32_1 - https://huggingface.co/junrushao/Llama-2-7b-hf-q4f16_awq * Auto updated submodule references * Enhance Model Delivery (mlc-ai#1283) This PR introduces a few enhancements: - Allow to override temporary path via environment variable `MLC_TEMP_DIR`; - Add a 10-time retry when uploading the quantized weights to HuggingFace Hub. It could fail at times; - Echo the commands being used to quantize the models in `logs.txt`; - Fix a compatibility issue when pulling individual weights down from HuggingFace Hub in Git LFS. * add python, rest api test (mlc-ai#1278) * add python, rest api test * remove mistral, fix pylint * fix pylint requests import error * Enable Jenkins CI (mlc-ai#1292) * fix * Update android.rst (mlc-ai#1289) This fix enables default models in app-config.json to get shown "downloaded" in model list via with adb push method for the default models * more fix * Consolidate Logics for GPU Detection (mlc-ai#1297) This PR unifies automatic device detection logic by using `mlc_chat.support.auto_device`, which comes with detailed logging and fallback mechanisms. * [CI] Fix lint concurrent clone issue (mlc-ai#1299) This PR fixes the broken CI due to different tasks sharing the same workspace. 
* Auto updated submodule references * [Feature] Prefill chunking for non-SWA models (mlc-ai#1280) * generalize `prefill-chunk-size` * renaming `cache_len` to `rolling_cache_len` * [nn.Module] generalize `prefill_chunk_size` * quick fix * lint fix * check sw with chunking * fix `_attach_variable_bounds` * update config from lib metadata * cleanup cleanup * metadata fix * Compatible with chatglm (mlc-ai#979) compatible for chatglm * Add q4/q8_ft_group quantization mode (mlc-ai#1284) * Add q4/q8_ft_group quantization mode * Update submodule * fix * restore multi gpu support for FT quant --------- Co-authored-by: David Pissarra <[email protected]> Co-authored-by: Roee Shenberg <[email protected]> Co-authored-by: Eric Lunderberg <[email protected]> Co-authored-by: Yaxing Cai <[email protected]> Co-authored-by: Charlie Ruan <[email protected]> Co-authored-by: Bohan Hou <[email protected]> Co-authored-by: yongjer <[email protected]> Co-authored-by: Jeethu Rao <[email protected]> Co-authored-by: Junru Shao <[email protected]> Co-authored-by: Ruihang Lai <[email protected]> Co-authored-by: Denise Kutnick <[email protected]> Co-authored-by: Lesheng Jin <[email protected]> Co-authored-by: Junru Shao <[email protected]> Co-authored-by: Sunghyun Park <[email protected]> Co-authored-by: “Sunghyun <[email protected]> Co-authored-by: Rick Zhou <[email protected]> Co-authored-by: Varshith Bathini <[email protected]> Co-authored-by: Varshith <[email protected]> Co-authored-by: Tianqi Chen <[email protected]> Co-authored-by: Git bot <[email protected]> Co-authored-by: SingLi <[email protected]> Co-authored-by: Kartik Khandelwal <[email protected]> Co-authored-by: Goutham Tamilselvan <[email protected]> Co-authored-by: S A G A R <[email protected]> Co-authored-by: Yuchen Jin <[email protected]> Co-authored-by: DavidSharma <[email protected]> Co-authored-by: fennecJ <[email protected]> Co-authored-by: Xiyou Zhou <[email protected]> Co-authored-by: Xiaoyu Zhang <[email protected]> Co-authored-by: Animesh Bohara <[email protected]> Co-authored-by: Animesh Bohara <[email protected]> Co-authored-by: David Pissarra <[email protected]> Co-authored-by: Antonio Calatrava <[email protected]> Co-authored-by: Aman Kushwaha <[email protected]> Co-authored-by: Malcolm Ramsay <[email protected]> Co-authored-by: Denise Kutnick <[email protected]> Co-authored-by: Charlie Ruan <[email protected]> Co-authored-by: Siyuan Feng <[email protected]> Co-authored-by: Masahiro Masuda <[email protected]> Co-authored-by: ChaoQin <[email protected]> Co-authored-by: Wuwei Lin <[email protected]>
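The int32 overflow that motivates the "Skip PrimExpr index int32 downcasting for batching" item above (mlc-ai#1234) can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only; it uses the Llama-7B figures quoted in the commit message, and the helper name is made up here, not part of the codebase:

```python
# Flattened KV-cache element count for Llama-7B once the cache holds 8192 tokens:
# tokens * 2 (K and V) * 32 layers * 4096 hidden size.
INT32_MAX = 2**31 - 1  # 2147483647


def kv_cache_elements(num_tokens: int, num_layers: int = 32, hidden_size: int = 4096) -> int:
    """Total number of cached elements whose flattened index must stay addressable."""
    return num_tokens * 2 * num_layers * hidden_size


total = kv_cache_elements(8192)
print(total)              # 2147483648
print(total > INT32_MAX)  # True -> int32 indices would overflow, hence ForceNarrowIndexToInt32 is skipped
```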
1 parent baa4fa6 commit 4e5595c


67 files changed (+2936 / -935 lines)

.github/workflows/lint.yml

Lines changed: 0 additions & 87 deletions
This file was deleted.

ci/jenkinsfile.groovy

Lines changed: 91 additions & 0 deletions
New file contents:

```groovy
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

import org.jenkinsci.plugins.pipeline.modeldefinition.Utils

image = 'mlcaidev/ci-cpu:caab922'
docker_run = "bash ci/bash.sh ${image}"

def per_exec_ws(folder) {
  return "workspace/exec_${env.EXECUTOR_NUMBER}/" + folder
}

def init_git(submodule = false) {
  checkout scm
  if (submodule) {
    retry(5) {
      timeout(time: 2, unit: 'MINUTES') {
        sh(script: 'git submodule update --init --recursive -f', label: 'Update git submodules')
      }
    }
  }
}

stage('Lint') {
  parallel(
    'isort': {
      node('CPU-SMALL') {
        ws(per_exec_ws('mlc-llm-lint-isort')) {
          init_git()
          sh(script: "ls", label: 'debug')
          sh(script: "${docker_run} conda env export --name ci-lint", label: 'Checkout version')
          sh(script: "${docker_run} bash ci/task/isort.sh", label: 'Lint')
        }
      }
    },
    'black': {
      node('CPU-SMALL') {
        ws(per_exec_ws('mlc-llm-lint-black')) {
          init_git()
          sh(script: "ls", label: 'debug')
          sh(script: "${docker_run} conda env export --name ci-lint", label: 'Checkout version')
          sh(script: "${docker_run} bash ci/task/black.sh", label: 'Lint')
        }
      }
    },
    'mypy': {
      node('CPU-SMALL') {
        ws(per_exec_ws('mlc-llm-lint-mypy')) {
          init_git()
          sh(script: "ls", label: 'debug')
          sh(script: "${docker_run} conda env export --name ci-lint", label: 'Checkout version')
          sh(script: "${docker_run} bash ci/task/mypy.sh", label: 'Lint')
        }
      }
    },
    'pylint': {
      node('CPU-SMALL') {
        ws(per_exec_ws('mlc-llm-lint-pylint')) {
          init_git()
          sh(script: "ls", label: 'debug')
          sh(script: "${docker_run} conda env export --name ci-lint", label: 'Checkout version')
          sh(script: "${docker_run} bash ci/task/pylint.sh", label: 'Lint')
        }
      }
    },
    'clang-format': {
      node('CPU-SMALL') {
        ws(per_exec_ws('mlc-llm-lint-clang-format')) {
          init_git()
          sh(script: "ls", label: 'debug')
          sh(script: "${docker_run} conda env export --name ci-lint", label: 'Checkout version')
          sh(script: "${docker_run} bash ci/task/clang-format.sh", label: 'Lint')
        }
      }
    },
  )
}
```

ci/task/mypy.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -8,4 +8,4 @@ export PYTHONPATH="./python:$PYTHONPATH"
 
 set -x
 
-mypy ./python/ ./tests/python/
+mypy --install-types --non-interactive ./python/ ./tests/python/
```

ci/task/pylint.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -9,7 +9,7 @@ export PYTHONPATH="./python:$PYTHONPATH"
 set -x
 
 # TVM Unity is a dependency to this testing
-pip install --quiet --pre -U -f https://mlc.ai/wheels mlc-ai-nightly
+pip install --quiet --pre -U -f https://mlc.ai/wheels mlc-ai-nightly requests
 
 pylint --jobs $NUM_THREADS ./python/
 pylint --jobs $NUM_THREADS --recursive=y ./tests/python/
```

cpp/llm_chat.cc

Lines changed: 28 additions & 31 deletions
```diff
@@ -317,25 +317,33 @@ class LLMChat {
     return os.str();
   }
 
-  bool UpdateMaxWindowSizeFromMetadata() {
+  void UpdateConfigFromMetadata() {
     if (ft_.use_disco) {
-      return false;
-    }
-    if (this->sliding_window_ != -1) {
-      return false;
+      return;
     }
+
     PackedFunc fget_metadata = ft_.mod_get_func("get_metadata");
     if (fget_metadata == nullptr) {
-      return false;
+      return;
     }
     ObjectRef ret = fget_metadata();
     std::string metadata_str = std::string(Downcast<String>(ret));
     picojson::value metadata_info;
     picojson::parse(metadata_info, std::string(metadata_str));
     auto metadata = metadata_info.get<picojson::object>();
+
     ICHECK(metadata["max_window_size"].is<int64_t>());
     max_window_size_ = std::min(max_window_size_, metadata["max_window_size"].get<int64_t>());
-    return true;
+
+    if (metadata.count("prefill_chunk_size")) {
+      ICHECK(metadata["prefill_chunk_size"].is<int64_t>());
+      prefill_chunk_size_ =
+          std::min(prefill_chunk_size_, metadata["prefill_chunk_size"].get<int64_t>());
+    }
+    if (metadata.count("sliding_window")) {
+      ICHECK(metadata["sliding_window"].is<int64_t>());
+      sliding_window_ = std::min(sliding_window_, metadata["sliding_window"].get<int64_t>());
+    }
   }
 
   /*!
@@ -410,21 +418,12 @@ class LLMChat {
           << "Cannot specify both sliding_window and max_window_size.";
       this->sliding_window_ = config["sliding_window"].get<int64_t>();
       CHECK(this->sliding_window_ > 0) << "Sliding window size needs to be positive";
-      CHECK(config.count("sliding_window_chunk_size"))
+      CHECK(config.count("prefill_chunk_size"))
           << "Need to specify chunk size if using sliding window attention.";
     }
-    if (config.count("sliding_window_chunk_size")) {
-      CHECK(config["sliding_window_chunk_size"].is<int64_t>());
-      this->sliding_window_chunk_size_ = config["sliding_window_chunk_size"].get<int64_t>();
-      CHECK(this->sliding_window_chunk_size_ > 0)
-          << "Sliding window chunk size needs to be positive";
-      CHECK(config.count("sliding_window")) << "Need to specify sliding window size.";
-    }
-    if (config.count("model_name")) {
-      CHECK(config["model_name"].is<std::string>());
-      this->model_name_ = config["model_name"].get<std::string>();
-    } else {
-      CHECK(partial_update) << "Key \"model_name\" not found.";
+    if (config.count("prefill_chunk_size")) {
+      CHECK(config["prefill_chunk_size"].is<int64_t>());
+      this->prefill_chunk_size_ = config["prefill_chunk_size"].get<int64_t>();
     }
     if (config.count("top_p")) {
       CHECK(config["top_p"].is<double>());
@@ -513,8 +512,8 @@ class LLMChat {
     // so there is no explicit abi dependency on these extra
     // classes other than basic tvm runtime.
     this->ft_.Init(reload_lib, device_, this->num_shards_);
+    UpdateConfigFromMetadata();
     if (this->sliding_window_ == -1) {
-      UpdateMaxWindowSizeFromMetadata();
       CHECK(max_window_size_ != std::numeric_limits<int64_t>::max())
           << "Key \"max_window_size\" not found.";
     }
@@ -807,9 +806,8 @@ class LLMChat {
     if (ft_.use_disco) {
      LOG(FATAL) << "NotImplementedError: Distributed inference is not supported for this model";
     }
-    if (this->sliding_window_ != -1) {
-      LOG(FATAL)
-          << "NotImplementedError: Sliding window attention does not support separate embedding";
+    if (this->prefill_chunk_size_ != -1) {
+      LOG(FATAL) << "NotImplementedError: Separate embedding does not support chunking";
     }
     NDArray embedding = Downcast<NDArray>(
         EmbedStep(inp, append_conversation, place_in_prompt, generation_config_str));
@@ -832,10 +830,10 @@ class LLMChat {
 
     int32_t new_seq_len = total_seq_len_;
     NDArray logits_on_device;
-    if (this->sliding_window_ != -1) {
-      // Use chunking if we use sliding window attention (see Mistral paper figure 3).
-      for (int64_t begin = 0; begin < token_len; begin += this->sliding_window_chunk_size_) {
-        int64_t end = std::min(token_len, begin + this->sliding_window_chunk_size_);
+    if (this->prefill_chunk_size_ > 0) {
+      // Perform chunking.
+      for (int64_t begin = 0; begin < token_len; begin += this->prefill_chunk_size_) {
+        int64_t end = std::min(token_len, begin + this->prefill_chunk_size_);
         std::vector<int32_t> chunk =
             std::vector<int32_t>(prompt_tokens.begin() + begin, prompt_tokens.begin() + end);
         new_seq_len += static_cast<int64_t>(chunk.size());
@@ -844,6 +842,7 @@
       ICHECK_EQ(new_seq_len, total_seq_len_ + token_len) << "Expect chunking process all tokens";
     } else {
       // Otherwise, prefill entire prompt at once.
+      CHECK(sliding_window_ == -1) << "Expect chunking with sliding window attention";
       new_seq_len += token_len;
       logits_on_device = this->ForwardTokens(prompt_tokens, new_seq_len);
     }
@@ -1356,16 +1355,14 @@
   //----------------------------
   // Conversation
   //----------------------------
-  // model name
-  std::string model_name_;
   // conversation
   Conversation conversation_;
   // total sequence len,
   int64_t total_seq_len_{0};
   // max window size, mean and max generation length, sliding window
   // If we use sliding window, max window size is its default max() value
   int64_t max_window_size_{std::numeric_limits<int64_t>::max()}, mean_gen_len_{128},
-      max_gen_len_{512}, sliding_window_{-1}, sliding_window_chunk_size_{-1};
+      max_gen_len_{512}, sliding_window_{-1}, prefill_chunk_size_{-1};
   // size of the vocab table
   int64_t vocab_size_;
   // number of shards in distributed inference
```
docs/deploy/android.rst

Lines changed: 3 additions & 3 deletions
```diff
@@ -33,7 +33,7 @@ Prerequisite
       TVM_NDK_CC: $ANDROID_NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android24-clang
       # Example on Windows
       ANDROID_NDK: $HOME/Library/Android/sdk/ndk/25.2.9519653
-      TVM_NDK_CC: $ANDROID_NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android24-clang
+      TVM_NDK_CC: $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android24-clang
 
 **JDK**, such as OpenJDK >= 17, to compile Java bindings of TVM Unity runtime. It could be installed via Homebrew on macOS, apt on Ubuntu or other package managers. Set up the following environment variable:
 
@@ -164,6 +164,6 @@ Instructions have been provided to build an Android App with MLC LLM in previous
 .. code-block:: bash
 
    adb install android/MLCChat/app/release/app-release.apk
-   adb push dist/${MODEL_NAME}-${QUANTIZATION}/params /data/local/tmp/${MODEL_NAME}/
+   adb push dist/${MODEL_NAME}-${QUANTIZATION}/params /data/local/tmp/${MODEL_NAME}-${QUANTIZATION}/
    adb shell "mkdir -p /storage/emulated/0/Android/data/ai.mlc.mlcchat/files/"
-   adb shell "mv /data/local/tmp/${MODEL_NAME} /storage/emulated/0/Android/data/ai.mlc.mlcchat/files/${MODEL_NAME}"
+   adb shell "mv /data/local/tmp/${MODEL_NAME}-${QUANTIZATION} /storage/emulated/0/Android/data/ai.mlc.mlcchat/files/"
```

docs/index.rst

Lines changed: 1 addition & 1 deletion
```diff
@@ -151,7 +151,7 @@ It is recommended to have at least 6GB free VRAM to run it.
 - Redmi Note 12 Pro with Snapdragon 685
 - Google Pixel phones
 
-**Tutorial and source code**. The source code of the iOS app is fully `open source <https://github.com/mlc-ai/mlc-llm/tree/main/android>`__,
+**Tutorial and source code**. The source code of the android app is fully `open source <https://github.com/mlc-ai/mlc-llm/tree/main/android>`__,
 and a :doc:`tutorial <deploy/android>` is included in documentation.
 
 .. figure:: https://blog.mlc.ai/img/android/android-recording.gif
```

mlc_llm/build.py

Lines changed: 7 additions & 6 deletions
```diff
@@ -40,17 +40,18 @@ def main():
     # Post processing of arguments
     parsed_args = core._parse_args(parsed_args)  # pylint: disable=protected-access
 
-    # if num_shard>1 without -convert-weight-only or --build-model-only, we implicitly run it sequentially
-    if parsed_args.num_shards > 1 and not (parsed_args.build_model_only or parsed_args.convert_weight_only):
+    # if num_shard>1 without -convert-weight-only or --build-model-only, we implicitly run it sequentially
+    if parsed_args.num_shards > 1 and not (parsed_args.build_model_only or parsed_args.convert_weights_only):
         parsed_args.build_model_only = True
-        parsed_args.convert_weight_only = False  # just to be explicit
+        parsed_args.convert_weights_only = False  # just to be explicit
         core.build_model_from_args(parsed_args)
-
+
         parsed_args.build_model_only = False
-        parsed_args.convert_weight_only = True
+        parsed_args.convert_weights_only = True
         core.build_model_from_args(parsed_args)
     else:
         core.build_model_from_args(parsed_args)
-
+
+
 if __name__ == "__main__":
     main()
```
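As a quick illustration of the implicit two-phase behaviour shown above (build the sharded model first, then convert weights), the following self-contained Python sketch mimics the decision logic with a stand-in args namespace and build function; both are hypothetical and only trace the control flow:

```python
from types import SimpleNamespace


def build_model_from_args(args):
    # Stand-in for mlc_llm.core.build_model_from_args, used only to show which phase runs.
    phase = "build model" if args.build_model_only else (
        "convert weights" if args.convert_weights_only else "build + convert")
    print(f"num_shards={args.num_shards}: {phase}")


args = SimpleNamespace(num_shards=2, build_model_only=False, convert_weights_only=False)

# Mirrors mlc_llm/build.py: with num_shards > 1 and neither flag set,
# the build is implicitly split into two sequential passes.
if args.num_shards > 1 and not (args.build_model_only or args.convert_weights_only):
    args.build_model_only, args.convert_weights_only = True, False
    build_model_from_args(args)
    args.build_model_only, args.convert_weights_only = False, True
    build_model_from_args(args)
else:
    build_model_from_args(args)
```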
