Commit 206103b

[Docs] Add doc for max and mean gen len, shift factor; and buildArgs (mlc-ai#1119)
* Add doc for max and mean gen len, shift factor
* Update python docs for BuildArgs
1 parent 488017d commit 206103b

File tree: 2 files changed (+36 -3 lines)

mlc_llm/core.py

Lines changed: 25 additions & 1 deletion

@@ -79,11 +79,35 @@ class BuildArgs:
         Build with separated embedding layer, only applicable to LlaMa. This
         feature is in testing stage, and will be formally replaced after massive
         overhaul of embedding feature for all models and use cases.
+    cc_path: str
+        ``/path/to/cross_compiler_path``; currently only used for cross-compile
+        for nvidia/jetson device.
+    use_safetensors: bool
+        Specifies whether to use ``.safetensors`` instead of the default ``.bin``
+        when loading in model weights.
     enable_batching: bool
         Build the model for batched inference.
         This is a temporary flag used to control the model execution flow in single-
         sequence and batching settings for now. We will eventually merge two flows
         in the future and remove this flag then.
+    no_cutlass_attn: bool
+        Disable offloading attention operations to CUTLASS.
+    no_cutlass_norm: bool
+        Disable offloading layer and RMS norm operations to CUTLASS.
+    no_cublas: bool
+        Disable the step that offloads matmul to cuBLAS. Without this flag,
+        matmul will be offloaded to cuBLAS if quantization mode is ``q0f16`` or
+        ``q0f32``, target is CUDA and TVM has been built with cuBLAS enabled.
+    use_cuda_graph: bool
+        Specifies whether to enable CUDA Graph for the decoder. MLP and QKV
+        projection between two attention layers are put into a graph.
+    num_shards: int
+        Number of shards to split the model into in tensor parallelism multi-gpu
+        inference. Only useful when ``build_model_only`` is set.
+    use_flash_attn_mqa: bool
+        Offload multi-query attention workload to Flash Attention.
+    pdb: bool
+        If set, drop into a pdb debugger on error.
     """
     model: str = field(
         default="auto",

@@ -217,7 +241,7 @@ class BuildArgs:
             "help": (
                 "Disable the step that offloads matmul to cuBLAS. Without this flag, "
                 "matmul will be offloaded to cuBLAS if quantization mode is q0f16 or q0f32, "
-                "target is CUDA and TVM has been built with cuBLAS enbaled."
+                "target is CUDA and TVM has been built with cuBLAS enabled."
             ),
             "action": "store_true",
         },
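
To show how the newly documented build options fit together, here is a minimal, hedged sketch of constructing a ``BuildArgs`` instance. It assumes ``BuildArgs`` is importable from ``mlc_llm.core`` as shown in this diff; the model name and quantization value are illustrative assumptions and are not part of this commit, and the build entry point (which varies between releases) is deliberately omitted.

```python
# Minimal sketch (not part of this commit): constructing BuildArgs with some of
# the options documented above. The model name and quantization string below are
# illustrative assumptions; only the field names come from the docstring.
from mlc_llm.core import BuildArgs

args = BuildArgs(
    model="Llama-2-7b-chat-hf",   # assumed model name; the field defaults to "auto"
    quantization="q0f16",         # with q0f16/q0f32 on CUDA and cuBLAS-enabled TVM, matmul is offloaded to cuBLAS
    use_safetensors=True,         # load ``.safetensors`` weights instead of the default ``.bin``
    use_cuda_graph=True,          # capture decoder MLP and QKV projections in a CUDA graph
    no_cublas=False,              # keep the cuBLAS matmul offloading step enabled
    num_shards=2,                 # tensor-parallel shards; only meaningful with build_model_only
)
print(args)
```

Passing such an object to the library's build function is version-dependent and not shown here.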

python/mlc_chat/chat_module.py

Lines changed: 11 additions & 2 deletions

@@ -91,7 +91,7 @@ class ChatConfig:
     :class:`mlc_chat.ChatModule` instance to override the default setting in
     ``mlc-chat-config.json`` under the model folder.

-    Since the configuraiton is partial, everything will be ``Optional``.
+    Since the configuration is partial, everything will be ``Optional``.

     Note that we will exploit this class to also represent ``mlc-chat-config.json``
     during intermediate processing.

@@ -131,14 +131,19 @@ class ChatConfig:
         For additional information on top-p sampling, please refer to this blog
         post: https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling.
     mean_gen_len : Optional[int]
+        The approximated average number of generated tokens in each round. Used
+        to determine whether the maximum window size would be exceeded.
     max_gen_len : Optional[int]
+        The maximum number of tokens to be generated in each round. Would simply
+        stop generating after this number is exceeded.
     shift_fill_factor : Optional[float]
+        The fraction of maximum window size to shift when it is exceeded.
     tokenizer_files : Optional[List[str]]
         List of tokenizer files of the model.
     conv_config : Optional[ConvConfig]
         The partial overriding configuration for conversation template. Will first
         load the predefined template with the name specified in ``conv_template``
-        and then override some of the configuraitons specified in ``conv_config``.
+        and then override some of the configurations specified in ``conv_config``.
     model_category : Optional[str]
         The category of the model's architecture (e.g. ``llama``, ``gpt_neox``, ``rwkv``).
     model_name : Optional[str]

@@ -216,7 +221,11 @@ class GenerationConfig:
         For additional information on top-p sampling, please refer to this blog
         post: https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling.
     mean_gen_len : Optional[int]
+        The approximated average number of generated tokens in each round. Used
+        to determine whether the maximum window size would be exceeded.
     max_gen_len : Optional[int]
+        The maximum number of tokens to be generated in each round. Would simply
+        stop generating after this number is exceeded.
     """

     temperature: Optional[float] = None
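
As a usage illustration of the parameters documented above, here is a minimal, hedged sketch of overriding them through ``ChatConfig`` and ``GenerationConfig``. The model string and the exact ``ChatModule``/``generate`` signatures are assumptions about this version of the ``mlc_chat`` package, not something this commit adds.

```python
# Minimal sketch (not part of this commit): overriding the generation-length
# settings documented above. The model string below is an illustrative assumption.
from mlc_chat.chat_module import ChatConfig, ChatModule, GenerationConfig

cfg = ChatConfig(
    mean_gen_len=128,       # expected average tokens per round, used for the window-overflow check
    max_gen_len=512,        # hard cap: generation stops once this many tokens are produced
    shift_fill_factor=0.3,  # fraction of the maximum window to shift when it is exceeded
)

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1", chat_config=cfg)

# Per-call override via GenerationConfig, assuming generate() accepts it in this version.
output = cm.generate(
    "What is the capital of Canada?",
    generation_config=GenerationConfig(max_gen_len=64),
)
print(output)
```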
