The REST API provides the following endpoints:

.. http:post:: /v1/completions

------------------------------------------------

Get a completion from MLC-Chat using a prompt.

**Request body**

**model**: *str* (required)
   The model folder after compiling with the MLC-LLM build process. The parameter
   can either be the model name with its quantization scheme
   (e.g. ``Llama-2-7b-chat-hf-q4f16_1``), or a full path to the model
   folder. In the former case, we will use the provided name to search
   for the model folder over possible paths.
**prompt**: *str* (required)
   The prompt for the model to complete.
**stream**: *bool* (optional)
   Whether to stream the response. If ``True``, the response is streamed back
   as the model generates it. If ``False``, the response is returned only
   after the model finishes generating it.
**temperature**: *float* (optional)
   The temperature applied to logits before sampling. The default value is
   ``0.7``. A higher temperature encourages more diverse outputs, while a
   lower temperature produces more deterministic outputs.
**top_p**: *float* (optional)
   This parameter determines the set of tokens from which we sample during
   decoding. The default value is ``0.95``. At each step, we select tokens
   from the minimal set whose cumulative probability exceeds the ``top_p``
   parameter.

   For additional information on top-p sampling, please refer to this blog
   post: https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling.
**repetition_penalty**: *float* (optional)
   The repetition penalty controls the likelihood of the model generating
   repeated texts. The default value is ``1.0``, indicating that no
   repetition penalty is applied. Increasing the value reduces the
   likelihood of repeated text generation. However, setting a high
   ``repetition_penalty`` may result in the model generating meaningless
   texts. The ideal choice of repetition penalty may vary among models.

   For more details on how repetition penalty controls text generation, please
   check out the CTRL paper (https://arxiv.org/pdf/1909.05858.pdf).
**presence_penalty**: *float* (optional)
   Positive values penalize new tokens if they are already present in the text so far,
   decreasing the model's likelihood to repeat tokens.
**frequency_penalty**: *float* (optional)
   Positive values penalize new tokens based on their existing frequency in the text so far,
   decreasing the model's likelihood to repeat tokens.
**mean_gen_len**: *int* (optional)
   The approximate average number of tokens generated in each round. Used
   to determine whether the maximum window size would be exceeded.
**max_gen_len**: *int* (optional)
   This parameter determines the maximum length of the generated text. If it is
   not set, the model generates text until it encounters a stop token.

------------------------------------------------

**Returns**
   If ``stream`` is set to ``False``, the response is a ``CompletionResponse`` object.
   If ``stream`` is set to ``True``, the response is a stream of ``CompletionStreamResponse`` objects.

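For example, a minimal non-streaming completion request can be sent with the Python ``requests`` package. This is a sketch rather than a full client: the server address ``http://127.0.0.1:8000`` and the model name are assumptions that you should adjust to your own deployment.

.. code:: python

   import requests

   # Assumed address of a locally running MLC-Chat REST server.
   API_BASE = "http://127.0.0.1:8000"

   payload = {
       "model": "Llama-2-7b-chat-hf-q4f16_1",  # model name with quantization scheme, or a full path
       "prompt": "What is the meaning of life?",
       "stream": False,
       "temperature": 0.7,
       "top_p": 0.95,
   }

   # Non-streaming call: the server answers with a single CompletionResponse.
   response = requests.post(f"{API_BASE}/v1/completions", json=payload)
   response.raise_for_status()
   print(response.json()["choices"][0]["text"])
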
.. http:post:: /v1/chat/completions

------------------------------------------------

Get a response from MLC-Chat using a prompt, either with or without streaming.

**Request body**

**model**: *str* (required)
   The model folder after compiling with the MLC-LLM build process. The parameter
   can either be the model name with its quantization scheme
   (e.g. ``Llama-2-7b-chat-hf-q4f16_1``), or a full path to the model
   folder. In the former case, we will use the provided name to search
   for the model folder over possible paths.
**messages**: *list[ChatMessage]* (required)
   A list of chat messages. The last message should be from the user.
**stream**: *bool* (optional)
   Whether to stream the response. If ``True``, the response is streamed back
   as the model generates it. If ``False``, the response is returned only
   after the model finishes generating it.
**temperature**: *float* (optional)
   The temperature applied to logits before sampling. The default value is
   ``0.7``. A higher temperature encourages more diverse outputs, while a
   lower temperature produces more deterministic outputs.
**top_p**: *float* (optional)
   This parameter determines the set of tokens from which we sample during
   decoding. The default value is ``0.95``. At each step, we select tokens
   from the minimal set whose cumulative probability exceeds the ``top_p``
   parameter.

   For additional information on top-p sampling, please refer to this blog
   post: https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling.
**repetition_penalty**: *float* (optional)
   The repetition penalty controls the likelihood of the model generating
   repeated texts. The default value is ``1.0``, indicating that no
   repetition penalty is applied. Increasing the value reduces the
   likelihood of repeated text generation. However, setting a high
   ``repetition_penalty`` may result in the model generating meaningless
   texts. The ideal choice of repetition penalty may vary among models.

   For more details on how repetition penalty controls text generation, please
   check out the CTRL paper (https://arxiv.org/pdf/1909.05858.pdf).
**presence_penalty**: *float* (optional)
   Positive values penalize new tokens if they are already present in the text so far,
   decreasing the model's likelihood to repeat tokens.
**frequency_penalty**: *float* (optional)
   Positive values penalize new tokens based on their existing frequency in the text so far,
   decreasing the model's likelihood to repeat tokens.
**mean_gen_len**: *int* (optional)
   The approximate average number of tokens generated in each round. Used
   to determine whether the maximum window size would be exceeded.
**max_gen_len**: *int* (optional)
   This parameter determines the maximum length of the generated text. If it is
   not set, the model generates text until it encounters a stop token.
**n**: *int* (optional)
   This parameter determines the number of text samples to generate. The default
   value is ``1``. Note that this parameter is only used when ``stream`` is set to
   ``False``.
**stop**: *str* or *list[str]* (optional)
   When ``stop`` is encountered, the model stops generating output.
   It can be a string or a list of strings. If it is a list of strings, the model
   stops generating output when any of the strings in the list is encountered.
   Note that this parameter does not override the default stop string of the model.

------------------------------------------------

**Returns**
   If ``stream`` is set to ``False``, the response is a ``ChatCompletionResponse`` object.
   If ``stream`` is set to ``True``, the response is a stream of ``ChatCompletionStreamResponse`` objects.

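As a sketch of how these parameters fit together, the following non-streaming request asks for two candidate replies and adds a custom stop string. As before, the server address and model name are assumptions; substitute your own.

.. code:: python

   import requests

   API_BASE = "http://127.0.0.1:8000"  # assumed address of the local REST server

   payload = {
       "model": "Llama-2-7b-chat-hf-q4f16_1",
       "messages": [
           {"role": "user", "content": "Write a haiku about compilers."},
       ],
       "stream": False,
       "n": 2,            # request two candidate replies (only honored when stream is False)
       "stop": ["\n\n"],  # additionally stop at the first blank line
       "temperature": 0.7,
   }

   response = requests.post(f"{API_BASE}/v1/chat/completions", json=payload)
   response.raise_for_status()
   for choice in response.json()["choices"]:
       print(choice["index"], choice["finish_reason"])
       print(choice["message"]["content"])
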
.. http:get:: /chat/reset
Reset the chat.
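
A small usage sketch, again assuming the server runs locally on port 8000 and using the HTTP method documented above:

.. code:: python

   import requests

   API_BASE = "http://127.0.0.1:8000"  # assumed address of the local REST server

   # Clear the current conversation so the next request starts from a fresh context.
   requests.get(f"{API_BASE}/chat/reset").raise_for_status()
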
Get the verbose runtime stats (encode/decode speed, total runtime).

Request Objects
---------------

**ChatMessage**

**role**: *str* (required)
   The role (author) of the message. It can be either ``user`` or ``assistant``.
**content**: *str* (required)
   The content of the message.
**name**: *str* (optional)
   The name of the author of the message.

Response Objects
----------------

**CompletionResponse**

**id**: *str*
   The id of the completion.
**object**: *str*
   The object name ``text.completion``.
**created**: *int*
   The time when the completion was created.
**choices**: *list[CompletionResponseChoice]*
   A list of choices generated by the model.
**usage**: *UsageInfo* or *None*
   The usage information of the model.

------------------------------------------------

**CompletionResponseChoice**

**index**: *int*
   The index of the choice.
**text**: *str*
   The message generated by the model.
**finish_reason**: *str*
   The reason why the model finished generating the message. It can be either
   ``stop`` or ``length``.

------------------------------------------------

**CompletionStreamResponse**

**id**: *str*
   The id of the completion.
**object**: *str*
   The object name ``text.completion.chunk``.
**created**: *int*
   The time when the completion was created.
**choices**: *list[CompletionResponseStreamChoice]*
   A list of choices generated by the model.

------------------------------------------------

**CompletionResponseStreamChoice**

**index**: *int*
   The index of the choice.
**text**: *str*
   The message generated by the model.
**finish_reason**: *str*
   The reason why the model finished generating the message. It can be either
   ``stop`` or ``length``.

------------------------------------------------

**ChatCompletionResponse**

**id**: *str*
   The id of the completion.
**object**: *str*
   The object name ``chat.completion``.
**created**: *int*
   The time when the completion was created.
**choices**: *list[ChatCompletionResponseChoice]*
   A list of choices generated by the model.
**usage**: *UsageInfo* or *None*
   The usage information of the model.

------------------------------------------------

**ChatCompletionResponseChoice**

**index**: *int*
   The index of the choice.
**message**: *ChatMessage*
   The message generated by the model.
**finish_reason**: *str*
   The reason why the model finished generating the message. It can be either
   ``stop`` or ``length``.

------------------------------------------------

**ChatCompletionStreamResponse**

**id**: *str*
   The id of the completion.
**object**: *str*
   The object name ``chat.completion.chunk``.
**created**: *int*
   The time when the completion was created.
**choices**: *list[ChatCompletionResponseStreamChoice]*
   A list of choices generated by the model.

------------------------------------------------

**ChatCompletionResponseStreamChoice**

**index**: *int*
   The index of the choice.
**delta**: *DeltaMessage*
   The delta message generated by the model.
**finish_reason**: *str*
   The reason why the model finished generating the message. It can be either
   ``stop`` or ``length``.

------------------------------------------------

**DeltaMessage**

**role**: *str*
   The role (author) of the message. It can be either ``user`` or ``assistant``.
**content**: *str*
   The content of the message.

------------------------------------------------

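To illustrate how the streaming objects above are consumed, here is a rough sketch that accumulates the ``DeltaMessage`` content of each ``ChatCompletionStreamResponse`` chunk into a full reply. It assumes the same local server address as before and that chunks arrive as server-sent ``data:`` lines terminated by ``[DONE]``; this framing is an assumption and may differ in your deployment.

.. code:: python

   import json

   import requests

   API_BASE = "http://127.0.0.1:8000"  # assumed address of the local REST server

   payload = {
       "model": "Llama-2-7b-chat-hf-q4f16_1",
       "messages": [{"role": "user", "content": "Tell me a short story."}],
       "stream": True,
   }

   reply = ""
   with requests.post(f"{API_BASE}/v1/chat/completions", json=payload, stream=True) as response:
       response.raise_for_status()
       for line in response.iter_lines():
           if not line:
               continue
           chunk = line.decode("utf-8")
           # Assumed server-sent-events framing: "data: {json}" per chunk, "data: [DONE]" at the end.
           if chunk.startswith("data:"):
               chunk = chunk[len("data:"):].strip()
           if chunk == "[DONE]":
               break
           delta = json.loads(chunk)["choices"][0]["delta"]
           reply += delta.get("content", "")

   print(reply)
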
Use REST API in your own program
--------------------------------