
Commit 0e08845

Authored by anibohara2000 (Animesh Bohara)

[RestAPI] Added docs (mlc-ai#1193)

Add docs for RestAPI

Co-authored-by: Animesh Bohara <[email protected]>

1 parent 3417505 commit 0e08845

File tree

2 files changed: +260 -3 lines changed


docs/deploy/rest.rst

Lines changed: 256 additions & 0 deletions
@@ -74,12 +74,136 @@ The REST API provides the following endpoints:
.. http:post:: /v1/completions

------------------------------------------------

   Get a completion from MLC-Chat using a prompt.

**Request body**

**model**: *str* (required)
   The model folder after compiling with the MLC-LLM build process. The parameter
   can either be the model name with its quantization scheme
   (e.g. ``Llama-2-7b-chat-hf-q4f16_1``), or a full path to the model
   folder. In the former case, we will use the provided name to search
   for the model folder over possible paths.
**prompt**: *str* (required)
   The prompt to generate a completion for.
**stream**: *bool* (optional)
   Whether to stream the response. If ``True``, the response will be streamed
   as the model generates it. If ``False``, the response will be returned
   after the model finishes generating.
**temperature**: *float* (optional)
   The temperature applied to logits before sampling. The default value is
   ``0.7``. A higher temperature encourages more diverse outputs, while a
   lower temperature produces more deterministic outputs.
**top_p**: *float* (optional)
   This parameter determines the set of tokens from which we sample during
   decoding. The default value is set to ``0.95``. At each step, we select
   tokens from the minimal set whose cumulative probability exceeds
   the ``top_p`` parameter.

   For additional information on top-p sampling, please refer to this blog
   post: https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling.
**repetition_penalty**: *float* (optional)
   The repetition penalty controls the likelihood of the model generating
   repeated texts. The default value is set to ``1.0``, indicating that no
   repetition penalty is applied. Increasing the value reduces the
   likelihood of repeated text generation. However, setting a high
   ``repetition_penalty`` may result in the model generating meaningless
   texts. The ideal choice of repetition penalty may vary among models.

   For more details on how repetition penalty controls text generation, please
   check out the CTRL paper (https://arxiv.org/pdf/1909.05858.pdf).
**presence_penalty**: *float* (optional)
   Positive values penalize new tokens if they are already present in the text so far,
   decreasing the model's likelihood to repeat tokens.
**frequency_penalty**: *float* (optional)
   Positive values penalize new tokens based on their existing frequency in the text so far,
   decreasing the model's likelihood to repeat tokens.
**mean_gen_len**: *int* (optional)
   The approximate average number of generated tokens in each round. Used
   to determine whether the maximum window size would be exceeded.
**max_gen_len**: *int* (optional)
   This parameter determines the maximum length of the generated text. If it is
   not set, the model will generate text until it encounters a stop token.

------------------------------------------------

**Returns**
   If ``stream`` is set to ``False``, the response will be a ``CompletionResponse`` object.
   If ``stream`` is set to ``True``, the response will be a stream of ``CompletionStreamResponse`` objects.
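A minimal sketch of calling this endpoint from Python with the ``requests``
package; the server address and the model name below are assumptions and should
be adapted to your local setup:

.. code-block:: python

   import requests

   # Assumed address of the locally running MLC-Chat REST server.
   base_url = "http://127.0.0.1:8000"

   payload = {
       "model": "Llama-2-7b-chat-hf-q4f16_1",   # assumed model name with quantization scheme
       "prompt": "Write a short poem about the ocean.",
       "stream": False,
       "temperature": 0.7,
       "top_p": 0.95,
   }
   resp = requests.post(f"{base_url}/v1/completions", json=payload)
   resp.raise_for_status()

   # With stream=False the body is a single CompletionResponse object.
   completion = resp.json()
   print(completion["choices"][0]["text"])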
.. http:post:: /v1/chat/completions

------------------------------------------------

   Get a response from MLC-Chat using a prompt, either with or without streaming.

**Request body**

**model**: *str* (required)
   The model folder after compiling with the MLC-LLM build process. The parameter
   can either be the model name with its quantization scheme
   (e.g. ``Llama-2-7b-chat-hf-q4f16_1``), or a full path to the model
   folder. In the former case, we will use the provided name to search
   for the model folder over possible paths.
**messages**: *list[ChatMessage]* (required)
   A list of chat messages. The last message should be from the user.
**stream**: *bool* (optional)
   Whether to stream the response. If ``True``, the response will be streamed
   as the model generates it. If ``False``, the response will be returned
   after the model finishes generating.
**temperature**: *float* (optional)
   The temperature applied to logits before sampling. The default value is
   ``0.7``. A higher temperature encourages more diverse outputs, while a
   lower temperature produces more deterministic outputs.
**top_p**: *float* (optional)
   This parameter determines the set of tokens from which we sample during
   decoding. The default value is set to ``0.95``. At each step, we select
   tokens from the minimal set whose cumulative probability exceeds
   the ``top_p`` parameter.

   For additional information on top-p sampling, please refer to this blog
   post: https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling.
**repetition_penalty**: *float* (optional)
   The repetition penalty controls the likelihood of the model generating
   repeated texts. The default value is set to ``1.0``, indicating that no
   repetition penalty is applied. Increasing the value reduces the
   likelihood of repeated text generation. However, setting a high
   ``repetition_penalty`` may result in the model generating meaningless
   texts. The ideal choice of repetition penalty may vary among models.

   For more details on how repetition penalty controls text generation, please
   check out the CTRL paper (https://arxiv.org/pdf/1909.05858.pdf).
**presence_penalty**: *float* (optional)
   Positive values penalize new tokens if they are already present in the text so far,
   decreasing the model's likelihood to repeat tokens.
**frequency_penalty**: *float* (optional)
   Positive values penalize new tokens based on their existing frequency in the text so far,
   decreasing the model's likelihood to repeat tokens.
**mean_gen_len**: *int* (optional)
   The approximate average number of generated tokens in each round. Used
   to determine whether the maximum window size would be exceeded.
**max_gen_len**: *int* (optional)
   This parameter determines the maximum length of the generated text. If it is
   not set, the model will generate text until it encounters a stop token.
**n**: *int* (optional)
   This parameter determines the number of text samples to generate. The default
   value is ``1``. Note that this parameter is only used when ``stream`` is set to
   ``False``.
**stop**: *str* or *list[str]* (optional)
   When ``stop`` is encountered, the model will stop generating output.
   It can be a string or a list of strings. If it is a list of strings, the model
   will stop generating output when any of the strings in the list is encountered.
   Note that this parameter does not override the default stop string of the model.

------------------------------------------------

**Returns**
   If ``stream`` is set to ``False``, the response will be a ``ChatCompletionResponse`` object.
   If ``stream`` is set to ``True``, the response will be a stream of ``ChatCompletionStreamResponse`` objects.
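A minimal sketch of a non-streaming chat request with the ``requests`` package;
the server address and the model name are assumptions:

.. code-block:: python

   import requests

   base_url = "http://127.0.0.1:8000"   # assumed address of the local REST server

   payload = {
       "model": "Llama-2-7b-chat-hf-q4f16_1",   # assumed model name
       "messages": [
           {"role": "user", "content": "What is the capital of France?"},
       ],
       "stream": False,
   }
   resp = requests.post(f"{base_url}/v1/chat/completions", json=payload)
   resp.raise_for_status()

   # With stream=False the body is a single ChatCompletionResponse object;
   # each choice carries a full ChatMessage in its "message" field.
   reply = resp.json()["choices"][0]["message"]
   print(reply["role"], ":", reply["content"])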
.. http:get:: /chat/reset

   Reset the chat.
@@ -92,6 +216,138 @@ The REST API provides the following endpoints:
   Get the verbose runtime stats (encode/decode speed, total runtime).
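A small sketch of fetching the runtime stats with ``requests`` (the server
address is an assumption):

.. code-block:: python

   import requests

   base_url = "http://127.0.0.1:8000"   # assumed address of the local REST server

   # /stats returns the verbose runtime stats as text.
   stats = requests.get(f"{base_url}/stats")
   print(stats.text)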
Request Objects
---------------

**ChatMessage**

**role**: *str* (required)
   The role (author) of the message. It can be either ``user`` or ``assistant``.
**content**: *str* (required)
   The content of the message.
**name**: *str* (optional)
   The name of the author of the message.
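For illustration, a ``ChatMessage`` as it might appear in the ``messages`` list
of a request body (the values are made up):

.. code-block:: python

   message = {
       "role": "user",       # "user" or "assistant"
       "content": "Summarize the plot of Hamlet in two sentences.",
       "name": "alice",      # optional author name
   }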
Response Objects
----------------

**CompletionResponse**

**id**: *str*
   The id of the completion.
**object**: *str*
   The object name ``text.completion``.
**created**: *int*
   The time when the completion is created.
**choices**: *list[CompletionResponseChoice]*
   A list of choices generated by the model.
**usage**: *UsageInfo* or *None*
   The usage information of the model.

------------------------------------------------
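An illustrative sketch (not captured from a real server) of a decoded
``CompletionResponse``; the id, timestamp, and text are invented:

.. code-block:: python

   {
       "id": "cmpl-hypothetical-id",
       "object": "text.completion",
       "created": 1699000000,           # Unix timestamp
       "choices": [
           {
               "index": 0,
               "text": "Once upon a time, ...",
               "finish_reason": "stop",
           },
       ],
       "usage": None,                   # UsageInfo or None
   }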
**CompletionResponseChoice**

**index**: *int*
   The index of the choice.
**text**: *str*
   The message generated by the model.
**finish_reason**: *str*
   The reason why the model finishes generating the message. It can be either
   ``stop`` or ``length``.

------------------------------------------------

**CompletionStreamResponse**

**id**: *str*
   The id of the completion.
**object**: *str*
   The object name ``text.completion.chunk``.
**created**: *int*
   The time when the completion is created.
**choices**: *list[CompletionResponseStreamChoice]*
   A list of choices generated by the model.

------------------------------------------------

**CompletionResponseStreamChoice**

**index**: *int*
   The index of the choice.
**text**: *str*
   The message generated by the model.
**finish_reason**: *str*
   The reason why the model finishes generating the message. It can be either
   ``stop`` or ``length``.

------------------------------------------------

**ChatCompletionResponse**

**id**: *str*
   The id of the completion.
**object**: *str*
   The object name ``chat.completion``.
**created**: *int*
   The time when the completion is created.
**choices**: *list[ChatCompletionResponseChoice]*
   A list of choices generated by the model.
**usage**: *UsageInfo* or *None*
   The usage information of the model.

------------------------------------------------

**ChatCompletionResponseChoice**

**index**: *int*
   The index of the choice.
**message**: *ChatMessage*
   The message generated by the model.
**finish_reason**: *str*
   The reason why the model finishes generating the message. It can be either
   ``stop`` or ``length``.

------------------------------------------------

**ChatCompletionStreamResponse**

**id**: *str*
   The id of the completion.
**object**: *str*
   The object name ``chat.completion.chunk``.
**created**: *int*
   The time when the completion is created.
**choices**: *list[ChatCompletionResponseStreamChoice]*
   A list of choices generated by the model.

------------------------------------------------

**ChatCompletionResponseStreamChoice**

**index**: *int*
   The index of the choice.
**delta**: *DeltaMessage*
   The delta message generated by the model.
**finish_reason**: *str*
   The reason why the model finishes generating the message. It can be either
   ``stop`` or ``length``.

------------------------------------------------

**DeltaMessage**

**role**: *str*
   The role (author) of the message. It can be either ``user`` or ``assistant``.
**content**: *str*
   The content of the message.

------------------------------------------------
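A sketch of consuming a streamed chat response: each chunk is a
``ChatCompletionStreamResponse`` whose choices carry a ``DeltaMessage``, and the
content deltas are concatenated to rebuild the full reply. The server address,
the model name, and the OpenAI-style ``data:`` server-sent-event framing are
assumptions here:

.. code-block:: python

   import json
   import requests

   base_url = "http://127.0.0.1:8000"   # assumed address of the local REST server

   payload = {
       "model": "Llama-2-7b-chat-hf-q4f16_1",   # assumed model name
       "messages": [{"role": "user", "content": "Tell me a short story."}],
       "stream": True,
   }

   reply = ""
   with requests.post(f"{base_url}/v1/chat/completions", json=payload, stream=True) as resp:
       for raw in resp.iter_lines():
           # Assumed framing: each chunk arrives as b"data: {...}", ending with b"data: [DONE]".
           if not raw.startswith(b"data:"):
               continue
           chunk = raw.decode("utf-8")[len("data:"):].strip()
           if chunk == "[DONE]":
               break
           delta = json.loads(chunk)["choices"][0]["delta"]
           reply += delta.get("content") or ""
   print(reply)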
Use REST API in your own program
--------------------------------

python/mlc_chat/interface/openai_api.py

Lines changed: 4 additions & 3 deletions
@@ -107,13 +107,14 @@ class CompletionRequest(BaseModel):
 class CompletionResponseChoice(BaseModel):
     index: int
     text: str
-    logprobs: int | None = None
     finish_reason: Literal["stop", "length"] | None = None
+    # TODO: logprobs support
+    logprobs: int | None = None


 class CompletionResponse(BaseModel):
     id: str = Field(default_factory=lambda: f"cmpl-{shortuuid.random()}")
-    object: str = "text_completion"
+    object: str = "text.completion"
     created: int = Field(default_factory=lambda: int(time.time()))
     choices: list[CompletionResponseChoice]
     usage: UsageInfo
@@ -127,7 +128,7 @@ class CompletionResponseStreamChoice(BaseModel):
 class CompletionStreamResponse(BaseModel):
     id: str = Field(default_factory=lambda: f"cmpl-{shortuuid.random()}")
-    object: str = "text_completion"
+    object: str = "text.completion.chunk"
     created: int = Field(default_factory=lambda: int(time.time()))
     choices: List[CompletionResponseStreamChoice]