llama_cpp/llama.py: 53 additions, 7 deletions
```diff
@@ -1863,13 +1863,27 @@ def create_completion(
             suffix: A suffix to append to the generated text. If None, no suffix is appended.
             max_tokens: The maximum number of tokens to generate. If max_tokens <= 0 or None, the maximum number of tokens to generate is unlimited and depends on n_ctx.
             temperature: The temperature to use for sampling.
-            top_p: The top-p value to use for sampling.
+            top_p: The top-p value to use for nucleus sampling. Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
+            min_p: The min-p value to use for minimum p sampling. Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
+            typical_p: The typical-p value to use for sampling. Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
             logprobs: The number of logprobs to return. If None, no logprobs are returned.
             echo: Whether to echo the prompt.
             stop: A list of strings to stop generation when encountered.
+            frequency_penalty: The penalty to apply to tokens based on their frequency in the prompt.
+            presence_penalty: The penalty to apply to tokens based on their presence in the prompt.
             repeat_penalty: The penalty to apply to repeated tokens.
-            top_k: The top-k value to use for sampling.
+            top_k: The top-k value to use for sampling. Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
             stream: Whether to stream the results.
+            seed: The seed to use for sampling.
+            tfs_z: The tail-free sampling parameter. Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
+            mirostat_mode: The mirostat sampling mode.
+            mirostat_tau: The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
+            mirostat_eta: The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
+            model: The name to use for the model in the completion object.
+            stopping_criteria: A list of stopping criteria to use.
+            logits_processor: A list of logits processors to use.
+            grammar: A grammar to use for constrained sampling.
+            logit_bias: A logit bias to use.
 
         Raises:
             ValueError: If the requested tokens exceed the context window.
```
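Together these entries document the full sampling surface of `create_completion`. A minimal sketch of how the documented parameters are passed; the model path is a placeholder and the values are illustrative, not recommendations:

```python
from llama_cpp import Llama

# Placeholder path: any local GGUF model file works here.
llm = Llama(model_path="./models/7B/model.gguf")

# Exercise the sampling parameters documented in the hunk above.
out = llm.create_completion(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=64,          # <= 0 or None would generate up to n_ctx
    temperature=0.8,
    top_p=0.95,             # nucleus sampling
    min_p=0.05,             # minimum-p sampling (llama.cpp PR #3841)
    typical_p=1.0,          # 1.0 leaves locally typical sampling inactive
    top_k=40,
    repeat_penalty=1.1,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    seed=1234,              # fixed seed for reproducible sampling
    stop=["Q:", "\n"],
)
print(out["choices"][0]["text"])
```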
```diff
@@ -1944,15 +1958,29 @@ def __call__(
         Args:
             prompt: The prompt to generate text from.
             suffix: A suffix to append to the generated text. If None, no suffix is appended.
-            max_tokens: The maximum number of tokens to generate. If max_tokens <= 0, the maximum number of tokens to generate is unlimited and depends on n_ctx.
+            max_tokens: The maximum number of tokens to generate. If max_tokens <= 0 or None, the maximum number of tokens to generate is unlimited and depends on n_ctx.
             temperature: The temperature to use for sampling.
-            top_p: The top-p value to use for sampling.
+            top_p: The top-p value to use for nucleus sampling. Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
+            min_p: The min-p value to use for minimum p sampling. Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
+            typical_p: The typical-p value to use for sampling. Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
             logprobs: The number of logprobs to return. If None, no logprobs are returned.
             echo: Whether to echo the prompt.
             stop: A list of strings to stop generation when encountered.
+            frequency_penalty: The penalty to apply to tokens based on their frequency in the prompt.
+            presence_penalty: The penalty to apply to tokens based on their presence in the prompt.
             repeat_penalty: The penalty to apply to repeated tokens.
-            top_k: The top-k value to use for sampling.
+            top_k: The top-k value to use for sampling. Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
             stream: Whether to stream the results.
+            seed: The seed to use for sampling.
+            tfs_z: The tail-free sampling parameter. Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
+            mirostat_mode: The mirostat sampling mode.
+            mirostat_tau: The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
+            mirostat_eta: The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
+            model: The name to use for the model in the completion object.
+            stopping_criteria: A list of stopping criteria to use.
+            logits_processor: A list of logits processors to use.
+            grammar: A grammar to use for constrained sampling.
+            logit_bias: A logit bias to use.
 
         Raises:
             ValueError: If the requested tokens exceed the context window.
```
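`__call__` is a thin alias for `create_completion`, so the same keyword arguments apply when the `Llama` object is invoked directly. A sketch combining streaming with the Mirostat parameters documented above, assuming the llama.cpp convention that mode 2 selects Mirostat v2; the tau and eta values mirror llama.cpp's defaults:

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/model.gguf")  # placeholder path

# Calling the object directly forwards to create_completion; with
# stream=True it yields completion chunks incrementally.
for chunk in llm(
    "Write a haiku about autumn.",
    max_tokens=48,
    temperature=0.7,
    mirostat_mode=2,    # 2 selects Mirostat v2
    mirostat_tau=5.0,   # target surprise; higher => less predictable text
    mirostat_eta=0.1,   # learning rate for the mu update
    seed=7,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```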
```diff
@@ -2026,11 +2054,29 @@ def create_chat_completion(
             messages: A list of messages to generate a response for.
+            functions: A list of functions to use for the chat completion.
+            function_call: A function call to use for the chat completion.
+            tools: A list of tools to use for the chat completion.
+            tool_choice: A tool choice to use for the chat completion.
             temperature: The temperature to use for sampling.
-            top_p: The top-p value to use for sampling.
-            top_k: The top-k value to use for sampling.
+            top_p: The top-p value to use for nucleus sampling. Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
+            top_k: The top-k value to use for sampling. Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
+            min_p: The min-p value to use for minimum p sampling. Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
+            typical_p: The typical-p value to use for sampling. Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
             stream: Whether to stream the results.
             stop: A list of strings to stop generation when encountered.
+            seed: The seed to use for sampling.
+            response_format: The response format to use for the chat completion. Use { "type": "json_object" } to constrain output to only valid json.
             max_tokens: The maximum number of tokens to generate. If max_tokens <= 0 or None, the maximum number of tokens to generate is unlimited and depends on n_ctx.
+            presence_penalty: The penalty to apply to tokens based on their presence in the prompt.
+            frequency_penalty: The penalty to apply to tokens based on their frequency in the prompt.
             repeat_penalty: The penalty to apply to repeated tokens.
+            tfs_z: The tail-free sampling parameter.
+            mirostat_mode: The mirostat sampling mode.
+            mirostat_tau: The mirostat sampling tau parameter.
+            mirostat_eta: The mirostat sampling eta parameter.
+            model: The name to use for the model in the completion object.
+            logits_processor: A list of logits processors to use.
+            grammar: A grammar to use.
+            logit_bias: A logit bias to use.
 
         Returns:
             Generated chat completion or a stream of chat completion chunks.
```
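A sketch of the chat-completion side, exercising `messages` and the new `response_format` parameter; the model path and `chat_format` are placeholders for whatever model is actually loaded:

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/model.gguf", chat_format="llama-2")

# response_format={"type": "json_object"} constrains output to valid JSON,
# as documented in the hunk above.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You answer in valid JSON."},
        {"role": "user", "content": "List three primary colors."},
    ],
    response_format={"type": "json_object"},
    temperature=0.2,
    top_p=0.95,
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```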