Conversation

@jhen0409 jhen0409 commented Aug 28, 2023

This fixes an issue where the probabilities of the stopping_word were included in the final response of /completion.

To test response without stream mode:

curl --url http://localhost:8080/completion --header "Content-Type: application/json" \
  --data '{ "n_probs": 1, "prompt": "Hello my name is", "stop": ["I"] }' | json_pp

With stream mode (see the last event):

curl -N --url http://localhost:8080/completion --header "Content-Type: application/json" \
  --data '{ "stream": true, "n_probs": 1, "prompt": "Hello my name is", "stop": ["I"] }'

Or see the console output in the web UI.
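
In stream mode the server sends server-sent events, and the final data: event is the one carrying "stop": true. With this fix, that event's completion_probabilities should likewise omit the stop word. An illustrative sketch of the last event (values made up):

data: {"content":"","stop":true,"stopped_word":true,"stopping_word":"I","completion_probabilities":[{"content":" John","probs":[{"prob":0.42,"tok_str":" John"}]}]}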

@ggerganov ggerganov changed the title from server : avoid aniprompt in probabilities of final response to server : avoid antiprompt in probabilities of final response Aug 28, 2023
@jhen0409 jhen0409 requested a review from SlyEcho September 1, 2023 01:10

@SlyEcho SlyEcho left a comment

Yep, that does the trick.

@jhen0409 jhen0409 merged commit 571083f into ggml-org:master Sep 2, 2023
@jhen0409 jhen0409 deleted the fix-server-final-probs branch September 2, 2023 00:31
sayap added a commit to sayap/ik_llama.cpp that referenced this pull request Nov 22, 2025
The logic to skip the logprobs of the stop token was originally from
ggml-org/llama.cpp#2849, and was later modified as part of
ggml-org/llama.cpp#10643 to be applied only to STOP_TYPE_WORD.

The latter change wasn't included in ikawrakow#723. Then, after ikawrakow#958 got merged,
the logic got inadvertently applied to GLM-4.5/4.6 and Kimi K2,
resulting in truncated logprobs when streaming is off.

This commit reverts the logic from ggml-org/llama.cpp#2849, such that
the logprobs of the stop token will always be included in the response,
when logprobs is enabled. From testing, this matches with the behavior
of Fireworks inference server, for both chat completions and text
completions endpoints.

Also fix logprobs param handling for the text completion endpoint.
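
A quick way to spot-check the reverted behavior described above (a sketch, assuming an ik_llama.cpp server listening on localhost:8080 that exposes the OpenAI-compatible text completion endpoint; the prompt and stop word are placeholders):

# endpoint path, port, and request values are assumptions; adjust to your server
curl --url http://localhost:8080/v1/completions --header "Content-Type: application/json" \
  --data '{ "prompt": "Hello my name is", "logprobs": 1, "stop": ["I"] }' | json_pp

After this commit, the logprobs block in the response should include an entry for the stop token rather than ending just before it.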
ikawrakow pushed a commit to ikawrakow/ik_llama.cpp that referenced this pull request Nov 24, 2025