server : avoid common_batch #16
Conversation
ggml-ci
examples/server/server.cpp (outdated)

```diff
-    slot.batch_spec.clear();
-    slot.batch_spec.add_text(id, slot.n_past, slot.id, true);
+    //slot.batch_spec.clear();
+    //slot.batch_spec.add_text(id, slot.n_past, slot.id, true);
```
I think these comments can be safely removed: since we no longer have the notion of `common_batch` in the code base, and these calls are specific to `common_batch`.
Force-pushed from 5cebe54 to b8b1732
Re. pooling == none: yes, I completely agree that we can temporarily disable it and reimplement it in the future. For now it's only used by TTS, so I guess it won't affect much. Merging this now, thanks!
ref ggml-org#11875

Avoid usage of `common_batch` in `llama-server`.

The only use case that is not covered by this change is embeddings with pooling type == `none`. In that case, we have to output an embedding vector for each input token. With the current API, this is a little bit difficult to implement without a helper struct such as `common_batch`.

Moreover, I think that on `master` there is a bug where the tokens of a given slot could end up separated into different batches, which would result in multiple partial results being emitted for a single task. This is probably very rare (it requires parallel users and a fragmented KV cache), which is why we haven't experienced it yet.

I am wondering if we should disable the option to have embeddings without pooling for now, and figure out how to properly support it in the future.
@ngxson What do you think?
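For context, a minimal sketch of why `pooling == none` is the awkward case. This is plain Python, not the llama.cpp API; the names `embed_tokens` and `mean_pool` and the toy embedding values are hypothetical, purely to illustrate the difference in output shape:

```python
# Illustrative only: contrast pooled embeddings (one vector per sequence)
# with pooling == none (one vector per input token). Not llama-server code.

def embed_tokens(tokens):
    # Stand-in for the model: produce one fixed-size vector per input token.
    dim = 4
    return [[float(t + j) for j in range(dim)] for t in tokens]

def mean_pool(vectors):
    # With mean pooling, the per-token vectors collapse into a single vector,
    # so the task result has the same shape no matter how tokens were batched.
    dim = len(vectors[0])
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(dim)]

tokens = [10, 20, 30]

per_token = embed_tokens(tokens)   # pooling == none: len(tokens) vectors
pooled = mean_pool(per_token)      # pooled: exactly one vector

assert len(per_token) == len(tokens)  # one embedding per input token
assert len(pooled) == 4               # one embedding per sequence
```

With pooling enabled, a task yields a single vector regardless of how its tokens were batched. With `none`, the result is a list with one vector per token, so if a slot's tokens end up split across two batches, each batch naturally produces only part of that list, which is the partial-results hazard described above.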