PoC: server handling multiple clients with custom attention mask api #3462
Conversation
Very nice job - thank you for initiating this work! We will have to discuss what would be the best way to add this example to the project. A separate example as you have proposed is definitely an option, but we might want to consider merging it into the existing server example.
I think that adding more complexity to the server example, at least in terms of code readability, would be excessive. It would need to cover both stream-based and wait-for-completion modes, as well as per-client sampling grammar. I believe this is beyond the scope of this PR, although I welcome opinions and suggestions. I will be working on integrating this functionality directly into the server example in another branch. Wish me luck! 🤣🤣 Edit: Now that I review the situation carefully: it's a lot of work, debugging, and time 💀. Plus, I don't even know how some things like infill and server grammar work.
I'm also experimenting with integrating the feature into the server example, but progress is slow. If you need any help, please let me know.
Yup, don't worry about this - I mentioned it as something to consider. The standalone example is likely the better option.
I would like the main server to be updated with this functionality, but it may need a bit of a rewrite.
I wanted to update my fork with the latest changes from the master branch, but it went wrong :(.
Hello, I know it's something no one asked for, but some of us need it. Here's a proof of concept of a server that handles multiple clients simultaneously, thanks to the new way of working with the KV cache.
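This is not the PR's actual code, but a simplified sketch of the idea behind the custom attention mask that makes multiple clients share one KV cache: each client's tokens are tagged with a sequence id, and a token may only attend to cached tokens from the same sequence at an earlier or equal position. The `KvToken`, `can_attend`, and `build_mask` names are invented for illustration.

```cpp
#include <cstddef>
#include <vector>

// One entry per token currently stored in the shared KV cache.
struct KvToken {
    int seq_id; // which client/slot the token belongs to
    int pos;    // position within that client's sequence
};

// Causal masking per sequence: a query token may attend to a cached
// token only if both belong to the same client sequence and the cached
// token is not in the future.
bool can_attend(const KvToken &q, const KvToken &k) {
    return q.seq_id == k.seq_id && k.pos <= q.pos;
}

// Build the full attention mask over the cache: mask[i][j] is true when
// token i is allowed to attend to token j.
std::vector<std::vector<bool>> build_mask(const std::vector<KvToken> &cache) {
    const size_t n = cache.size();
    std::vector<std::vector<bool>> mask(n, std::vector<bool>(n, false));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            mask[i][j] = can_attend(cache[i], cache[j]);
    return mask;
}
```

The point is that interleaving several clients in one batch stays correct because cross-sequence attention is masked out entirely.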
Some may wonder why this proposal is implemented as a separate example. The current server implementation is quite complex, and many things could break.
Tested on:
(Demo video attachment: Server.Parallel.Improvements.mp4)
This is a proof of concept for now, with some feedback and assistance, we could make it more usable.
Here is the command to start the server:
```shell
./server-parallel -m models/7B/ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching --reverse-prompt "User:"
```
Modify `--parallel` to set the number of slots used to process client requests.
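As a hypothetical sketch (not the PR's code), the `--parallel` slot pool can be thought of as a fixed array of slots: an incoming request grabs a free slot or waits until one is released. The `Slot`, `acquire_slot`, and `release_slot` names are invented for this illustration.

```cpp
#include <vector>

// One processing slot; each concurrently served client occupies one.
struct Slot {
    int  id;
    bool busy;
};

// Returns the id of a free slot and marks it busy, or -1 if all slots
// are occupied (the request would then have to wait in a queue).
int acquire_slot(std::vector<Slot> &slots) {
    for (auto &s : slots) {
        if (!s.busy) {
            s.busy = true;
            return s.id;
        }
    }
    return -1;
}

// Frees a slot once its client's generation has finished.
void release_slot(std::vector<Slot> &slots, int id) {
    slots[id].busy = false;
}
```

With `--parallel 3`, three such slots exist, so a fourth simultaneous request has to wait for one of the first three to finish.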
Note: many people are going to want to kill me when they see how I handle multithreading without using a mutex; I never knew what those were for :(.