PoC: server handling multiple clients with custom attention mask api #3462


Closed
wants to merge 11 commits

Conversation

FSSRepo
Collaborator

@FSSRepo FSSRepo commented Oct 3, 2023

Hello, I know it's something no one asked for, but some of us need it. Here's a proof of concept of a server that handles multiple clients simultaneously, thanks to the new way of working with the KV cache.

Some may wonder why this proposal is implemented as a separate example. The current server implementation is quite complex, and many things could break if it were modified directly.

Tested on:

Windows 11 x64
Intel Core i5 11400H 6 C / 12 T
RTX 3050 laptop 4 GB VRAM
16 GB of RAM DDR4 3200MHz
Server.Parallel.Improvements.mp4

This is a proof of concept for now; with some feedback and assistance, we could make it more usable.

Here is the command to start the server:

./server-parallel -m models/7B/ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching --reverse-prompt "User:"

Set --parallel to the number of slots available for processing client requests.

Edit:

  • New video showing 4 clients at the same time; my laptop almost exploded 😂.
  • Improved the PR notes.

Note:

Many people are going to want to kill me when they see how I handle multithreading without using a mutex; I never knew what those were for :(.

@ggerganov
Member

Very nice job - thank you for initiating this work!

We will have to discuss what would be the best way to add this example to the project. A separate example as you have proposed is definitely an option, but we might want to consider merging it in the existing server.cpp. Tagging @SlyEcho @Green-Sky for thoughts on this

@FSSRepo
Collaborator Author

FSSRepo commented Oct 4, 2023

but we might want to consider merging it in the existing server.cpp.

I think that adding more complexity to the server example, at least in terms of code readability, would be excessive. It would need to cover both streaming and wait-for-completion modes, as well as the sampling grammar for each client. I believe this is beyond the scope of this PR, although I welcome opinions and suggestions. I will work on integrating this functionality directly into the server example in another branch. Wish me luck! 🤣🤣

Edit:

Now that I've reviewed the situation carefully: it's a lot of work, debugging, and time 💀. Plus, I don't even know how some features, like infill and the server grammar, work.

@jhen0409
Collaborator

jhen0409 commented Oct 4, 2023

I think that adding more complexity to the server example, at least in terms of code readability, would be excessive. It would need to cover both streaming and wait-for-completion modes, as well as the sampling grammar for each client. I believe this is beyond the scope of this PR, although I welcome opinions and suggestions. I will work on integrating this functionality directly into the server example in another branch. Wish me luck! 🤣🤣

I'm also experimenting with integrating this feature into the server example, but progress is slow. If you need any help, please let me know.

@ggerganov
Member

Yup, don't worry about this; I mentioned it as something to consider. The standalone example is likely the better option.

@SlyEcho
Collaborator

SlyEcho commented Oct 5, 2023

I would like the main server to be updated with this functionality, but it may need a bit of a rewrite.

@FSSRepo FSSRepo closed this Oct 5, 2023
@FSSRepo FSSRepo reopened this Oct 5, 2023
@FSSRepo
Collaborator Author

FSSRepo commented Oct 5, 2023

I wanted to update my fork with the latest changes from the master branch, but the merge went wrong :(.
