PoC: server handling multiple clients with custom attention mask api #3462
Conversation
Very nice job - thank you for initiating this work! We will have to discuss what would be the best way to add this example to the project. A separate example as you have proposed is definitely an option, but we might want to consider merging it into the existing server example.
I think that adding more complexity to the server example, at least in terms of code readability, would be excessive. It would need to cover both stream-based and wait-for-completion modes, as well as per-client sampling grammar. I believe this is beyond the scope of this PR, although I welcome opinions and suggestions. I will be working on integrating this functionality directly into the server example in another branch. Wish me luck! 🤣🤣 Edit: Now that I review the situation carefully: it's a lot of work, debugging, and time 💀. Plus, I don't even know how some things like infill and server grammar work.
I'm also experimenting with integrating the feature into the server example, but progress is slow. If you need any help, please let me know.
Yup, don't worry about this - I mentioned it as something to consider. The standalone example is likely the better option.
I would like the main server to be updated with this functionality, but it may need a bit of a rewrite.
I wanted to update my fork with the latest changes from the master branch, but it went wrong :(.
Hello, I know it's something no one asked for, but some of us need it. Here's a proof of concept of a server that handles multiple clients simultaneously, thanks to the new way of working with the KV cache.
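This is not the PR's actual code, but a simplified sketch of the idea behind the custom attention mask that makes multiple clients share one KV cache: each client's tokens are tagged with a sequence id, and a token may only attend to cached tokens from the same sequence at an earlier or equal position. The `KvToken`, `can_attend`, and `build_mask` names are invented for illustration.

```cpp
#include <cstddef>
#include <vector>

// One entry per token currently stored in the shared KV cache.
struct KvToken {
    int seq_id; // which client/slot the token belongs to
    int pos;    // position within that client's sequence
};

// Causal masking per sequence: a query token may attend to a cached
// token only if both belong to the same client sequence and the cached
// token is not in the future.
bool can_attend(const KvToken &q, const KvToken &k) {
    return q.seq_id == k.seq_id && k.pos <= q.pos;
}

// Build the full attention mask over the cache: mask[i][j] is true when
// token i is allowed to attend to token j.
std::vector<std::vector<bool>> build_mask(const std::vector<KvToken> &cache) {
    const size_t n = cache.size();
    std::vector<std::vector<bool>> mask(n, std::vector<bool>(n, false));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            mask[i][j] = can_attend(cache[i], cache[j]);
    return mask;
}
```

The point is that interleaving several clients in one batch stays correct because cross-sequence attention is masked out entirely.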
Some may wonder why this proposal is implemented as a separate example. The current server implementation is quite complex, and many things could break.
Tested on:
(Demo video attachment: Server.Parallel.Improvements.mp4)
This is a proof of concept for now, with some feedback and assistance, we could make it more usable.
Here is the command to start the server:
```shell
./server-parallel -m models/7B/ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching --reverse-prompt "User:"
```
Modify `--parallel` to set the number of slots used to process client requests.
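As a hypothetical sketch (not the PR's code), the `--parallel` slot pool can be thought of as a fixed array of slots: an incoming request grabs a free slot or waits until one is released. The `Slot`, `acquire_slot`, and `release_slot` names are invented for this illustration.

```cpp
#include <vector>

// One processing slot; each concurrently served client occupies one.
struct Slot {
    int  id;
    bool busy;
};

// Returns the id of a free slot and marks it busy, or -1 if all slots
// are occupied (the request would then have to wait in a queue).
int acquire_slot(std::vector<Slot> &slots) {
    for (auto &s : slots) {
        if (!s.busy) {
            s.busy = true;
            return s.id;
        }
    }
    return -1;
}

// Frees a slot once its client's generation has finished.
void release_slot(std::vector<Slot> &slots, int id) {
    slots[id].busy = false;
}
```

With `--parallel 3`, three such slots exist, so a fourth simultaneous request has to wait for one of the first three to finish.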
Note: many people are going to want to kill me when they see how I handle multithreading without using a mutex; I never knew what those were for :(.