Serve multiple models with [server] #906

Closed
bioshazard opened this issue Nov 13, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@bioshazard
Contributor

lcp[server] has been excellent, and I can already host two models by running a second instance.

I'd like to be able to serve multiple models from a single instance of the OpenAI-compatible server and switch between them based on an alias-able `model` field in the request payload. My use case is serving a code model and bakllava at the same time.

I'm going to look into PRing an Nginx configuration example that reverse proxies to two instances based on the model named in the POST body, but first-class support would be great. If no one picks this up, I might attempt a PR for first-class support in early '24.
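For anyone who wants to experiment with the routing idea in the meantime, here is a minimal sketch of a small Python reverse proxy. FastAPI and httpx are assumptions of the sketch, not anything lcp[server] ships; the aliases, ports, and default backend are placeholders, and streaming responses are not handled:

```python
# Illustrative only: route OpenAI-style requests to one of two llama.cpp server
# instances based on the "model" field in the POST body.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import httpx

app = FastAPI()

# Map model aliases to the backend instance serving them (hypothetical ports).
BACKENDS = {
    "code-model": "http://127.0.0.1:8001",
    "bakllava": "http://127.0.0.1:8002",
}
DEFAULT_BACKEND = "http://127.0.0.1:8001"


@app.post("/v1/{path:path}")
async def proxy(path: str, request: Request):
    body = await request.json()
    backend = BACKENDS.get(body.get("model", ""), DEFAULT_BACKEND)
    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(f"{backend}/v1/{path}", json=body)
    return JSONResponse(content=resp.json(), status_code=resp.status_code)
```

Run each server instance on its own port, point clients at the proxy, and the `model` field in the request body decides which backend answers.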

@bioshazard
Contributor Author

It might also help to serve a separate model specifically for embeddings rather than relying on whatever my latest 7B obsession does.

@abetlen abetlen added the enhancement New feature or request label Nov 13, 2023
@NewtonTrendy

I would like this too!

@abetlen
Owner

abetlen commented Dec 22, 2023

Implemented in #931; next up is #736.
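For reference, a hedged sketch of how a client might target different aliases once a single instance serves multiple models; the base URL, aliases, and API key below are placeholders for illustration, not the actual configuration added in #931:

```python
# Sketch: select between aliased models on one OpenAI-compatible endpoint
# via the "model" field. Aliases and base_url are hypothetical.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-no-key-needed")

# Chat completion against a code-model alias
chat = client.chat.completions.create(
    model="code-model",
    messages=[{"role": "user", "content": "Write a hello-world in Python."}],
)

# Embeddings against a dedicated embedding-model alias
emb = client.embeddings.create(model="embed-model", input=["hello world"])

print(chat.choices[0].message.content)
print(len(emb.data[0].embedding))
```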

@abetlen abetlen closed this as completed Dec 22, 2023
@bioshazard
Contributor Author

Thank you!! And wow #736 is a huge idea.

@milkymap

I've been working on a system that could address some of the limitations of the current server implementation. Specifically, I'm developing a solution that allows for:

  1. Loading and maintaining multiple models simultaneously in memory
  2. Parallel processing of requests across different models
  3. Dynamic model loading and unloading based on demand

This approach could significantly enhance performance and flexibility, especially for applications requiring rapid switching between different models or concurrent use of multiple models.

I'd be happy to share more details or contribute to implementing this feature if there's interest. The system uses ZeroMQ for efficient inter-process communication and asyncio for non-blocking operations, allowing for high concurrency and scalability.

Let me know if you'd like to discuss this further or if you have any questions about the proposed implementation.
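For illustration, a rough sketch of the broker pattern described above, assuming pyzmq, asyncio, and llama-cpp-python's `Llama` class; the class name, model paths, and wire format are placeholders rather than the actual implementation:

```python
import asyncio
import zmq
import zmq.asyncio
from llama_cpp import Llama


class ModelBroker:
    """Keep several Llama instances in memory and answer prompts over ZeroMQ."""

    def __init__(self, model_paths: dict, endpoint: str = "tcp://127.0.0.1:5555"):
        # Load every model up front so requests can be routed without reloading.
        self.models = {alias: Llama(model_path=path) for alias, path in model_paths.items()}
        self.ctx = zmq.asyncio.Context()
        self.sock = self.ctx.socket(zmq.ROUTER)
        self.sock.bind(endpoint)

    async def serve(self):
        send_lock = asyncio.Lock()  # serialize replies on the shared socket
        while True:
            identity, _, payload = await self.sock.recv_multipart()
            # Handle each request in its own task so different models can run concurrently.
            asyncio.create_task(self._handle(identity, payload, send_lock))

    async def _handle(self, identity, payload, send_lock):
        alias, prompt = payload.decode().split("|", 1)  # toy wire format: "alias|prompt"
        # llama.cpp calls block, so run them in a worker thread to keep the event loop responsive.
        result = await asyncio.to_thread(self.models[alias], prompt, max_tokens=128)
        reply = result["choices"][0]["text"].encode()
        async with send_lock:
            await self.sock.send_multipart([identity, b"", reply])


if __name__ == "__main__":
    broker = ModelBroker({"code": "./codellama.gguf", "vision": "./bakllava.gguf"})
    asyncio.run(broker.serve())
```

Dynamic loading and unloading would sit on top of this, for example by evicting the least recently used entry from `self.models` when memory pressure rises.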
