vLLM backend for Cortex #1890

Open
ramonpzg opened this issue Jan 27, 2025 · 5 comments · May be fixed by #2010
@ramonpzg
Contributor

ramonpzg commented Jan 27, 2025

At the moment, the process by which a user can build a custom Python engine to deploy a model via Cortex is neither straightforward in code nor clearly documented. The plan is:

  • improve the process for building a custom Python engine
  • remove unnecessary parameters from the model.yml config
  • improve the documentation
  • add examples of different custom engines to the docs

Goals

  • vLLM is a Python library that runs on Linux machines
  • vLLM can run LLMs (see the usage sketch below)
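
For context, this is roughly what running an LLM with vLLM's offline API looks like (the model name here is just an illustration, not a decision):

```python
# Minimal vLLM usage sketch; the model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # downloads weights from the HF Hub
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is Cortex?"], params)
print(outputs[0].outputs[0].text)
```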

Tasks

Obstacle

  • it is a huge effort to maintain this in Cortex

Out of scope

  • Audio
@ramonpzg ramonpzg added P1: important Important feature / fix P2: enhancement low impact on functionality labels Jan 27, 2025
@ramonpzg ramonpzg added this to the v1.0.9 milestone Jan 27, 2025
@ramonpzg ramonpzg self-assigned this Jan 27, 2025
@ramonpzg ramonpzg moved this to Investigating in Menlo Jan 27, 2025
@ramonpzg ramonpzg added this to Menlo Jan 27, 2025
@gau-nernst gau-nernst self-assigned this Feb 3, 2025
@gau-nernst
Contributor

gau-nernst commented Feb 13, 2025

A "Python engine" actually has a larger scope than the current llama.cpp engine:

  • A Python engine should be able to run arbitrary Python programs
  • llama.cpp can only run models that llama.cpp supports, and they must be in a supported format, i.e. GGUF.

Therefore, having a Python engine means we have to solve the following two problems:

  1. Dependency management
  2. How to expose Python engine

Dependency management

uv is the new cool kid in town. Not only is it reportedly more robust than pip+venv, it's also fast and provides many conveniences.

Two interesting "modes"

  • Script mode: a single .py file (see the sketch after this list)
    • Declare deps at the top of the file
    • uv run app.py -> automatically installs the deps in an isolated env and runs the app
  • Project mode: a folder with a pyproject.toml
    • A normal Python project. Declare deps in pyproject.toml
    • Within the project folder, uv run main.py -> automatically creates a local .venv, installs the deps, installs the Python project itself, and runs the app
    • If the Python project exposes a CLI command, it's possible to run that CLI command too.
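
For illustration, a minimal script-mode file (the inline metadata block is PEP 723, which uv understands; requests is just an example dependency):

```python
# app.py -- run with `uv run app.py`; uv reads the inline metadata below,
# installs the deps into an isolated env, and then executes the script.
# /// script
# requires-python = ">=3.10"
# dependencies = ["requests"]
# ///
import requests

print(requests.get("https://example.com").status_code)
```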

How to expose Python engine

cortex-server is the core C++ web server (built on Drogon) that serves HTTP requests. We can either:

  1. Embed a Python interpreter within the C++ cortex-server code https://docs.python.org/3/extending/embedding.html
  2. Run the Python app in a separate process -> we'll probably go with this since it seems easier. For this, there are 2 sub-options:
    a. Python<->cortex-server communicate over some high-perf IPC (inter-process communication); cortex-server still serves the HTTP requests
    b. Python has its own web server (does the user define it, or do we provide one?). Again, 2 sub-sub-options:
    • External requests go directly to the Python web server instead of cortex-server (each process must use a different port?)
    • External requests go to cortex-server first, and cortex-server just routes the request to the Python web server

For 2a, I foresee a potential problem: how will we configure/propagate the function signature / API contract from Python to cortex-server?
E.g. a Python text-to-speech engine will expose def tts(text: str) -> bytes (the output is encoded audio, e.g. mp3) -> how do we avoid hard-coding this in the cortex-server code?
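
One hedged idea for sidestepping the hard-coding: the Python side describes each endpoint as data (name, input schema, output content type) and cortex-server consumes that generically. A hypothetical sketch using pydantic v2 (the contract shape here is invented, not an existing Cortex API):

```python
# Hypothetical sketch: describe the tts() contract as data so that
# cortex-server can validate/route generically instead of hard-coding it.
from pydantic import BaseModel


class TTSRequest(BaseModel):
    text: str


contract = {
    "endpoint": "/tts",
    "input_schema": TTSRequest.model_json_schema(),  # pydantic v2 API
    "output_content_type": "audio/mpeg",  # encoded audio, e.g. mp3
}
print(contract)
```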

Case study - Current Python engine

Extension point

Using uv, apart from serving a model, we can also use it to run Python CLI programs, e.g. https://github.com/janhq/robobench

@ramonpzg
Contributor Author

Managing dependencies and running scripts with uv is a great idea. We'll need to figure out whether it's possible to drive this from within C++ (uv explicitly doesn't expose Python classes, so it can't be imported as a normal library, but we could double-check this).

In the "How to Expose the Python Engine" section, we could create a tight integration between the two languages by exposing Python objects and tools as C++ ones and vice-versa using packages like pybind11 and scikit-build-core. This could potentially address the 2a concern on how to propagate signature functions like tts() from Python to C++ and vice-versa.

As @gau-nernst mentioned, running Python processes as separate HTTP servers and having the cortex-server act as a reverse proxy/router would be THE solution worth exploring first (IMO).

Some thoughts on a potential action plan to test this:

  1. The cortex-server spawns a Python process via uv run, specifying a temporary port (or ports) that is dynamically assigned; cortex-server tracks these in a registry (Cortex itself lacks this right now, and ports have to be changed manually in the .cortexrc or by starting the server with the -p flag).
  2. The Python process sends its API schema to a registration endpoint on cortex-server, for example /metadata. The schema defines input/output types and provides validation and routing without having to hardcode anything (e.g. ask users to provide pydantic or JSON schema objects). A sketch of this flow follows after the list.
  3. On termination, the cortex-server removes the model from the routing table.
  4. Cortex-server validates incoming requests against the schema before proxying.
    • Modify the Drogon routes to resolve URLs to backend Python ports using the registry.
    • Optionally extend this to multiple instances of a model.
  5. For high throughput, use gRPC (the Python process would therefore expose separate endpoints for this).
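
A minimal sketch of steps 1-2 from the Python side, assuming cortex-server exposed a /metadata registration endpoint (the endpoint, payload shape, and server address are assumptions, not existing Cortex APIs):

```python
# Hypothetical sketch: the spawned Python process grabs a free port and
# registers itself (plus its schema) with cortex-server's /metadata endpoint.
import json
import socket
import urllib.request

# Bind port 0 so the OS assigns a free port (the "dynamic port" from step 1).
sock = socket.socket()
sock.bind(("127.0.0.1", 0))
port = sock.getsockname()[1]

payload = json.dumps({
    "model": "my-python-model",            # hypothetical model id
    "port": port,
    "schema": {"endpoint": "/tts", "input": {"text": "string"}},
}).encode()

req = urllib.request.Request(
    "http://127.0.0.1:39281/metadata",     # assumed cortex-server address
    data=payload,
    headers={"Content-Type": "application/json"},
)
try:
    urllib.request.urlopen(req, timeout=2)
except OSError as e:
    print("registration failed (no cortex-server listening):", e)
```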

Note: I will update this as I give it more thought

@gau-nernst gau-nernst linked a pull request Feb 21, 2025 that will close this issue
@ramonpzg ramonpzg added this to Jan Mar 13, 2025
@ramonpzg ramonpzg moved this to In Progress in Jan Mar 13, 2025
@dan-menlo dan-menlo added the type: epic A major feature or initiative label Mar 13, 2025
@dan-menlo
Contributor

Is it possible to have sandboxed environments? Marimo seems to have done it with uv:

https://github.com/marimo-team/marimo/releases/tag/0.8.4

@dan-menlo dan-menlo changed the title enhancement: improvement to Python Engine's compatibility with cortex epic: Python Engine's compatibility with cortex Mar 13, 2025
@gau-nernst
Contributor

> Run marimo notebooks in package sandboxes. Use marimo edit --sandbox notebook.py to edit a Python notebook in a completely isolated virtual environment!

It's just having a separate virtual env to avoid package conflicts (the current Python PR is already doing this). It's not a sandbox in the security sense (i.e. the Python script shouldn't be allowed to access host files, the internet, etc.).

@ramonpzg ramonpzg changed the title epic: Python Engine's compatibility with cortex Python Engine's compatibility with cortex Mar 16, 2025
@ramonpzg ramonpzg added epic and removed P1: important Important feature / fix type: epic A major feature or initiative P2: enhancement low impact on functionality labels Mar 16, 2025
@ramonpzg ramonpzg modified the milestones: v1.0.9, Caffeinated Sloth Mar 16, 2025
@gau-nernst gau-nernst moved this from Investigating to In Progress in Menlo Mar 17, 2025
@gau-nernst gau-nernst changed the title Python Engine's compatibility with cortex feat: vLLM backend for Cortex Mar 17, 2025
@gau-nernst
Contributor

As discussed previously, to limit the scope of this feature, we won't expose the Python engine directly, only specific applications. For the current milestone, we aim to add vLLM as an alternative backend for Cortex.
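
A sketch of how that could be wired up, following the separate-process + reverse-proxy direction above: Cortex spawns vLLM's OpenAI-compatible server via uv and routes requests to it. uv run --with and the vllm.entrypoints.openai.api_server module are real; everything else (port choice, model, lifecycle) is an assumption:

```python
# Sketch: launch vLLM's OpenAI-compatible server as a child process, the way
# cortex-server might bring up a vLLM backend and then proxy requests to it.
import subprocess

port = 8001  # illustrative; in practice this would come from the port registry
proc = subprocess.Popen([
    "uv", "run", "--with", "vllm",               # isolated env with vllm installed
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "Qwen/Qwen2.5-0.5B-Instruct",     # illustrative model
    "--port", str(port),
])
# cortex-server would now route /v1/chat/completions for this model to
# http://127.0.0.1:8001, and call proc.terminate() on model unload.
```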

@ramonpzg ramonpzg changed the title feat: vLLM backend for Cortex vLLM backend for Cortex Mar 19, 2025
@david-menloai david-menloai changed the title vLLM backend for Cortex Epic: vLLM backend for Cortex Apr 2, 2025
@david-menloai david-menloai changed the title Epic: vLLM backend for Cortex vLLM backend for Cortex Apr 2, 2025