
Conversation

@jujipotle (Contributor) commented May 1, 2025

WIP

P0:

  • Configure LLMRouter to use PrefixAwareScheduler.
  • Investigate and resolve the scheduling-task out-of-order issue.
  • Update the tree with both input and response text using callbacks from the router (see the tree sketch after these lists).
  • Handle autoscaling.
  • Update the tree without causing race conditions.
  • Add tests.

P1:

  • Implement eviction policy for vLLM replicas
  • Investigate load balancing vs prefix hit rate
  • Stress-test the tree itself
  • Investigate impact of tree size on traversal time
  • Investigate SGLang / Dynamo strategies
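
For context on the tree items above, here is a minimal sketch of the kind of prefix tree a prefix-aware router could maintain. This is illustrative only: the names PrefixTree, insert, and prefix_match, and the character-level granularity, are assumptions, not the actual API in this PR.

from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class _Node:
    children: Dict[str, "_Node"] = field(default_factory=dict)
    # Replica IDs believed to hold this prefix in their KV cache.
    replicas: Set[str] = field(default_factory=set)


class PrefixTree:
    """Character-level prefix tree mapping prompt prefixes to replicas (sketch)."""

    def __init__(self) -> None:
        self.root = _Node()

    def insert(self, text: str, replica_id: str) -> None:
        # Record that `replica_id` processed `text` (prompt or response).
        node = self.root
        node.replicas.add(replica_id)
        for ch in text:
            node = node.children.setdefault(ch, _Node())
            node.replicas.add(replica_id)

    def prefix_match(self, text: str) -> Tuple[int, Set[str]]:
        # Return the length of the longest matched prefix and the replicas holding it.
        node, matched, best = self.root, 0, self.root.replicas
        for ch in text:
            child = node.children.get(ch)
            if child is None:
                break
            node, matched = child, matched + 1
            if child.replicas:
                best = child.replicas
        return matched, set(best)

A prefix-aware router would call prefix_match(prompt) and prefer replicas with the longest match, falling back to a load-based choice (e.g. power of two choices) when the match is short; this is one way to frame the load balancing vs. prefix hit rate trade-off listed under P1.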

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

GeneDer and others added 23 commits April 1, 2025 08:04
@kouroshHakha (Contributor) commented:

This was also left over from our discussion. For v0 we need some interface + example code like this (it doesn't have to work with the YAML build pattern):

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter
from ray.serve.router import PrefixTreeDeployment
from ray.serve.replica_scheduler import PrefixAwareReplicaScheduler

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)


tree_deployment = PrefixTreeDeployment.bind()
# TODO: Somehow make tree_deployment appear when you do `serve.get_deployment_handle("xyz")`.

# Deploy the application
deployment = LLMServer.as_deployment(llm_config.get_serve_options(name_prefix="vLLM:")).bind(llm_config)
deployment = deployment.options(replica_scheduler_class=PrefixAwareReplicaScheduler)
llm_app = LLMRouter.as_deployment().bind(llm_deployments=[deployment], tree_deployment=tree_deployment)
serve.run(llm_app, blocking=True)

@eicherseiji (Contributor) commented:

Results look as expected following _benchmarking_scripts/replication_tutorial.md. We will be moving _benchmarking_scripts to an internal repo.
[Four benchmark screenshots taken 2025-06-03, 6:02 PM]


@eicherseiji (Contributor) commented:

Changing away from the default prefix-aware request router looks something like this:

from ray import serve
from ray.serve._private.request_router.pow_2_router import PowerOfTwoChoicesRequestRouter
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        ),
        request_router_class=PowerOfTwoChoicesRequestRouter
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

@kouroshHakha (Contributor) left a review:
Just one major comment about not making this request router the default. For the rest of the stuff, we can merge as is and come back to it during the next iterations.

On this diff snippet:

if count == min_count
]

def start_eviction_loop(

This should be more like a background thread (the event loop should not be kept busy because of eviction).
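
A minimal sketch of that suggestion, assuming the tree exposes some evict() method and a configurable interval (the names EvictionLoop, evict, and eviction_interval_s are illustrative, not the PR's code): run eviction on a daemon thread with a stop event, so the asyncio event loop serving requests is never blocked.

import threading
from typing import Optional


class EvictionLoop:
    """Runs tree eviction periodically on a background thread (illustrative sketch)."""

    def __init__(self, tree, eviction_interval_s: float = 30.0) -> None:
        self._tree = tree  # assumed to expose an evict() method
        self._interval_s = eviction_interval_s
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        # Daemon thread: request handling on the event loop is never tied up by eviction.
        if self._thread is not None:
            return
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # Event.wait doubles as a sleep that can be interrupted by stop().
        while not self._stop_event.wait(self._interval_s):
            self._tree.evict()

    def stop(self) -> None:
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()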

@kouroshHakha enabled auto-merge (squash) June 9, 2025 16:20
@kouroshHakha merged commit 93192cc into ray-project:master Jun 9, 2025
6 checks passed

Labels

  • community-contribution: Contributed by the community
  • go: add ONLY when ready to merge, run all tests
  • llm
  • serve: Ray Serve Related Issue
