From 7923116f1afa3d0e172fa4cdc023b9781bc3b1f4 Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Mon, 6 Jan 2025 09:27:00 +0000 Subject: [PATCH 01/11] Reorganize Serving section Signed-off-by: DarkLight1337 --- .../architecture_helm_deployment.png | Bin .../contributing/dockerfile/dockerfile.md | 2 +- .../docker.md} | 4 +- .../frameworks/bentoml.md} | 4 +- .../frameworks/cerebrium.md} | 4 +- .../frameworks/dstack.md} | 4 +- .../frameworks/helm.md} | 6 +- docs/source/deployment/frameworks/index.md | 13 +++ .../frameworks/lws.md} | 4 +- .../frameworks/skypilot.md} | 4 +- .../frameworks/triton.md} | 4 +- docs/source/deployment/integrations/index.md | 9 ++ .../integrations/kserve.md} | 4 +- .../integrations/kubeai.md} | 4 +- .../integrations/llamastack.md} | 4 +- .../k8s.md} | 6 +- .../nginx.md} | 2 +- .../getting_started/installation/hpu-gaudi.md | 2 +- docs/source/getting_started/quickstart.md | 22 ++--- docs/source/index.md | 25 ++++-- docs/source/models/loaders/index.md | 8 ++ .../loaders}/runai_model_streamer.md | 2 +- .../{serving => models/loaders}/tensorizer.md | 2 +- docs/source/serving/integrations.md | 17 ---- docs/source/serving/integrations/index.md | 8 ++ .../langchain.md} | 8 +- .../llamaindex.md} | 8 +- docs/source/serving/metrics.md | 2 +- .../{usage => serving}/multimodal_inputs.md | 0 docs/source/serving/offline_inference.md | 79 ++++++++++++++++++ .../serving/openai_compatible_server.md | 10 ++- 31 files changed, 191 insertions(+), 80 deletions(-) rename docs/source/{serving => assets/deployment}/architecture_helm_deployment.png (100%) rename docs/source/{serving/deploying_with_docker.md => deployment/docker.md} (98%) rename docs/source/{serving/deploying_with_bentoml.md => deployment/frameworks/bentoml.md} (89%) rename docs/source/{serving/deploying_with_cerebrium.md => deployment/frameworks/cerebrium.md} (98%) rename docs/source/{serving/deploying_with_dstack.md => deployment/frameworks/dstack.md} (98%) rename docs/source/{serving/deploying_with_helm.md => deployment/frameworks/helm.md} (98%) create mode 100644 docs/source/deployment/frameworks/index.md rename docs/source/{serving/deploying_with_lws.md => deployment/frameworks/lws.md} (91%) rename docs/source/{serving/run_on_sky.md => deployment/frameworks/skypilot.md} (99%) rename docs/source/{serving/deploying_with_triton.md => deployment/frameworks/triton.md} (87%) create mode 100644 docs/source/deployment/integrations/index.md rename docs/source/{serving/deploying_with_kserve.md => deployment/integrations/kserve.md} (85%) rename docs/source/{serving/deploying_with_kubeai.md => deployment/integrations/kubeai.md} (93%) rename docs/source/{serving/serving_with_llamastack.md => deployment/integrations/llamastack.md} (95%) rename docs/source/{serving/deploying_with_k8s.md => deployment/k8s.md} (99%) rename docs/source/{serving/deploying_with_nginx.md => deployment/nginx.md} (99%) create mode 100644 docs/source/models/loaders/index.md rename docs/source/{serving => models/loaders}/runai_model_streamer.md (98%) rename docs/source/{serving => models/loaders}/tensorizer.md (95%) delete mode 100644 docs/source/serving/integrations.md create mode 100644 docs/source/serving/integrations/index.md rename docs/source/serving/{serving_with_langchain.md => integrations/langchain.md} (82%) rename docs/source/serving/{serving_with_llamaindex.md => integrations/llamaindex.md} (74%) rename docs/source/{usage => serving}/multimodal_inputs.md (100%) create mode 100644 docs/source/serving/offline_inference.md diff --git 
a/docs/source/serving/architecture_helm_deployment.png b/docs/source/assets/deployment/architecture_helm_deployment.png similarity index 100% rename from docs/source/serving/architecture_helm_deployment.png rename to docs/source/assets/deployment/architecture_helm_deployment.png diff --git a/docs/source/contributing/dockerfile/dockerfile.md b/docs/source/contributing/dockerfile/dockerfile.md index 7ffec83333d7..38ea956ba8df 100644 --- a/docs/source/contributing/dockerfile/dockerfile.md +++ b/docs/source/contributing/dockerfile/dockerfile.md @@ -1,7 +1,7 @@ # Dockerfile We provide a to construct the image for running an OpenAI compatible server with vLLM. -More information about deploying with Docker can be found [here](../../serving/deploying_with_docker.md). +More information about deploying with Docker can be found [here](#deployment-docker). Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes: diff --git a/docs/source/serving/deploying_with_docker.md b/docs/source/deployment/docker.md similarity index 98% rename from docs/source/serving/deploying_with_docker.md rename to docs/source/deployment/docker.md index 844bd27800c7..2df1aca27f1e 100644 --- a/docs/source/serving/deploying_with_docker.md +++ b/docs/source/deployment/docker.md @@ -1,6 +1,6 @@ -(deploying-with-docker)= +(deployment-docker)= -# Deploying with Docker +# Using Docker ## Use vLLM's Official Docker Image diff --git a/docs/source/serving/deploying_with_bentoml.md b/docs/source/deployment/frameworks/bentoml.md similarity index 89% rename from docs/source/serving/deploying_with_bentoml.md rename to docs/source/deployment/frameworks/bentoml.md index dfa0de4f0f6d..ea0b5d1d4c93 100644 --- a/docs/source/serving/deploying_with_bentoml.md +++ b/docs/source/deployment/frameworks/bentoml.md @@ -1,6 +1,6 @@ -(deploying-with-bentoml)= +(deployment-bentoml)= -# Deploying with BentoML +# BentoML [BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-complicant image and deploy it on Kubernetes. diff --git a/docs/source/serving/deploying_with_cerebrium.md b/docs/source/deployment/frameworks/cerebrium.md similarity index 98% rename from docs/source/serving/deploying_with_cerebrium.md rename to docs/source/deployment/frameworks/cerebrium.md index 950064c8c1b1..be018dfb75d7 100644 --- a/docs/source/serving/deploying_with_cerebrium.md +++ b/docs/source/deployment/frameworks/cerebrium.md @@ -1,6 +1,6 @@ -(deploying-with-cerebrium)= +(deployment-cerebrium)= -# Deploying with Cerebrium +# Cerebrium ```{raw} html
diff --git a/docs/source/serving/deploying_with_dstack.md b/docs/source/deployment/frameworks/dstack.md similarity index 98% rename from docs/source/serving/deploying_with_dstack.md rename to docs/source/deployment/frameworks/dstack.md index 381f5f786ca2..4142c1d9f1f6 100644 --- a/docs/source/serving/deploying_with_dstack.md +++ b/docs/source/deployment/frameworks/dstack.md @@ -1,6 +1,6 @@ -(deploying-with-dstack)= +(deployment-dstack)= -# Deploying with dstack +# dstack ```{raw} html
diff --git a/docs/source/serving/deploying_with_helm.md b/docs/source/deployment/frameworks/helm.md similarity index 98% rename from docs/source/serving/deploying_with_helm.md rename to docs/source/deployment/frameworks/helm.md index 7286a0a88968..18ed29319146 100644 --- a/docs/source/serving/deploying_with_helm.md +++ b/docs/source/deployment/frameworks/helm.md @@ -1,6 +1,6 @@ -(deploying-with-helm)= +(deployment-helm)= -# Deploying with Helm +# Helm A Helm chart to deploy vLLM for Kubernetes @@ -38,7 +38,7 @@ chart **including persistent volumes** and deletes the release. ## Architecture -```{image} architecture_helm_deployment.png +```{image} /assets/deployment/architecture_helm_deployment.png ``` ## Values diff --git a/docs/source/deployment/frameworks/index.md b/docs/source/deployment/frameworks/index.md new file mode 100644 index 000000000000..6a59131d3661 --- /dev/null +++ b/docs/source/deployment/frameworks/index.md @@ -0,0 +1,13 @@ +# Using other frameworks + +```{toctree} +:maxdepth: 1 + +bentoml +cerebrium +dstack +helm +lws +skypilot +triton +``` diff --git a/docs/source/serving/deploying_with_lws.md b/docs/source/deployment/frameworks/lws.md similarity index 91% rename from docs/source/serving/deploying_with_lws.md rename to docs/source/deployment/frameworks/lws.md index 22bab419eaca..349fa83fbcb9 100644 --- a/docs/source/serving/deploying_with_lws.md +++ b/docs/source/deployment/frameworks/lws.md @@ -1,6 +1,6 @@ -(deploying-with-lws)= +(deployment-lws)= -# Deploying with LWS +# LWS LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. A major use case is for multi-host/multi-node distributed inference. diff --git a/docs/source/serving/run_on_sky.md b/docs/source/deployment/frameworks/skypilot.md similarity index 99% rename from docs/source/serving/run_on_sky.md rename to docs/source/deployment/frameworks/skypilot.md index 115873ae4929..ad93534775d3 100644 --- a/docs/source/serving/run_on_sky.md +++ b/docs/source/deployment/frameworks/skypilot.md @@ -1,6 +1,6 @@ -(on-cloud)= +(deployment-skypilot)= -# Deploying and scaling up with SkyPilot +# SkyPilot ```{raw} html
diff --git a/docs/source/serving/deploying_with_triton.md b/docs/source/deployment/frameworks/triton.md similarity index 87% rename from docs/source/serving/deploying_with_triton.md rename to docs/source/deployment/frameworks/triton.md index 9b0a6f1d54ae..94d87120159c 100644 --- a/docs/source/serving/deploying_with_triton.md +++ b/docs/source/deployment/frameworks/triton.md @@ -1,5 +1,5 @@ -(deploying-with-triton)= +(deployment-triton)= -# Deploying with NVIDIA Triton +# NVIDIA Triton The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details. diff --git a/docs/source/deployment/integrations/index.md b/docs/source/deployment/integrations/index.md new file mode 100644 index 000000000000..65f17997afe2 --- /dev/null +++ b/docs/source/deployment/integrations/index.md @@ -0,0 +1,9 @@ +# External integrations + +```{toctree} +:maxdepth: 1 + +kserve +kubeai +llamastack +``` diff --git a/docs/source/serving/deploying_with_kserve.md b/docs/source/deployment/integrations/kserve.md similarity index 85% rename from docs/source/serving/deploying_with_kserve.md rename to docs/source/deployment/integrations/kserve.md index feaeb5d0ec8a..c780fd74e8f5 100644 --- a/docs/source/serving/deploying_with_kserve.md +++ b/docs/source/deployment/integrations/kserve.md @@ -1,6 +1,6 @@ -(deploying-with-kserve)= +(deployment-kserve)= -# Deploying with KServe +# KServe vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving. diff --git a/docs/source/serving/deploying_with_kubeai.md b/docs/source/deployment/integrations/kubeai.md similarity index 93% rename from docs/source/serving/deploying_with_kubeai.md rename to docs/source/deployment/integrations/kubeai.md index 3609d7e05acd..2f5772e075d8 100644 --- a/docs/source/serving/deploying_with_kubeai.md +++ b/docs/source/deployment/integrations/kubeai.md @@ -1,6 +1,6 @@ -(deploying-with-kubeai)= +(deployment-kubeai)= -# Deploying with KubeAI +# KubeAI [KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies. diff --git a/docs/source/serving/serving_with_llamastack.md b/docs/source/deployment/integrations/llamastack.md similarity index 95% rename from docs/source/serving/serving_with_llamastack.md rename to docs/source/deployment/integrations/llamastack.md index 71dadca7ad47..474d2bdfa958 100644 --- a/docs/source/serving/serving_with_llamastack.md +++ b/docs/source/deployment/integrations/llamastack.md @@ -1,6 +1,6 @@ -(run-on-llamastack)= +(deployment-llamastack)= -# Serving with Llama Stack +# Llama Stack vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack) . 
diff --git a/docs/source/serving/deploying_with_k8s.md b/docs/source/deployment/k8s.md similarity index 99% rename from docs/source/serving/deploying_with_k8s.md rename to docs/source/deployment/k8s.md index 5f9b0e4f55ec..a7d796091b06 100644 --- a/docs/source/serving/deploying_with_k8s.md +++ b/docs/source/deployment/k8s.md @@ -1,6 +1,6 @@ -(deploying-with-k8s)= +(deployment-k8s)= -# Deploying with Kubernetes +# Using Kubernetes Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing. @@ -43,7 +43,7 @@ metadata: name: hf-token-secret namespace: default type: Opaque -stringData: +data: token: "REPLACE_WITH_TOKEN" ``` diff --git a/docs/source/serving/deploying_with_nginx.md b/docs/source/deployment/nginx.md similarity index 99% rename from docs/source/serving/deploying_with_nginx.md rename to docs/source/deployment/nginx.md index a1f00d853646..a58f791c2997 100644 --- a/docs/source/serving/deploying_with_nginx.md +++ b/docs/source/deployment/nginx.md @@ -1,6 +1,6 @@ (nginxloadbalancer)= -# Deploying with Nginx Loadbalancer +# Using Nginx This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers. diff --git a/docs/source/getting_started/installation/hpu-gaudi.md b/docs/source/getting_started/installation/hpu-gaudi.md index 94de169f51a7..1d50cef3bdc8 100644 --- a/docs/source/getting_started/installation/hpu-gaudi.md +++ b/docs/source/getting_started/installation/hpu-gaudi.md @@ -82,7 +82,7 @@ $ python setup.py develop ## Supported Features -- [Offline batched inference](#offline-batched-inference) +- [Offline inference](#offline-inference) - Online inference via [OpenAI-Compatible Server](#openai-compatible-server) - HPU autodetection - no need to manually select device within vLLM - Paged KV cache with algorithms enabled for Intel Gaudi accelerators diff --git a/docs/source/getting_started/quickstart.md b/docs/source/getting_started/quickstart.md index ff216f8af30f..a69f77d9a831 100644 --- a/docs/source/getting_started/quickstart.md +++ b/docs/source/getting_started/quickstart.md @@ -2,20 +2,20 @@ # Quickstart -This guide will help you quickly get started with vLLM to: +This guide will help you quickly get started with vLLM to perform: -- [Run offline batched inference](#offline-batched-inference) -- [Run OpenAI-compatible inference](#openai-compatible-server) +- [Offline batched inference](#quickstart-offline) +- [Online inference using OpenAI-compatible server](#quickstart-online) ## Prerequisites - OS: Linux - Python: 3.9 -- 3.12 -- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.) ## Installation -You can install vLLM using pip. It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. +If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly. +It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. ```console $ conda create -n myenv python=3.10 -y @@ -23,11 +23,13 @@ $ conda activate myenv $ pip install vllm ``` -Please refer to the [installation documentation](#installation-index) for more details on installing vLLM. 
+```{note} +For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM. +``` -(offline-batched-inference)= +(quickstart-offline)= -## Offline Batched Inference +## Offline batched inference With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: @@ -73,9 +75,9 @@ for output in outputs: print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` -(openai-compatible-server)= +(quickstart-online)= -## OpenAI-Compatible Server +## OpenAI-compatible server vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API. By default, it starts the server at `http://localhost:8000`. You can specify the address with `--host` and `--port` arguments. The server currently hosts one model at a time and implements endpoints such as [list models](https://platform.openai.com/docs/api-reference/models/list), [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create), and [create completion](https://platform.openai.com/docs/api-reference/completions/create) endpoints. diff --git a/docs/source/index.md b/docs/source/index.md index f39047497879..2ce5135174d8 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -66,19 +66,26 @@ getting_started/faq ``` ```{toctree} -:caption: Serving +:caption: Inference and Serving :maxdepth: 1 +serving/offline_inference serving/openai_compatible_server -serving/deploying_with_docker -serving/deploying_with_k8s -serving/deploying_with_helm -serving/deploying_with_nginx serving/distributed_serving serving/metrics -serving/integrations -serving/tensorizer -serving/runai_model_streamer +serving/integrations/index +serving/multimodal_inputs +``` + +```{toctree} +:caption: Deployment +:maxdepth: 1 + +deployment/docker +deployment/k8s +deployment/nginx +deployment/frameworks/index +deployment/integrations/index ``` ```{toctree} @@ -90,6 +97,7 @@ models/generative_models models/pooling_models models/adding_model models/enabling_multimodal_inputs +models/loaders/index ``` ```{toctree} @@ -97,7 +105,6 @@ models/enabling_multimodal_inputs :maxdepth: 1 usage/lora -usage/multimodal_inputs usage/tool_calling usage/structured_outputs usage/spec_decode diff --git a/docs/source/models/loaders/index.md b/docs/source/models/loaders/index.md new file mode 100644 index 000000000000..46d6ca9c0978 --- /dev/null +++ b/docs/source/models/loaders/index.md @@ -0,0 +1,8 @@ +# Alternative model loaders + +```{toctree} +:maxdepth: 1 + +runai_model_streamer +tensorizer +``` diff --git a/docs/source/serving/runai_model_streamer.md b/docs/source/models/loaders/runai_model_streamer.md similarity index 98% rename from docs/source/serving/runai_model_streamer.md rename to docs/source/models/loaders/runai_model_streamer.md index d4269050ff57..74e18a664558 100644 --- a/docs/source/serving/runai_model_streamer.md +++ b/docs/source/models/loaders/runai_model_streamer.md @@ -1,6 +1,6 @@ (runai-model-streamer)= -# Loading Models with Run:ai Model Streamer +# Run:ai Model Streamer Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory. Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md). 
diff --git a/docs/source/serving/tensorizer.md b/docs/source/models/loaders/tensorizer.md similarity index 95% rename from docs/source/serving/tensorizer.md rename to docs/source/models/loaders/tensorizer.md index d3dd29d48f73..7168237cff22 100644 --- a/docs/source/serving/tensorizer.md +++ b/docs/source/models/loaders/tensorizer.md @@ -1,6 +1,6 @@ (tensorizer)= -# Loading Models with CoreWeave's Tensorizer +# Tensorizer vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer). vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized diff --git a/docs/source/serving/integrations.md b/docs/source/serving/integrations.md deleted file mode 100644 index d214c7725425..000000000000 --- a/docs/source/serving/integrations.md +++ /dev/null @@ -1,17 +0,0 @@ -# Integrations - -```{toctree} -:maxdepth: 1 - -run_on_sky -deploying_with_kserve -deploying_with_kubeai -deploying_with_triton -deploying_with_bentoml -deploying_with_cerebrium -deploying_with_lws -deploying_with_dstack -serving_with_langchain -serving_with_llamaindex -serving_with_llamastack -``` diff --git a/docs/source/serving/integrations/index.md b/docs/source/serving/integrations/index.md new file mode 100644 index 000000000000..257cf9c5081a --- /dev/null +++ b/docs/source/serving/integrations/index.md @@ -0,0 +1,8 @@ +# External integrations + +```{toctree} +:maxdepth: 1 + +langchain +llamaindex +``` diff --git a/docs/source/serving/serving_with_langchain.md b/docs/source/serving/integrations/langchain.md similarity index 82% rename from docs/source/serving/serving_with_langchain.md rename to docs/source/serving/integrations/langchain.md index 96bd5943f3d6..49ff6e0c32a7 100644 --- a/docs/source/serving/serving_with_langchain.md +++ b/docs/source/serving/integrations/langchain.md @@ -1,10 +1,10 @@ -(run-on-langchain)= +(serving-langchain)= -# Serving with Langchain +# LangChain -vLLM is also available via [Langchain](https://github.com/langchain-ai/langchain) . +vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) . -To install langchain, run +To install LangChain, run ```console $ pip install langchain langchain_community -q diff --git a/docs/source/serving/serving_with_llamaindex.md b/docs/source/serving/integrations/llamaindex.md similarity index 74% rename from docs/source/serving/serving_with_llamaindex.md rename to docs/source/serving/integrations/llamaindex.md index 98859d8e3f82..9961c181d7e1 100644 --- a/docs/source/serving/serving_with_llamaindex.md +++ b/docs/source/serving/integrations/llamaindex.md @@ -1,10 +1,10 @@ -(run-on-llamaindex)= +(serving-llamaindex)= -# Serving with llama_index +# LlamaIndex -vLLM is also available via [llama_index](https://github.com/run-llama/llama_index) . +vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) . -To install llamaindex, run +To install LlamaIndex, run ```console $ pip install llama-index-llms-vllm -q diff --git a/docs/source/serving/metrics.md b/docs/source/serving/metrics.md index 2dc78643f6d8..e6ded2e6dd46 100644 --- a/docs/source/serving/metrics.md +++ b/docs/source/serving/metrics.md @@ -4,7 +4,7 @@ vLLM exposes a number of metrics that can be used to monitor the health of the system. These metrics are exposed via the `/metrics` endpoint on the vLLM OpenAI compatible API server. 
-You can start the server using Python, or using [Docker](deploying_with_docker.md): +You can start the server using Python, or using [Docker](#deployment-docker): ```console $ vllm serve unsloth/Llama-3.2-1B-Instruct diff --git a/docs/source/usage/multimodal_inputs.md b/docs/source/serving/multimodal_inputs.md similarity index 100% rename from docs/source/usage/multimodal_inputs.md rename to docs/source/serving/multimodal_inputs.md diff --git a/docs/source/serving/offline_inference.md b/docs/source/serving/offline_inference.md new file mode 100644 index 000000000000..0c8f90ac9cc9 --- /dev/null +++ b/docs/source/serving/offline_inference.md @@ -0,0 +1,79 @@ +(offline-inference)= + +# Offline inference + +You can run vLLM in your own code on a list of prompts. + +The offline API is based on the {class}`~vllm.LLM` class. +To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run. + +For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace +and runs it in vLLM using the default configuration. + +```python +llm = LLM(model="facebook/opt-125m") +``` + +After initializing the `LLM` instance, you can perform model inference using various APIs. +The available APIs depend on the type of model that is being run: + +- [Generative models](#generative-models) output logprobs which are sampled from to obtain the final output text. +- [Pooling models](#pooling-models) output their hidden states directly. + +Please refer to the above pages for more details about each API. + +```{seealso} +[API Reference](/dev/offline_inference/offline_index) +``` + +## Configuration options + +This section lists the most common options for running the vLLM engine. +For a full list, refer to the [Engine Arguments](#engine-args) page. + +### Reducing memory usage + +Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem. + +#### Tensor Parallelism (TP) + +Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs. + +The following code splits the model across 2 GPUs. + +```python +llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", + tensor_parallel_size=2) +``` + +```{important} +To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`) +before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`. + +To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable. +``` + +#### Quantization + +Quantized models take less memory at the cost of lower precision. + +Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Neural Magic](https://huggingface.co/neuralmagic)) +and used directly without extra configuration. + +Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details. + +#### Context length and batch size + +You can further reduce memory usage by limit the context length of the model (`max_model_len` option) +and the maximum batch size (`max_num_seqs` option). + +```python +llm = LLM(model="adept/fuyu-8b", + max_model_len=2048, + max_num_seqs=2) +``` + +### Performance optimization and tuning + +You can potentially improve the performance of vLLM by finetuning various options. 
+Please refer to [this guide](#optimization-and-tuning) for more details. diff --git a/docs/source/serving/openai_compatible_server.md b/docs/source/serving/openai_compatible_server.md index caf5e8cafd9a..9ac4c031c46e 100644 --- a/docs/source/serving/openai_compatible_server.md +++ b/docs/source/serving/openai_compatible_server.md @@ -1,8 +1,10 @@ -# OpenAI Compatible Server +(openai-compatible-server)= -vLLM provides an HTTP server that implements OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API, and more! +# OpenAI-compatible server -You can start the server via the [`vllm serve`](#vllm-serve) command, or through [Docker](deploying_with_docker.md): +vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! + +You can start the server via the [`vllm serve`](#vllm-serve) command, or through [Docker](#deployment-docker): ```bash vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123 ``` @@ -217,7 +219,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai We support both [Vision](https://platform.openai.com/docs/guides/vision)- and [Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters; -see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information. +see our [Multimodal Inputs](#multimodal-inputs) guide for more information. - *Note: `image_url.detail` parameter is not supported.* Code example: From 905ef01e7c6c9c15c9a7941150d1f9fbfe84a0a8 Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Mon, 6 Jan 2025 09:47:07 +0000 Subject: [PATCH 02/11] Apply #11679 Signed-off-by: DarkLight1337 --- docs/source/deployment/k8s.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/deployment/k8s.md b/docs/source/deployment/k8s.md index a7d796091b06..760214e112fb 100644 --- a/docs/source/deployment/k8s.md +++ b/docs/source/deployment/k8s.md @@ -43,7 +43,7 @@ metadata: name: hf-token-secret namespace: default type: Opaque -data: +stringData: token: "REPLACE_WITH_TOKEN" ``` From 886683603b2e6dd43714c22b0417619380a96754 Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Mon, 6 Jan 2025 13:48:46 +0000 Subject: [PATCH 03/11] Reorder pages Signed-off-by: DarkLight1337 --- docs/source/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/index.md b/docs/source/index.md index 0734887ce1c2..3f2f056038c9 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -71,13 +71,13 @@ getting_started/faq serving/offline_inference serving/openai_compatible_server +serving/multimodal_inputs serving/distributed_serving serving/metrics -serving/integrations/index -serving/multimodal_inputs serving/engine_args serving/env_vars serving/usage_stats +serving/integrations/index ``` ```{toctree} From 4f66ed82746b41d2f3705ed54f1bd9b6c9942aab Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Mon, 6 Jan 2025 14:06:16 +0000 Subject: [PATCH 04/11] Update headers and remove unnecessary code directives Signed-off-by: DarkLight1337 --- docs/source/deployment/frameworks/skypilot.md | 4 ++-- docs/source/deployment/integrations/index.md | 2 +- docs/source/design/arch_overview.md | 2 +- docs/source/getting_started/installation/gpu-rocm.md | 2 +- 
docs/source/getting_started/quickstart.md | 4 ++-- docs/source/serving/distributed_serving.md | 12 ++++++------ docs/source/serving/integrations/index.md | 2 +- docs/source/serving/multimodal_inputs.md | 12 ++++++------ docs/source/serving/offline_inference.md | 4 ++-- docs/source/serving/openai_compatible_server.md | 2 +- docs/source/serving/usage_stats.md | 2 +- 11 files changed, 24 insertions(+), 24 deletions(-) diff --git a/docs/source/deployment/frameworks/skypilot.md b/docs/source/deployment/frameworks/skypilot.md index ad93534775d3..f02a94302692 100644 --- a/docs/source/deployment/frameworks/skypilot.md +++ b/docs/source/deployment/frameworks/skypilot.md @@ -12,9 +12,9 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet ## Prerequisites -- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model {code}`meta-llama/Meta-Llama-3-8B-Instruct`. +- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-8B-Instruct`. - Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)). -- Check that {code}`sky check` shows clouds or Kubernetes are enabled. +- Check that `sky check` shows clouds or Kubernetes are enabled. ```console pip install skypilot-nightly diff --git a/docs/source/deployment/integrations/index.md b/docs/source/deployment/integrations/index.md index 65f17997afe2..d47ede896754 100644 --- a/docs/source/deployment/integrations/index.md +++ b/docs/source/deployment/integrations/index.md @@ -1,4 +1,4 @@ -# External integrations +# External Integrations ```{toctree} :maxdepth: 1 diff --git a/docs/source/design/arch_overview.md b/docs/source/design/arch_overview.md index 2f1280c04767..5e0dd021ad02 100644 --- a/docs/source/design/arch_overview.md +++ b/docs/source/design/arch_overview.md @@ -57,7 +57,7 @@ More API details can be found in the {doc}`Offline Inference The code for the `LLM` class can be found in . -### OpenAI-compatible API server +### OpenAI-Compatible API Server The second primary interface to vLLM is via its OpenAI-compatible API server. This server can be started using the `vllm serve` command. diff --git a/docs/source/getting_started/installation/gpu-rocm.md b/docs/source/getting_started/installation/gpu-rocm.md index 796911d7305a..e36b92513e31 100644 --- a/docs/source/getting_started/installation/gpu-rocm.md +++ b/docs/source/getting_started/installation/gpu-rocm.md @@ -148,7 +148,7 @@ $ export PYTORCH_ROCM_ARCH="gfx90a;gfx942" $ python3 setup.py develop ``` -This may take 5-10 minutes. Currently, {code}`pip install .` does not work for ROCm installation. +This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation. ```{tip} - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers. diff --git a/docs/source/getting_started/quickstart.md b/docs/source/getting_started/quickstart.md index a69f77d9a831..3f9556165ece 100644 --- a/docs/source/getting_started/quickstart.md +++ b/docs/source/getting_started/quickstart.md @@ -29,7 +29,7 @@ For non-CUDA platforms, please refer [here](#installation-index) for specific in (quickstart-offline)= -## Offline batched inference +## Offline Batched Inference With vLLM installed, you can start generating texts for list of input prompts (i.e. 
offline batch inferencing). See the example script: @@ -77,7 +77,7 @@ for output in outputs: (quickstart-online)= -## OpenAI-compatible server +## OpenAI-Compatible Server vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API. By default, it starts the server at `http://localhost:8000`. You can specify the address with `--host` and `--port` arguments. The server currently hosts one model at a time and implements endpoints such as [list models](https://platform.openai.com/docs/api-reference/models/list), [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create), and [create completion](https://platform.openai.com/docs/api-reference/completions/create) endpoints. diff --git a/docs/source/serving/distributed_serving.md b/docs/source/serving/distributed_serving.md index 6fbc1ea10467..b1703249d722 100644 --- a/docs/source/serving/distributed_serving.md +++ b/docs/source/serving/distributed_serving.md @@ -18,13 +18,13 @@ After adding enough GPUs and nodes to hold the model, you can run vLLM first, wh There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs. ``` -## Details for Distributed Inference and Serving +## Running vLLM on a single node vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inferencing currently requires Ray. -Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured {code}`tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the {code}`LLM` class {code}`distributed_executor_backend` argument or {code}`--distributed-executor-backend` API server argument. Set it to {code}`mp` for multiprocessing or {code}`ray` for Ray. It's not required for Ray to be installed for the multiprocessing case. +Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured `tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the `LLM` class `distributed_executor_backend` argument or `--distributed-executor-backend` API server argument. Set it to `mp` for multiprocessing or `ray` for Ray. It's not required for Ray to be installed for the multiprocessing case. -To run multi-GPU inference with the {code}`LLM` class, set the {code}`tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs: +To run multi-GPU inference with the `LLM` class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. 
For example, to run inference on 4 GPUs: ```python from vllm import LLM @@ -32,14 +32,14 @@ llm = LLM("facebook/opt-13b", tensor_parallel_size=4) output = llm.generate("San Franciso is a") ``` -To run multi-GPU serving, pass in the {code}`--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs: +To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs: ```console $ vllm serve facebook/opt-13b \ $ --tensor-parallel-size 4 ``` -You can also additionally specify {code}`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism: +You can also additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism: ```console $ vllm serve gpt2 \ @@ -47,7 +47,7 @@ $ --tensor-parallel-size 4 \ $ --pipeline-parallel-size 2 ``` -## Multi-Node Inference and Serving +## Running vLLM on multiple nodes If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path, the Python environment. The recommended way is to use docker images to ensure the same environment, and hide the heterogeneity of the host machines via mapping them into the same docker configuration. diff --git a/docs/source/serving/integrations/index.md b/docs/source/serving/integrations/index.md index 257cf9c5081a..371c284981ce 100644 --- a/docs/source/serving/integrations/index.md +++ b/docs/source/serving/integrations/index.md @@ -1,4 +1,4 @@ -# External integrations +# External Integrations ```{toctree} :maxdepth: 1 diff --git a/docs/source/serving/multimodal_inputs.md b/docs/source/serving/multimodal_inputs.md index 4f45a9f448cf..0efa09f2869c 100644 --- a/docs/source/serving/multimodal_inputs.md +++ b/docs/source/serving/multimodal_inputs.md @@ -18,7 +18,7 @@ To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType` ### Image -You can pass a single image to the {code}`'image'` field of the multi-modal dictionary, as shown in the following examples: +You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples: ```python llm = LLM(model="llava-hf/llava-1.5-7b-hf") @@ -122,21 +122,21 @@ for o in outputs: ### Video -You can pass a list of NumPy arrays directly to the {code}`'video'` field of the multi-modal dictionary +You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary instead of using multi-image input. Full example: ### Audio -You can pass a tuple {code}`(array, sampling_rate)` to the {code}`'audio'` field of the multi-modal dictionary. +You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary. Full example: ### Embedding To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model, -pass a tensor of shape {code}`(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary. +pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary. 
```python # Inference with image embeddings as input @@ -294,7 +294,7 @@ $ export VLLM_IMAGE_FETCH_TIMEOUT= ### Video -Instead of {code}`image_url`, you can pass a video file via {code}`video_url`. Here is a simple example using [LLaVA-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf). +Instead of `image_url`, you can pass a video file via `video_url`. Here is a simple example using [LLaVA-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf). First, launch the OpenAI-compatible server: @@ -418,7 +418,7 @@ result = chat_completion_from_base64.choices[0].message.content print("Chat completion output from input audio:", result) ``` -Alternatively, you can pass {code}`audio_url`, which is the audio counterpart of {code}`image_url` for image input: +Alternatively, you can pass `audio_url`, which is the audio counterpart of `image_url` for image input: ```python chat_completion_from_url = client.chat.completions.create( diff --git a/docs/source/serving/offline_inference.md b/docs/source/serving/offline_inference.md index 0c8f90ac9cc9..83178f781182 100644 --- a/docs/source/serving/offline_inference.md +++ b/docs/source/serving/offline_inference.md @@ -1,6 +1,6 @@ (offline-inference)= -# Offline inference +# Offline Inference You can run vLLM in your own code on a list of prompts. @@ -26,7 +26,7 @@ Please refer to the above pages for more details about each API. [API Reference](/dev/offline_inference/offline_index) ``` -## Configuration options +## Configuration Options This section lists the most common options for running the vLLM engine. For a full list, refer to the [Engine Arguments](#engine-args) page. diff --git a/docs/source/serving/openai_compatible_server.md b/docs/source/serving/openai_compatible_server.md index 9ac4c031c46e..1e5ea6357d20 100644 --- a/docs/source/serving/openai_compatible_server.md +++ b/docs/source/serving/openai_compatible_server.md @@ -1,6 +1,6 @@ (openai-compatible-server)= -# OpenAI-compatible server +# OpenAI-Compatible Server vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! 
diff --git a/docs/source/serving/usage_stats.md b/docs/source/serving/usage_stats.md index 3d02fbab9216..cfc3cb257687 100644 --- a/docs/source/serving/usage_stats.md +++ b/docs/source/serving/usage_stats.md @@ -45,7 +45,7 @@ You can preview the collected data by running the following command: tail ~/.config/vllm/usage_stats.json ``` -## Opt-out of Usage Stats Collection +## Opting out You can opt-out of usage stats collection by setting the `VLLM_NO_USAGE_STATS` or `DO_NOT_TRACK` environment variable, or by creating a `~/.config/vllm/do_not_track` file: From 9a52dd1d4490a655df87aaa4b5a4d6b91654e7c6 Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Mon, 6 Jan 2025 14:07:25 +0000 Subject: [PATCH 05/11] Rename Signed-off-by: DarkLight1337 --- docs/source/models/loaders/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/models/loaders/index.md b/docs/source/models/loaders/index.md index 46d6ca9c0978..fba51a662728 100644 --- a/docs/source/models/loaders/index.md +++ b/docs/source/models/loaders/index.md @@ -1,4 +1,4 @@ -# Alternative model loaders +# Alternative Model Loaders ```{toctree} :maxdepth: 1 From 1f93b4117a056a62f697e6a2f6ee5237360a6dff Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Mon, 6 Jan 2025 14:20:43 +0000 Subject: [PATCH 06/11] Move Models and Features up Signed-off-by: DarkLight1337 --- docs/source/index.md | 48 ++++++++++++++++++++++---------------------- 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/docs/source/index.md b/docs/source/index.md index 3f2f056038c9..c44e29ec3cc7 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -65,6 +65,30 @@ getting_started/troubleshooting getting_started/faq ``` +```{toctree} +:caption: Models +:maxdepth: 1 + +models/supported_models +models/generative_models +models/pooling_models +models/loaders/index +``` + +```{toctree} +:caption: Features +:maxdepth: 1 + +features/quantization/index +features/lora +features/tool_calling +features/structured_outputs +features/automatic_prefix_caching +features/disagg_prefill +features/spec_decode +features/compatibility_matrix +``` + ```{toctree} :caption: Inference and Serving :maxdepth: 1 @@ -91,30 +115,6 @@ deployment/frameworks/index deployment/integrations/index ``` -```{toctree} -:caption: Models -:maxdepth: 1 - -models/supported_models -models/generative_models -models/pooling_models -models/loaders/index -``` - -```{toctree} -:caption: Features -:maxdepth: 1 - -features/quantization/index -features/lora -features/tool_calling -features/structured_outputs -features/automatic_prefix_caching -features/disagg_prefill -features/spec_decode -features/compatibility_matrix -``` - ```{toctree} :caption: Performance :maxdepth: 1 From 25ae748f1045aefd41b18bce2cbe6768a889ee3e Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Mon, 6 Jan 2025 14:23:19 +0000 Subject: [PATCH 07/11] Move model loaders to features Signed-off-by: DarkLight1337 --- docs/source/{models/loaders => features/model_loaders}/index.md | 0 .../loaders => features/model_loaders}/runai_model_streamer.md | 0 .../{models/loaders => features/model_loaders}/tensorizer.md | 0 docs/source/index.md | 2 +- 4 files changed, 1 insertion(+), 1 deletion(-) rename docs/source/{models/loaders => features/model_loaders}/index.md (100%) rename docs/source/{models/loaders => features/model_loaders}/runai_model_streamer.md (100%) rename docs/source/{models/loaders => features/model_loaders}/tensorizer.md (100%) diff --git a/docs/source/models/loaders/index.md 
b/docs/source/features/model_loaders/index.md similarity index 100% rename from docs/source/models/loaders/index.md rename to docs/source/features/model_loaders/index.md diff --git a/docs/source/models/loaders/runai_model_streamer.md b/docs/source/features/model_loaders/runai_model_streamer.md similarity index 100% rename from docs/source/models/loaders/runai_model_streamer.md rename to docs/source/features/model_loaders/runai_model_streamer.md diff --git a/docs/source/models/loaders/tensorizer.md b/docs/source/features/model_loaders/tensorizer.md similarity index 100% rename from docs/source/models/loaders/tensorizer.md rename to docs/source/features/model_loaders/tensorizer.md diff --git a/docs/source/index.md b/docs/source/index.md index c44e29ec3cc7..4270b44141e0 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -72,13 +72,13 @@ getting_started/faq models/supported_models models/generative_models models/pooling_models -models/loaders/index ``` ```{toctree} :caption: Features :maxdepth: 1 +features/model_loaders/index features/quantization/index features/lora features/tool_calling From b53d87bdaf15b9d7334453812254a9fd883e5dfc Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Mon, 6 Jan 2025 14:26:49 +0000 Subject: [PATCH 08/11] Update headers Signed-off-by: DarkLight1337 --- docs/source/features/disagg_prefill.md | 8 ++++++-- docs/source/features/spec_decode.md | 2 +- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/docs/source/features/disagg_prefill.md b/docs/source/features/disagg_prefill.md index 05226f2dec87..645dc60807dd 100644 --- a/docs/source/features/disagg_prefill.md +++ b/docs/source/features/disagg_prefill.md @@ -1,8 +1,12 @@ (disagg-prefill)= -# Disaggregated prefilling (experimental) +# Disaggregated Prefilling (experimental) -This page introduces you the disaggregated prefilling feature in vLLM. This feature is experimental and subject to change. +This page introduces you the disaggregated prefilling feature in vLLM. + +```{note} +This feature is experimental and subject to change. +``` ## Why disaggregated prefilling? 
diff --git a/docs/source/features/spec_decode.md b/docs/source/features/spec_decode.md index 8c52c97a41e4..bc8a0aa14dc5 100644 --- a/docs/source/features/spec_decode.md +++ b/docs/source/features/spec_decode.md @@ -1,6 +1,6 @@ (spec-decode)= -# Speculative decoding +# Speculative Decoding ```{warning} Please note that speculative decoding in vLLM is not yet optimized and does From 2793b0885b80a796f3af11749d7cd7241289b98b Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Mon, 6 Jan 2025 14:44:24 +0000 Subject: [PATCH 09/11] Rearrange Signed-off-by: DarkLight1337 --- docs/source/contributing/model/registration.md | 4 ++-- docs/source/index.md | 4 ++-- .../{features/model_loaders => models/extensions}/index.md | 2 +- .../extensions}/runai_model_streamer.md | 2 +- .../model_loaders => models/extensions}/tensorizer.md | 2 +- docs/source/models/supported_models.md | 2 +- 6 files changed, 8 insertions(+), 8 deletions(-) rename docs/source/{features/model_loaders => models/extensions}/index.md (69%) rename docs/source/{features/model_loaders => models/extensions}/runai_model_streamer.md (98%) rename docs/source/{features/model_loaders => models/extensions}/tensorizer.md (95%) diff --git a/docs/source/contributing/model/registration.md b/docs/source/contributing/model/registration.md index cf1cdb0c9de0..fe5aa94c5289 100644 --- a/docs/source/contributing/model/registration.md +++ b/docs/source/contributing/model/registration.md @@ -3,7 +3,7 @@ # Model Registration vLLM relies on a model registry to determine how to run each model. -A list of pre-registered architectures can be found on the [Supported Models](#supported-models) page. +A list of pre-registered architectures can be found [here](#supported-models). If your model is not on this list, you must register it to vLLM. This page provides detailed instructions on how to do so. @@ -16,7 +16,7 @@ This gives you the ability to modify the codebase and test your model. After you have implemented your model (see [tutorial](#new-model-basic)), put it into the directory. Then, add your model class to `_VLLM_MODELS` in so that it is automatically registered upon importing vLLM. You should also include an example HuggingFace repository for this model in to run the unit tests. -Finally, update the [Supported Models](#supported-models) documentation page to promote your model! +Finally, update our [list of supported models](#supported-models) to promote your model! ```{important} The list of models in each section should be maintained in alphabetical order. 
diff --git a/docs/source/index.md b/docs/source/index.md index 4270b44141e0..c335155bd6e1 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -69,16 +69,16 @@ getting_started/faq :caption: Models :maxdepth: 1 -models/supported_models models/generative_models models/pooling_models +models/supported_models +models/extensions/index ``` ```{toctree} :caption: Features :maxdepth: 1 -features/model_loaders/index features/quantization/index features/lora features/tool_calling diff --git a/docs/source/features/model_loaders/index.md b/docs/source/models/extensions/index.md similarity index 69% rename from docs/source/features/model_loaders/index.md rename to docs/source/models/extensions/index.md index fba51a662728..cff09d12eba4 100644 --- a/docs/source/features/model_loaders/index.md +++ b/docs/source/models/extensions/index.md @@ -1,4 +1,4 @@ -# Alternative Model Loaders +# Built-in Extensions ```{toctree} :maxdepth: 1 diff --git a/docs/source/features/model_loaders/runai_model_streamer.md b/docs/source/models/extensions/runai_model_streamer.md similarity index 98% rename from docs/source/features/model_loaders/runai_model_streamer.md rename to docs/source/models/extensions/runai_model_streamer.md index 74e18a664558..fe2701194a60 100644 --- a/docs/source/features/model_loaders/runai_model_streamer.md +++ b/docs/source/models/extensions/runai_model_streamer.md @@ -1,6 +1,6 @@ (runai-model-streamer)= -# Run:ai Model Streamer +# Loading models with Run:ai Model Streamer Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory. Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md). diff --git a/docs/source/features/model_loaders/tensorizer.md b/docs/source/models/extensions/tensorizer.md similarity index 95% rename from docs/source/features/model_loaders/tensorizer.md rename to docs/source/models/extensions/tensorizer.md index 7168237cff22..42ed5c795dd2 100644 --- a/docs/source/features/model_loaders/tensorizer.md +++ b/docs/source/models/extensions/tensorizer.md @@ -1,6 +1,6 @@ (tensorizer)= -# Tensorizer +# Loading models with CoreWeave's Tensorizer vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer). vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized diff --git a/docs/source/models/supported_models.md b/docs/source/models/supported_models.md index 5a2778026192..e3f157c477d2 100644 --- a/docs/source/models/supported_models.md +++ b/docs/source/models/supported_models.md @@ -1,6 +1,6 @@ (supported-models)= -# Supported Models +# List of supported models vLLM supports generative and pooling models across various tasks. If a model supports more than one task, you can set the task via the {code}`--task` argument. 
From 204ca1ce755969c8129bd837f3e82a053468db2b Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Mon, 6 Jan 2025 14:48:35 +0000 Subject: [PATCH 10/11] Remove unnecessary code directives Signed-off-by: DarkLight1337 --- docs/source/models/supported_models.md | 42 +++++++++++++------------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/docs/source/models/supported_models.md b/docs/source/models/supported_models.md index e3f157c477d2..4fd0fec14193 100644 --- a/docs/source/models/supported_models.md +++ b/docs/source/models/supported_models.md @@ -3,7 +3,7 @@ # List of supported models vLLM supports generative and pooling models across various tasks. -If a model supports more than one task, you can set the task via the {code}`--task` argument. +If a model supports more than one task, you can set the task via the `--task` argument. For each task, we list the model architectures that have been implemented in vLLM. Alongside each architecture, we include some popular models that use it. @@ -14,8 +14,8 @@ Alongside each architecture, we include some popular models that use it. By default, vLLM loads models from [HuggingFace (HF) Hub](https://huggingface.co/models). -To determine whether a given model is supported, you can check the {code}`config.json` file inside the HF repository. -If the {code}`"architectures"` field contains a model architecture listed below, then it should be supported in theory. +To determine whether a given model is supported, you can check the `config.json` file inside the HF repository. +If the `"architectures"` field contains a model architecture listed below, then it should be supported in theory. ````{tip} The easiest way to check if your model is really supported at runtime is to run the program below: @@ -48,7 +48,7 @@ To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFac $ export VLLM_USE_MODELSCOPE=True ``` -And use with {code}`trust_remote_code=True`. +And use with `trust_remote_code=True`. ```python from vllm import LLM @@ -420,15 +420,15 @@ you should explicitly specify the task type to ensure that the model is used in ``` ```{note} -{code}`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config. -You should manually set mean pooling by passing {code}`--override-pooler-config '{"pooling_type": "MEAN"}'`. +`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config. +You should manually set mean pooling by passing `--override-pooler-config '{"pooling_type": "MEAN"}'`. ``` ```{note} -Unlike base Qwen2, {code}`Alibaba-NLP/gte-Qwen2-7B-instruct` uses bi-directional attention. -You can set {code}`--hf-overrides '{"is_causal": false}'` to change the attention mask accordingly. +Unlike base Qwen2, `Alibaba-NLP/gte-Qwen2-7B-instruct` uses bi-directional attention. +You can set `--hf-overrides '{"is_causal": false}'` to change the attention mask accordingly. -On the other hand, its 1.5B variant ({code}`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention +On the other hand, its 1.5B variant (`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention despite being described otherwise on its model card. ``` @@ -468,8 +468,8 @@ If your model is not in the above list, we will try to automatically convert the {func}`vllm.model_executor.models.adapters.as_reward_model`. By default, we return the hidden states of each token directly. 
```{important} -For process-supervised reward models such as {code}`peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly, -e.g.: {code}`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`. +For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly, +e.g.: `--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`. ``` #### Classification (`--task classify`) @@ -537,13 +537,13 @@ The following modalities are supported depending on the model: - **V**ideo - **A**udio -Any combination of modalities joined by {code}`+` are supported. +Any combination of modalities joined by `+` are supported. -- e.g.: {code}`T + I` means that the model supports text-only, image-only, and text-with-image inputs. +- e.g.: `T + I` means that the model supports text-only, image-only, and text-with-image inputs. -On the other hand, modalities separated by {code}`/` are mutually exclusive. +On the other hand, modalities separated by `/` are mutually exclusive. -- e.g.: {code}`T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs. +- e.g.: `T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs. See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model. @@ -731,8 +731,8 @@ See [this page](#generative-models) for more information on how to use generativ + Multiple items can be inputted per text prompt for this modality. ````{important} -To enable multiple multi-modal items per text prompt, you have to set {code}`limit_mm_per_prompt` (offline inference) -or {code}`--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt: +To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference) +or `--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt: ```python llm = LLM( @@ -751,11 +751,11 @@ vLLM currently only supports adding LoRA to the language backbone of multimodal ``` ```{note} -To use {code}`TIGER-Lab/Mantis-8B-siglip-llama3`, you have pass {code}`--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM. +To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM. ``` ```{note} -The official {code}`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork ({code}`HwwwH/MiniCPM-V-2`) for now. +The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now. For more details, please see: ``` @@ -770,7 +770,7 @@ you should explicitly specify the task type to ensure that the model is used in #### Text Embedding (`--task embed`) -Any text generation model can be converted into an embedding model by passing {code}`--task embed`. +Any text generation model can be converted into an embedding model by passing `--task embed`. ```{note} To get the best results, you should use pooling models that are specifically trained as such. @@ -818,7 +818,7 @@ At vLLM, we are committed to facilitating the integration and support of third-p 2. 
**Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results. ```{tip} -When comparing the output of {code}`model.generate` from HuggingFace Transformers with the output of {code}`llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs. +When comparing the output of `model.generate` from HuggingFace Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs. ``` 3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback. From 9d2e4e459b8458957232906dfe8fec64b385e744 Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Mon, 6 Jan 2025 14:49:14 +0000 Subject: [PATCH 11/11] Update header Signed-off-by: DarkLight1337 --- README.md | 2 +- docs/source/models/supported_models.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index f83c9d759b35..652268ec29ca 100644 --- a/README.md +++ b/README.md @@ -77,7 +77,7 @@ pip install vllm Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to learn more. - [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html) - [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) -- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) +- [List of Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) ## Contributing diff --git a/docs/source/models/supported_models.md b/docs/source/models/supported_models.md index 4fd0fec14193..b025e18b2989 100644 --- a/docs/source/models/supported_models.md +++ b/docs/source/models/supported_models.md @@ -1,6 +1,6 @@ (supported-models)= -# List of supported models +# List of Supported Models vLLM supports generative and pooling models across various tasks. If a model supports more than one task, you can set the task via the `--task` argument.