
Commit 646ffea

Add benchmarking folder with common config set ups - prefix cache aware example and chart
1 parent 9286c12 commit 646ffea

File tree

10 files changed: +477, -0 lines


benchmarking/Chart.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
apiVersion: v2
name: precise-prefix-cache-aware
description: A Helm chart for precise-prefix-cache-aware benchmarking
version: 0.1.0
appVersion: "1.0"

benchmarking/README.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Benchmarking Helm Chart

This Helm chart deploys the `inference-perf` benchmarking tool. This guide will walk you through deploying a basic benchmarking job. By default, the configuration uses the `shareGPT` dataset.

## Prerequisites

Before you begin, ensure you have the following:

* **Helm 3+**: [Installation Guide](https://helm.sh/docs/intro/install/)
* **Kubernetes Cluster**: Access to a Kubernetes cluster
* **Gateway Deployed**: Your inference server/gateway must be deployed and accessible within the cluster.

**Hugging Face Token Secret**

The benchmark requires a Hugging Face token to pull models. Create a Kubernetes Secret named `hf-token` (or a custom name you provide) in your target namespace, containing your Hugging Face token.

To create this secret:

```bash
export _HF_TOKEN='<YOUR_HF_TOKEN>'
kubectl create secret generic hf-token --from-literal=token=$_HF_TOKEN
```
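
As a quick sanity check before installing the chart, you can confirm the secret exists in the target namespace (adjust the name if you chose a custom one):

```bash
# Shows the secret's keys (should include "token") without printing the value.
kubectl describe secret hf-token
```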

## Deployment

To deploy the benchmarking chart:

```bash
export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'
helm install benchmark . -f benchmark-values.yaml \
  --set hfTokenSecret.name=hf-token \
  --set hfTokenSecret.key=token \
  --set "config.server.base_url=http://${IP}:${PORT}"
```

**Parameters to customize:**

* `benchmark`: A unique name for this deployment.
* `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
* `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
* `config.server.base_url`: The base URL (IP and port) of your inference server.
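
The chart runs `inference-perf` as a Kubernetes Job. To follow a run after installation, you can watch the Job and stream its logs; the Job name below is an assumption based on the release name, so check `kubectl get jobs` for the actual name the chart generates:

```bash
# List Jobs created by the release and stream the benchmark logs.
kubectl get jobs
kubectl logs -f job/benchmark   # assumed Job name for the "benchmark" release; substitute the real one
```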
### Storage Parameters

The following shows how to add storage to the config. By default, reports are saved to local storage; however, the pod is deleted once the inference-perf job completes, so configure remote storage if you need the reports to persist.

```yaml
storage:
  local_storage:
    path: "reports-{timestamp}"        # Local directory path
    report_file_prefix: null           # Optional filename prefix
  google_cloud_storage:                # Optional GCS configuration
    bucket_name: "your-bucket-name"    # Required GCS bucket
    path: "reports-{timestamp}"        # Optional path prefix
    report_file_prefix: null           # Optional filename prefix
  simple_storage_service:
    bucket_name: "your-bucket-name"    # Required S3 bucket
    path: "reports-{timestamp}"        # Optional path prefix
    report_file_prefix: null           # Optional filename prefix
```
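
For example, to send reports to a GCS bucket without editing the values file (the bucket name here is a placeholder), the same field can be set on the command line at install time:

```bash
# Override the GCS bucket used for reports; any config.storage.* field can be set this way.
helm install benchmark . -f benchmark-values.yaml \
  --set config.storage.google_cloud_storage.bucket_name=your-bucket-name
```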

## Uninstalling the Chart

To uninstall the deployed chart:

```bash
helm uninstall benchmark
```

benchmarking/benchmark-values.yaml

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# Benchmark Configuration (shareGPT dataset)
job:
  image: "quay.io/inference-perf/inference-perf:latest"
  memory: "8G"

logLevel: INFO

hfTokenSecret:
  name: hf-token
  key: token

config:
  load:
    type: constant
    interval: 15
    stages:
      - rate: 10
        duration: 20
      - rate: 20
        duration: 20
      - rate: 30
        duration: 20
  api:
    type: completion
    streaming: true
  server:
    type: vllm
    model_name: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://0.0.0.0:8000
    ignore_eos: true
  tokenizer:
    pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  data:
    type: shareGPT
  storage:
    google_cloud_storage:
      bucket_name: "inference-perf-results"
      report_file_prefix: benchmark
  metrics:
    type: prometheus
    prometheus:
      google_managed: true
  report:
    request_lifecycle:
      summary: true
      per_stage: true
      per_request: true
    prometheus:
      summary: true
      per_stage: true
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
# Precise Prefix Cache Aware Benchmarking Helm Chart

This Helm chart deploys the `inference-perf` benchmarking tool with two distinct configurations: a high-cache scenario and a low-cache scenario. This chart specifically utilizes the **shared prefix dataset** for benchmarking. This guide will walk you through deploying both.

## Prerequisites

Before you begin, ensure you have the following:

* **Helm 3+**: [Installation Guide](https://helm.sh/docs/intro/install/)
* **Kubernetes Cluster**: Access to a Kubernetes cluster
* **Gateway Deployed**: Your inference server/gateway must be deployed and accessible within the cluster.

**Hugging Face Token Secret**

The benchmark requires a Hugging Face token to pull models. Create a Kubernetes Secret named `hf-token` (or a custom name you provide) in your target namespace, containing your Hugging Face token.

To create this secret:

```bash
export _HF_TOKEN='<YOUR_HF_TOKEN>'
kubectl create secret generic hf-token --from-literal=token=$_HF_TOKEN
```

## Shared Prefix Dataset Configuration

The chart uses the `shared_prefix` dataset type, which is designed to test caching efficiency. These parameters are located under `config.data.shared_prefix`:

* `num_groups`: The number of shared prefix groups.
* `num_prompts_per_group`: The number of prompts within each shared prefix group.
* `system_prompt_len`: The length of the system prompt.
* `question_len`: The length of the question part of the prompt.
* `output_len`: The desired length of the model's output.

The default values for the dataset are defined in the chart, but you can override them using `--set config.data.shared_prefix.<parameter>` flags.

Example:

```bash
helm install my-release . -f high-cache-values.yaml --set config.data.shared_prefix.num_groups=512
```
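
For reference, this is how the same parameters appear in a values file; the numbers below mirror the defaults in `high-cache-values.yaml` included with this chart:

```yaml
config:
  data:
    type: shared_prefix
    shared_prefix:
      num_groups: 256            # number of shared prefix groups
      num_prompts_per_group: 16  # prompts within each group
      system_prompt_len: 2048    # length of the shared system prompt
      question_len: 256          # length of the question part of the prompt
      output_len: 256            # desired length of the model's output
```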

## Deployment

This chart supports two main configurations, defined in `high-cache-values.yaml` and `low-cache-values.yaml`.

### 1. Deploying the High-Cache Configuration

This configuration is optimized for scenarios where a high cache hit rate is expected. It uses the `high-cache-values.yaml` file, where the long shared system prompt (2048 tokens) dominates the short question (256 tokens), so most of each prompt can be served from the prefix cache.

```bash
export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'
helm install high-cache . -f high-cache-values.yaml \
  --set hfTokenSecret.name=hf-token \
  --set hfTokenSecret.key=token \
  --set "config.server.base_url=http://${IP}:${PORT}"
```

**Parameters to customize:**

* `high-cache`: A unique name for this deployment.
* `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
* `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
* `config.server.base_url`: The base URL (IP and port) of your inference server for the high-cache scenario.

### 2. Deploying the Low-Cache Configuration

This configuration is designed for scenarios with a lower cache hit rate. It uses the `low-cache-values.yaml` file, where the short system prompt (256 tokens) is outweighed by the long unique question (2048 tokens), so far less of each prompt is reusable from the cache.

```bash
export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'
helm install low-cache . -f low-cache-values.yaml \
  --set hfTokenSecret.name=hf-token \
  --set hfTokenSecret.key=token \
  --set "config.server.base_url=http://${IP}:${PORT}"
```

**Parameters to customize:**

* `low-cache`: A unique name for this deployment.
* `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
* `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
* `config.server.base_url`: The base URL (IP and port) of your inference server for the low-cache scenario.
## Uninstalling the Charts

To uninstall the deployed charts:

```bash
helm uninstall high-cache
helm uninstall low-cache
```
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
# High-Cache Configuration
job:
  image: "quay.io/inference-perf/inference-perf:latest"
  memory: "8G"

logLevel: INFO

hfTokenSecret:
  name: hf-token
  key: token

config:
  load:
    type: constant
    interval: 15
    stages:
      - rate: 100
        duration: 30
      - rate: 200
        duration: 30
    worker_max_concurrency: 1000
  api:
    type: completion
    streaming: true
  server:
    type: vllm
    model_name: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://0.0.0.0:8000
    ignore_eos: true
  tokenizer:
    pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  data:
    type: shared_prefix
    shared_prefix:
      num_groups: 256
      num_prompts_per_group: 16
      system_prompt_len: 2048
      question_len: 256
      output_len: 256
  metrics:
    type: prometheus
    prometheus:
      google_managed: true
  report:
    request_lifecycle:
      summary: true
      per_stage: true
      per_request: true
    prometheus:
      summary: true
      per_stage: true
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# Low-Cache Configuration
job:
  image: "quay.io/inference-perf/inference-perf:latest"
  memory: "8G"

logLevel: INFO

hfTokenSecret:
  name: hf-token
  key: token

config:
  load:
    type: constant
    interval: 15
    stages:
      - rate: 100
        duration: 30
      - rate: 200
        duration: 30
      - rate: 300
        duration: 30
      - rate: 400
        duration: 30
      - rate: 500
        duration: 30
      - rate: 600
        duration: 30
      - rate: 700
        duration: 30
      - rate: 800
        duration: 30
    worker_max_concurrency: 1000
  api:
    type: completion
    streaming: true
  server:
    type: vllm
    model_name: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://0.0.0.0:8000
    ignore_eos: true
  tokenizer:
    pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  data:
    type: shared_prefix
    shared_prefix:
      num_groups: 256
      num_prompts_per_group: 16
      system_prompt_len: 256   # Low-cache setting
      question_len: 2048       # Low-cache setting
      output_len: 256
  metrics:
    type: prometheus
    prometheus:
      google_managed: true
  report:
    request_lifecycle:
      summary: true
      per_stage: true
      per_request: true
    prometheus:
      summary: true
      per_stage: true
