
@kzawora-intel commented Sep 27, 2024

This PR allows the user to calibrate bucket usage and to load a bucket configuration from a file.

Design

When the VLLM_HPU_CALIBRATE_BUCKETS=true environment variable is set, warmup is disabled, and upon destruction the server stores the bucket configuration and the buckets that were actually used in a YAML file (whose path can optionally be set via the VLLM_HPU_BUCKET_CFG environment variable).

An example YAML file looks as follows:

bucket_cfg:
  prompt_bs_bucket_cfg: {min: 1, step: 32, max: 64}
  prompt_seq_bucket_cfg: {min: 128, step: 128, max: 1280}
  decode_bs_bucket_cfg: {min: 1, step: 32, max: 16}
  decode_block_bucket_cfg: {min: 128, step: 128, max: 128}
buckets:
  decode:
  - [16, 128]
  - [8, 128]
  - [4, 128]
  - [2, 128]
  - [1, 128]
  prefill:
  - [64, 128]
  - [4, 1280]
  - [4, 1152]
  - [1, 1152]
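
Such a file can be parsed with pyyaml. The following is a minimal, purely illustrative sketch (the helper name and return shape are hypothetical, not the PR's actual loader); note that both top-level sections are optional, as described below:

import yaml

def load_bucket_cfg(path):
    # Parse a calibration file; top-level keys mirror the example above.
    with open(path) as f:
        data = yaml.safe_load(f)
    cfg = data.get('bucket_cfg', {})    # min/step/max ranges (may be absent)
    buckets = data.get('buckets', {})   # explicit bucket lists (may be absent)
    prefill = [tuple(b) for b in buckets.get('prefill', [])]
    decode = [tuple(b) for b in buckets.get('decode', [])]
    return cfg, prefill, decode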

Optionally, the user can also emit a CSV file with the buckets (useful for data analysis using external tools):

phase,batch_size,seq_or_block
prefill,64,128
prefill,4,1280
prefill,4,1152
prefill,1,1152
decode,16,128
decode,8,128
decode,4,128
decode,2,128
decode,1,128

In CSV mode there is no way to dump the bucket_cfg data.
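
For illustration, a CSV in this format can be produced from the collected buckets with pandas; the helper below is a hypothetical sketch, not the PR's actual serializer:

import pandas as pd

def dump_buckets_csv(prefill_buckets, decode_buckets, path):
    # One row per bucket, matching the header shown above.
    rows = ([('prefill', bs, seq) for bs, seq in prefill_buckets] +
            [('decode', bs, blk) for bs, blk in decode_buckets])
    df = pd.DataFrame(rows, columns=['phase', 'batch_size', 'seq_or_block'])
    df.to_csv(path, index=False)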

  • The user can manually edit the file and change the bucket settings as needed.
  • (YAML only) Ranges of bucket parameters (min/max) in the generated YAML file are updated to include the used buckets - e.g. if a given workload uses prefills with a sequence length of 1280 and the default maximum prefill sequence length is 1024, the maximum will be extended to 1280.
  • (YAML only) The user can also remove the "buckets" section and provide only the bucket min/step/max settings, in which case buckets will be generated at runtime from the provided settings (see the example after this list) - this can serve as an alternative to providing the VLLM_{phase}_{dim}_BUCKET_{param} environment variables.
  • (YAML only) The user can also remove the "bucket_cfg" section and provide only the list of buckets. In that case, buckets might be out of range w.r.t. the bucket settings.
  • (YAML only) VLLM_{phase}_{dim}_BUCKET_{param} environment variables override the values provided in the YAML file.
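
For example, a stripped-down file that keeps only the ranges (so that buckets are generated at runtime from these settings) could look as follows:

bucket_cfg:
  prompt_bs_bucket_cfg: {min: 1, step: 32, max: 64}
  prompt_seq_bucket_cfg: {min: 128, step: 128, max: 1280}
  decode_bs_bucket_cfg: {min: 1, step: 32, max: 16}
  decode_block_bucket_cfg: {min: 128, step: 128, max: 128}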

Usage

  • If VLLM_HPU_CALIBRATE_BUCKETS is true or 1 and VLLM_HPU_BUCKET_CFG is not provided, calibration will happen, and the calibration results will be saved to hpu-buckets-{vllm_instance_id}.yaml.
  • If VLLM_HPU_CALIBRATE_BUCKETS is true or 1 and VLLM_HPU_BUCKET_CFG is provided, calibration will happen, and the calibration results will be saved to the file path defined by VLLM_HPU_BUCKET_CFG. If the extension of VLLM_HPU_BUCKET_CFG is .csv (case-insensitive), the buckets will be saved in CSV format; if the extension is .yml or .yaml (case-insensitive), the buckets and their ranges will be saved in YAML format.
  • If VLLM_HPU_CALIBRATE_BUCKETS is not true or 1 and VLLM_HPU_BUCKET_CFG is provided, calibration will not happen, and bucket settings will be loaded from the file path defined by VLLM_HPU_BUCKET_CFG (both YAML and CSV are supported).
  • If VLLM_HPU_CALIBRATE_BUCKETS is not true or 1 and VLLM_HPU_BUCKET_CFG is not provided, calibration will not happen, and bucket generation will not be altered in any way (default behavior). The four modes are summarized in the shell snippet after this list.
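
In shell terms, these four modes correspond to invocations like the following (my-buckets.yaml is a placeholder path; full logs for each mode are shown in the examples below):

# Calibrate; results go to an auto-named hpu-buckets-{vllm_instance_id}.yaml
VLLM_HPU_CALIBRATE_BUCKETS=true vllm serve ...
# Calibrate; results go to the given file (format chosen by extension)
VLLM_HPU_CALIBRATE_BUCKETS=true VLLM_HPU_BUCKET_CFG=my-buckets.yaml vllm serve ...
# No calibration; bucket settings loaded from the given file
VLLM_HPU_BUCKET_CFG=my-buckets.yaml vllm serve ...
# Default behavior: no calibration, no config file
vllm serve ...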

Examples:

Calibration with unspecified output file

Input:

VLLM_HPU_CALIBRATE_BUCKETS=true vllm serve ...

Output log:

...
INFO 09-30 14:24:24 selector.py:147] Using HabanaAttention backend.
**INFO 09-30 14:24:24 habana_model_runner.py:575] Calibration results will be saved to hpu-buckets-vllm-instance-05d1c80d2f4541819d95b508117b130c.yaml**
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MAX=64 (default:64)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MAX=64 (default:64)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MAX=2048 (default:2048)
INFO 09-30 14:24:24 habana_model_runner.py:758] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 09-30 14:24:24 habana_model_runner.py:763] Decode bucket config (min, step, max_warmup) bs:[1, 32, 64], block:[128, 128, 2048]
...
INFO 09-30 14:24:35 habana_executor.py:87] # HPU blocks: 3184, # CPU blocks: 256
INFO 09-30 14:24:36 habana_worker.py:212] Initializing cache engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.984 GiB of host memory (541.3 GiB/1007 GiB used)
**INFO 09-30 14:24:36 habana_model_runner.py:1646] Skipping warmup...**
INFO 09-30 14:24:36 habana_executor.py:93] init_cache_engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.992 GiB of host memory (541.3 GiB/1007 GiB used)
...
**INFO 09-30 14:26:40 habana_model_runner.py:803] Bucket calibration settings saved to hpu-buckets-vllm-instance-05d1c80d2f4541819d95b508117b130c.yaml**

Calibration with specified output file

Input:

VLLM_HPU_CALIBRATE_BUCKETS=true VLLM_HPU_BUCKET_CFG=blabla.yml vllm serve ...

Output log:

...
INFO 09-30 14:03:30 selector.py:147] Using HabanaAttention backend.
**INFO 09-30 14:03:30 habana_model_runner.py:575] Calibration results will be saved to blabla.yml**
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MAX=64 (default:64)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MAX=64 (default:64)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MAX=2048 (default:2048)
INFO 09-30 14:03:30 habana_model_runner.py:758] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 09-30 14:03:30 habana_model_runner.py:763] Decode bucket config (min, step, max_warmup) bs:[1, 32, 64], block:[128, 128, 2048]
...
INFO 09-30 14:03:42 habana_executor.py:87] # HPU blocks: 3184, # CPU blocks: 256
INFO 09-30 14:03:42 habana_worker.py:212] Initializing cache engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.971 GiB of host memory (537.7 GiB/1007 GiB used)
**INFO 09-30 14:03:42 habana_model_runner.py:1646] Skipping warmup...**
INFO 09-30 14:03:42 habana_executor.py:93] init_cache_engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.971 GiB of host memory (537.7 GiB/1007 GiB used)
...
**INFO 09-30 14:06:29 habana_model_runner.py:803] Bucket calibration settings saved to blabla.yml**

Calibration YAML:

bucket_cfg:
  prompt_bs_bucket_cfg: {min: 1, step: 32, max: 64}
  prompt_seq_bucket_cfg: {min: 128, step: 128, max: 1280}
  decode_bs_bucket_cfg: {min: 1, step: 32, max: 128}
  decode_block_bucket_cfg: {min: 128, step: 128, max: 1408}
buckets:
  decode:
  - [128, 1408]
  - [128, 1280]
  - [128, 1152]
  - [128, 1024]
  - [96, 1024]
  - [96, 896]
  - [96, 768]
  - [64, 640]
  - [64, 512]
  - [64, 384]
  - [32, 384]
  - [32, 256]
  - [16, 256]
  - [16, 128]
  - [8, 128]
  - [4, 128]
  - [2, 128]
  - [1, 128]
  prefill:
  - [64, 128]
  - [4, 1280]
  - [4, 1152]
  - [2, 1280]
  - [2, 1152]
  - [1, 1280]
  - [1, 1152]

Loading calibration YAML

Input:

VLLM_HPU_BUCKET_CFG=blabla.yml vllm serve ...

Output log:

...
INFO 09-30 14:29:02 selector.py:147] Using HabanaAttention backend.
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MIN=1 (VLLM_HPU_BUCKET_CFG:1)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_STEP=32 (VLLM_HPU_BUCKET_CFG:32)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MAX=64 (VLLM_HPU_BUCKET_CFG:64)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MIN=1 (VLLM_HPU_BUCKET_CFG:1)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_STEP=32 (VLLM_HPU_BUCKET_CFG:32)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MAX=64 (VLLM_HPU_BUCKET_CFG:64)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (VLLM_HPU_BUCKET_CFG:128)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (VLLM_HPU_BUCKET_CFG:128)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (VLLM_HPU_BUCKET_CFG:1024)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (VLLM_HPU_BUCKET_CFG:128)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (VLLM_HPU_BUCKET_CFG:128)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MAX=2048 (VLLM_HPU_BUCKET_CFG:2048)**
INFO 09-30 14:29:02 habana_model_runner.py:758] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 09-30 14:29:02 habana_model_runner.py:763] Decode bucket config (min, step, max_warmup) bs:[1, 32, 64], block:[128, 128, 2048]
**INFO 09-30 14:29:02 habana_model_runner.py:769] Loaded 7 prompt buckets from file [bs, seq]: [(1, 1152), (1, 1280), (2, 1152), (2, 1280), (4, 1152), (4, 1280), (64, 128)]**
**INFO 09-30 14:29:02 habana_model_runner.py:775] Loaded 18 decode buckets from file [bs, block]: [(1, 128), (2, 128), (4, 128), (8, 128), (16, 128), (16, 256), (32, 256), (32, 384), (64, 384), (64, 512), (64, 640), (96, 768), (96, 896), (96, 1024), (128, 1024), (128, 1152), (128, 1280), (128, 1408)]**
...
INFO 09-30 14:29:13 habana_executor.py:87] # HPU blocks: 3184, # CPU blocks: 256
INFO 09-30 14:29:14 habana_worker.py:212] Initializing cache engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.982 GiB of host memory (550.4 GiB/1007 GiB used)
INFO 09-30 14:29:14 habana_model_runner.py:1665] Generated 7 prompt buckets [bs, seq]: [(1, 1152), (1, 1280), (2, 1152), (2, 1280), (4, 1152), (4, 1280), (64, 128)]
INFO 09-30 14:29:14 habana_model_runner.py:1670] Omitted 0 prompt buckets due to exceeded token budget (max_num_batched_tokens=4096)
INFO 09-30 14:29:14 habana_model_runner.py:1684] Generated 18 decode buckets [bs, total_blocks]: [(1, 128), (2, 128), (4, 128), (8, 128), (16, 128), (16, 256), (32, 256), (32, 384), (64, 384), (64, 512), (64, 640), (96, 768), (96, 896), (96, 1024), (128, 1024), (128, 1152), (128, 1280), (128, 1408)]
INFO 09-30 14:29:14 habana_model_runner.py:1568] [Warmup][Prompt][1/7] batch_size:1 seq_len:1152 free_mem:29.23 GiB
...
INFO 09-30 14:29:19 habana_model_runner.py:1568] [Warmup][Prompt][7/7] batch_size:64 seq_len:128 free_mem:29.23 GiB
INFO 09-30 14:29:19 habana_model_runner.py:1568] [Warmup][Decode][1/18] batch_size:1 num_blocks:128 free_mem:29.23 GiB
...
INFO 09-30 14:29:27 habana_model_runner.py:1568] [Warmup][Decode][18/18] batch_size:128 num_blocks:1408 free_mem:29.23 GiB
INFO 09-30 14:29:27 habana_model_runner.py:1741] Using 21.33 GiB/29.23 GiB of free device memory for HPUGraphs, 6.399 GiB for prompt and 14.93 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
INFO 09-30 14:29:27 habana_model_runner.py:1568] [Warmup][Graph/Prompt][1/7] batch_size:1 num_blocks:1152 free_mem:29.23 GiB
...
INFO 09-30 14:29:30 habana_model_runner.py:1568] [Warmup][Graph/Prompt][7/7] batch_size:64 num_blocks:128 free_mem:29.17 GiB
INFO 09-30 14:29:30 habana_model_runner.py:1568] [Warmup][Graph/Decode][1/18] batch_size:128 num_blocks:1024 free_mem:29.16 GiB
...
INFO 09-30 14:29:38 habana_model_runner.py:1568] [Warmup][Graph/Decode][18/18] batch_size:1 num_blocks:128 free_mem:29.15 GiB
INFO 09-30 14:29:38 habana_model_runner.py:1632] Graph/Prompt captured:7 (100.0%) used_mem:64.04 MiB buckets:[(1, 1152), (1, 1280), (2, 1152), (2, 1280), (4, 1152), (4, 1280), (64, 128)]
INFO 09-30 14:29:38 habana_model_runner.py:1632] Graph/Decode captured:18 (100.0%) used_mem:11.23 MiB buckets:[(1, 128), (2, 128), (4, 128), (8, 128), (16, 128), (16, 256), (32, 256), (32, 384), (64, 384), (64, 512), (64, 640), (96, 768), (96, 896), (96, 1024), (128, 1024), (128, 1152), (128, 1280), (128, 1408)]
INFO 09-30 14:29:38 habana_model_runner.py:1789] Warmup finished in 25 secs, allocated 75.27 MiB of device memory
...

Default behavior:

Input:

vllm serve ...

Output log:

INFO 09-30 14:31:52 habana_executor.py:87] # HPU blocks: 3184, # CPU blocks: 256
INFO 09-30 14:31:53 habana_worker.py:212] Initializing cache engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.978 GiB of host memory (538.9 GiB/1007 GiB used)
INFO 09-30 14:31:53 habana_model_runner.py:1665] Generated 31 prompt buckets [bs, seq]: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (8, 128), (8, 256), (8, 384), (8, 512), (16, 128), (16, 256), (32, 128)]
INFO 09-30 14:31:53 habana_model_runner.py:1670] Omitted 25 prompt buckets due to exceeded token budget (max_num_batched_tokens=4096)
INFO 09-30 14:31:53 habana_model_runner.py:1684] Generated 112 decode buckets [bs, total_blocks]: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (8, 1280), (8, 1408), (8, 1536), (8, 1664), (8, 1792), (8, 1920), (8, 2048), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (16, 1280), (16, 1408), (16, 1536), (16, 1664), (16, 1792), (16, 1920), (16, 2048), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (32, 1280), (32, 1408), (32, 1536), (32, 1664), (32, 1792), (32, 1920), (32, 2048), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (64, 1280), (64, 1408), (64, 1536), (64, 1664), (64, 1792), (64, 1920), (64, 2048)]
INFO 09-30 14:31:53 habana_model_runner.py:1568] [Warmup][Prompt][1/31] batch_size:4 seq_len:1024 free_mem:29.23 GiB
...
INFO 09-30 14:32:12 habana_model_runner.py:1568] [Warmup][Prompt][31/31] batch_size:1 seq_len:128 free_mem:29.23 GiB
INFO 09-30 14:32:12 habana_model_runner.py:1568] [Warmup][Decode][1/112] batch_size:64 num_blocks:2048 free_mem:29.23 GiB
...
INFO 09-30 14:32:56 habana_model_runner.py:1568] [Warmup][Decode][112/112] batch_size:1 num_blocks:128 free_mem:29.23 GiB
INFO 09-30 14:32:57 habana_model_runner.py:1741] Using 21.33 GiB/29.23 GiB of free device memory for HPUGraphs, 6.399 GiB for prompt and 14.93 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
INFO 09-30 14:32:57 habana_model_runner.py:1568] [Warmup][Graph/Prompt][1/31] batch_size:1 num_blocks:128 free_mem:29.23 GiB
...
INFO 09-30 14:33:09 habana_model_runner.py:1568] [Warmup][Graph/Prompt][31/31] batch_size:4 num_blocks:1024 free_mem:29.16 GiB
INFO 09-30 14:33:10 habana_model_runner.py:1568] [Warmup][Graph/Decode][1/112] batch_size:64 num_blocks:128 free_mem:29.14 GiB
...
INFO 09-30 14:33:57 habana_model_runner.py:1568] [Warmup][Graph/Decode][112/112] batch_size:1 num_blocks:2048 free_mem:29.1 GiB
INFO 09-30 14:33:57 habana_model_runner.py:1632] Graph/Prompt captured:31 (100.0%) used_mem:85.03 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (8, 128), (8, 256), (8, 384), (8, 512), (16, 128), (16, 256), (32, 128)]
INFO 09-30 14:33:57 habana_model_runner.py:1632] Graph/Decode captured:112 (100.0%) used_mem:45.65 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (8, 1280), (8, 1408), (8, 1536), (8, 1664), (8, 1792), (8, 1920), (8, 2048), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (16, 1280), (16, 1408), (16, 1536), (16, 1664), (16, 1792), (16, 1920), (16, 2048), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (32, 1280), (32, 1408), (32, 1536), (32, 1664), (32, 1792), (32, 1920), (32, 2048), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (64, 1280), (64, 1408), (64, 1536), (64, 1664), (64, 1792), (64, 1920), (64, 2048)]
INFO 09-30 14:33:57 habana_model_runner.py:1789] Warmup finished in 124 secs, allocated 130.7 MiB of device memory

Warmup time and the number of buckets have decreased drastically (25 s vs. 124 s, and 7 prompt + 18 decode buckets vs. 31 + 112 in this example), as only buckets used at least once by the given workload are kept, and the remaining ones are discarded.

@kzawora-intel marked this pull request as draft September 27, 2024 19:05
@kzawora-intel changed the title from "Draft: Add bucket calibration, allow reading/writing bucketing configs to file" to "Add bucket calibration, allow reading/writing bucketing configs to file" Sep 30, 2024
@kzawora-intel marked this pull request as ready for review September 30, 2024 11:36
import pandas as pd

def yaml_serializer(df, bucket_cfg_file):
    import yaml


Do we need to add pyyaml to requirements?

        return {'min': cfg[0], 'step': cfg[1], 'max': cfg[2]}

    data: Dict[str, Any] = {}  # type: ignore
    # data['buckets'] = df.to_dict(orient='records')


commented code

logger.warning("Configuration: (%s, %s, %s) was not warmed-up!",
phase, batch_size, seq_len)
if not self.calibrate_buckets:
logger.warning(


Do we need to divide the code into 3 lines? We now have wide displays and it does not make it more readable.

@kzawora-intel (Author) replied:
Do we need to divide code into 3 lines?

I wish we didn't, but format.sh made this into such an abomination.

@szutenberg left a comment:

Hi @kzawora-intel,

Could you please add documentation to README_GAUDI.md?
I don't understand the motivation behind this feature or how it can be used in production.
Do you plan to upstream it?

Let's have a look at your example:

  • (decode, [128, 1024]) => fine, it will be warmed up
  • (decode, [128, 1152]) => fine, it will be warmed up
  • (decode, [128, 896]) => it will not be warmed up

What will happen in the third case:

  • vllm will use the warmed-up (decode, [128, 1024])?
  • vllm will compile (decode, [128, 896])?

IMHO the second option would be non-intuitive and would make this feature unusable in production.

@kzawora-intel added the habana (Issues or PRs submitted by Habana Labs) label Nov 8, 2024

github-actions bot commented Feb 7, 2025

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the stale label Feb 7, 2025
github-actions bot commented:

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

github-actions bot closed this Mar 10, 2025