
@kzawora-intel commented Sep 27, 2024

This PR allows the user to calibrate bucket usage and to load a bucket configuration from a file.

Design

When the VLLM_HPU_CALIBRATE_BUCKETS=true environment variable is set, warmup is disabled, and upon destruction the server stores the bucket configuration and the buckets that were actually used in a YAML file (whose path can optionally be set via the VLLM_HPU_BUCKET_CFG environment variable).

An example YAML file looks as follows:

bucket_cfg:
  prompt_bs_bucket_cfg: {min: 1, step: 32, max: 64}
  prompt_seq_bucket_cfg: {min: 128, step: 128, max: 1280}
  decode_bs_bucket_cfg: {min: 1, step: 32, max: 16}
  decode_block_bucket_cfg: {min: 128, step: 128, max: 128}
buckets:
  decode:
  - [16, 128]
  - [8, 128]
  - [4, 128]
  - [2, 128]
  - [1, 128]
  prefill:
  - [64, 128]
  - [4, 1280]
  - [4, 1152]
  - [1, 1152]
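
Such a file can be parsed with pyyaml. The following is a minimal, purely illustrative sketch (the helper name and return shape are hypothetical, not the PR's actual loader); note that both top-level sections are optional, as described below:

import yaml

def load_bucket_cfg(path):
    # Parse a calibration file; top-level keys mirror the example above.
    with open(path) as f:
        data = yaml.safe_load(f)
    cfg = data.get('bucket_cfg', {})    # min/step/max ranges (may be absent)
    buckets = data.get('buckets', {})   # explicit bucket lists (may be absent)
    prefill = [tuple(b) for b in buckets.get('prefill', [])]
    decode = [tuple(b) for b in buckets.get('decode', [])]
    return cfg, prefill, decode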

Optionally, the user can also emit a CSV file with the buckets (useful for data analysis using external tools):

phase,batch_size,seq_or_block
prefill,64,128
prefill,4,1280
prefill,4,1152
prefill,1,1152
decode,16,128
decode,8,128
decode,4,128
decode,2,128
decode,1,128

In CSV mode there is no way to dump the bucket_cfg data.
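
For illustration, a CSV in this format can be produced from the collected buckets with pandas; the helper below is a hypothetical sketch, not the PR's actual serializer:

import pandas as pd

def dump_buckets_csv(prefill_buckets, decode_buckets, path):
    # One row per bucket, matching the header shown above.
    rows = ([('prefill', bs, seq) for bs, seq in prefill_buckets] +
            [('decode', bs, blk) for bs, blk in decode_buckets])
    df = pd.DataFrame(rows, columns=['phase', 'batch_size', 'seq_or_block'])
    df.to_csv(path, index=False)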

  • The user can manually edit the file and change the bucket settings as needed.
  • (YAML only) Ranges of bucket parameters (min/max) in the generated YAML file are updated to include the used buckets - e.g. if a given workload uses prefills with a sequence length of 1280 and the default maximum prefill sequence length is 1024, the maximum will be extended to 1280.
  • (YAML only) The user can also remove the "buckets" section and provide only the bucket min/step/max settings, in which case buckets will be generated at runtime from the provided settings (see the example after this list) - this can serve as an alternative to providing the VLLM_{phase}_{dim}_BUCKET_{param} environment variables.
  • (YAML only) The user can also remove the "bucket_cfg" section and provide only the list of buckets. In that case, buckets might be out of range w.r.t. the bucket settings.
  • (YAML only) VLLM_{phase}_{dim}_BUCKET_{param} environment variables override the values provided in the YAML file.
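
For example, a stripped-down file that keeps only the ranges (so that buckets are generated at runtime from these settings) could look as follows:

bucket_cfg:
  prompt_bs_bucket_cfg: {min: 1, step: 32, max: 64}
  prompt_seq_bucket_cfg: {min: 128, step: 128, max: 1280}
  decode_bs_bucket_cfg: {min: 1, step: 32, max: 16}
  decode_block_bucket_cfg: {min: 128, step: 128, max: 128}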

Usage

  • If VLLM_HPU_CALIBRATE_BUCKETS is true or 1 and VLLM_HPU_BUCKET_CFG is not provided, calibration will happen, and the calibration results will be saved to hpu-buckets-{vllm_instance_id}.yaml.
  • If VLLM_HPU_CALIBRATE_BUCKETS is true or 1 and VLLM_HPU_BUCKET_CFG is provided, calibration will happen, and the calibration results will be saved to the file path defined by VLLM_HPU_BUCKET_CFG. If the extension of VLLM_HPU_BUCKET_CFG is .csv (case-insensitive), the buckets will be saved in CSV format; if the extension is .yml or .yaml (case-insensitive), the buckets and their ranges will be saved in YAML format.
  • If VLLM_HPU_CALIBRATE_BUCKETS is not true or 1 and VLLM_HPU_BUCKET_CFG is provided, calibration will not happen, and bucket settings will be loaded from the file path defined by VLLM_HPU_BUCKET_CFG (both YAML and CSV are supported).
  • If VLLM_HPU_CALIBRATE_BUCKETS is not true or 1 and VLLM_HPU_BUCKET_CFG is not provided, calibration will not happen, and bucket generation will not be altered in any way (default behavior). The four modes are summarized in the shell snippet after this list.
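
In shell terms, these four modes correspond to invocations like the following (my-buckets.yaml is a placeholder path; full logs for each mode are shown in the examples below):

# Calibrate; results go to an auto-named hpu-buckets-{vllm_instance_id}.yaml
VLLM_HPU_CALIBRATE_BUCKETS=true vllm serve ...
# Calibrate; results go to the given file (format chosen by extension)
VLLM_HPU_CALIBRATE_BUCKETS=true VLLM_HPU_BUCKET_CFG=my-buckets.yaml vllm serve ...
# No calibration; bucket settings loaded from the given file
VLLM_HPU_BUCKET_CFG=my-buckets.yaml vllm serve ...
# Default behavior: no calibration, no config file
vllm serve ...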

Examples:

Calibration with unspecified output file

Input:

VLLM_HPU_CALIBRATE_BUCKETS=true vllm serve ...

Output log:

...
INFO 09-30 14:24:24 selector.py:147] Using HabanaAttention backend.
**INFO 09-30 14:24:24 habana_model_runner.py:575] Calibration results will be saved to hpu-buckets-vllm-instance-05d1c80d2f4541819d95b508117b130c.yaml**
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MAX=64 (default:64)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MAX=64 (default:64)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
INFO 09-30 14:24:24 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MAX=2048 (default:2048)
INFO 09-30 14:24:24 habana_model_runner.py:758] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 09-30 14:24:24 habana_model_runner.py:763] Decode bucket config (min, step, max_warmup) bs:[1, 32, 64], block:[128, 128, 2048]
...
INFO 09-30 14:24:35 habana_executor.py:87] # HPU blocks: 3184, # CPU blocks: 256
INFO 09-30 14:24:36 habana_worker.py:212] Initializing cache engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.984 GiB of host memory (541.3 GiB/1007 GiB used)
**INFO 09-30 14:24:36 habana_model_runner.py:1646] Skipping warmup...**
INFO 09-30 14:24:36 habana_executor.py:93] init_cache_engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.992 GiB of host memory (541.3 GiB/1007 GiB used)
...
**INFO 09-30 14:26:40 habana_model_runner.py:803] Bucket calibration settings saved to hpu-buckets-vllm-instance-05d1c80d2f4541819d95b508117b130c.yaml**

Calibration with specified output file

Input:

VLLM_HPU_CALIBRATE_BUCKETS=true VLLM_HPU_BUCKET_CFG=blabla.yml vllm serve ...

Output log:

...
INFO 09-30 14:03:30 selector.py:147] Using HabanaAttention backend.
**INFO 09-30 14:03:30 habana_model_runner.py:575] Calibration results will be saved to blabla.yml**
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MAX=64 (default:64)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MAX=64 (default:64)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
INFO 09-30 14:03:30 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MAX=2048 (default:2048)
INFO 09-30 14:03:30 habana_model_runner.py:758] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 09-30 14:03:30 habana_model_runner.py:763] Decode bucket config (min, step, max_warmup) bs:[1, 32, 64], block:[128, 128, 2048]
...
INFO 09-30 14:03:42 habana_executor.py:87] # HPU blocks: 3184, # CPU blocks: 256
INFO 09-30 14:03:42 habana_worker.py:212] Initializing cache engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.971 GiB of host memory (537.7 GiB/1007 GiB used)
**INFO 09-30 14:03:42 habana_model_runner.py:1646] Skipping warmup...**
INFO 09-30 14:03:42 habana_executor.py:93] init_cache_engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.971 GiB of host memory (537.7 GiB/1007 GiB used)
...
**INFO 09-30 14:06:29 habana_model_runner.py:803] Bucket calibration settings saved to blabla.yml**

Calibration YAML:

bucket_cfg:
  prompt_bs_bucket_cfg: {min: 1, step: 32, max: 64}
  prompt_seq_bucket_cfg: {min: 128, step: 128, max: 1280}
  decode_bs_bucket_cfg: {min: 1, step: 32, max: 128}
  decode_block_bucket_cfg: {min: 128, step: 128, max: 1408}
buckets:
  decode:
  - [128, 1408]
  - [128, 1280]
  - [128, 1152]
  - [128, 1024]
  - [96, 1024]
  - [96, 896]
  - [96, 768]
  - [64, 640]
  - [64, 512]
  - [64, 384]
  - [32, 384]
  - [32, 256]
  - [16, 256]
  - [16, 128]
  - [8, 128]
  - [4, 128]
  - [2, 128]
  - [1, 128]
  prefill:
  - [64, 128]
  - [4, 1280]
  - [4, 1152]
  - [2, 1280]
  - [2, 1152]
  - [1, 1280]
  - [1, 1152]

Loading calibration YAML

Input:

VLLM_HPU_BUCKET_CFG=blabla.yml vllm serve ...

Output log:

...
INFO 09-30 14:29:02 selector.py:147] Using HabanaAttention backend.
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MIN=1 (VLLM_HPU_BUCKET_CFG:1)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_STEP=32 (VLLM_HPU_BUCKET_CFG:32)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_BS_BUCKET_MAX=64 (VLLM_HPU_BUCKET_CFG:64)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MIN=1 (VLLM_HPU_BUCKET_CFG:1)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_STEP=32 (VLLM_HPU_BUCKET_CFG:32)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BS_BUCKET_MAX=64 (VLLM_HPU_BUCKET_CFG:64)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (VLLM_HPU_BUCKET_CFG:128)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (VLLM_HPU_BUCKET_CFG:128)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (VLLM_HPU_BUCKET_CFG:1024)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (VLLM_HPU_BUCKET_CFG:128)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (VLLM_HPU_BUCKET_CFG:128)**
**INFO 09-30 14:29:02 habana_model_runner.py:99] VLLM_DECODE_BLOCK_BUCKET_MAX=2048 (VLLM_HPU_BUCKET_CFG:2048)**
INFO 09-30 14:29:02 habana_model_runner.py:758] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 09-30 14:29:02 habana_model_runner.py:763] Decode bucket config (min, step, max_warmup) bs:[1, 32, 64], block:[128, 128, 2048]
**INFO 09-30 14:29:02 habana_model_runner.py:769] Loaded 7 prompt buckets from file [bs, seq]: [(1, 1152), (1, 1280), (2, 1152), (2, 1280), (4, 1152), (4, 1280), (64, 128)]**
**INFO 09-30 14:29:02 habana_model_runner.py:775] Loaded 18 decode buckets from file [bs, block]: [(1, 128), (2, 128), (4, 128), (8, 128), (16, 128), (16, 256), (32, 256), (32, 384), (64, 384), (64, 512), (64, 640), (96, 768), (96, 896), (96, 1024), (128, 1024), (128, 1152), (128, 1280), (128, 1408)]**
...
INFO 09-30 14:29:13 habana_executor.py:87] # HPU blocks: 3184, # CPU blocks: 256
INFO 09-30 14:29:14 habana_worker.py:212] Initializing cache engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.982 GiB of host memory (550.4 GiB/1007 GiB used)
INFO 09-30 14:29:14 habana_model_runner.py:1665] Generated 7 prompt buckets [bs, seq]: [(1, 1152), (1, 1280), (2, 1152), (2, 1280), (4, 1152), (4, 1280), (64, 128)]
INFO 09-30 14:29:14 habana_model_runner.py:1670] Omitted 0 prompt buckets due to exceeded token budget (max_num_batched_tokens=4096)
INFO 09-30 14:29:14 habana_model_runner.py:1684] Generated 18 decode buckets [bs, total_blocks]: [(1, 128), (2, 128), (4, 128), (8, 128), (16, 128), (16, 256), (32, 256), (32, 384), (64, 384), (64, 512), (64, 640), (96, 768), (96, 896), (96, 1024), (128, 1024), (128, 1152), (128, 1280), (128, 1408)]
INFO 09-30 14:29:14 habana_model_runner.py:1568] [Warmup][Prompt][1/7] batch_size:1 seq_len:1152 free_mem:29.23 GiB
...
INFO 09-30 14:29:19 habana_model_runner.py:1568] [Warmup][Prompt][7/7] batch_size:64 seq_len:128 free_mem:29.23 GiB
INFO 09-30 14:29:19 habana_model_runner.py:1568] [Warmup][Decode][1/18] batch_size:1 num_blocks:128 free_mem:29.23 GiB
...
INFO 09-30 14:29:27 habana_model_runner.py:1568] [Warmup][Decode][18/18] batch_size:128 num_blocks:1408 free_mem:29.23 GiB
INFO 09-30 14:29:27 habana_model_runner.py:1741] Using 21.33 GiB/29.23 GiB of free device memory for HPUGraphs, 6.399 GiB for prompt and 14.93 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
INFO 09-30 14:29:27 habana_model_runner.py:1568] [Warmup][Graph/Prompt][1/7] batch_size:1 num_blocks:1152 free_mem:29.23 GiB
...
INFO 09-30 14:29:30 habana_model_runner.py:1568] [Warmup][Graph/Prompt][7/7] batch_size:64 num_blocks:128 free_mem:29.17 GiB
INFO 09-30 14:29:30 habana_model_runner.py:1568] [Warmup][Graph/Decode][1/18] batch_size:128 num_blocks:1024 free_mem:29.16 GiB
...
INFO 09-30 14:29:38 habana_model_runner.py:1568] [Warmup][Graph/Decode][18/18] batch_size:1 num_blocks:128 free_mem:29.15 GiB
INFO 09-30 14:29:38 habana_model_runner.py:1632] Graph/Prompt captured:7 (100.0%) used_mem:64.04 MiB buckets:[(1, 1152), (1, 1280), (2, 1152), (2, 1280), (4, 1152), (4, 1280), (64, 128)]
INFO 09-30 14:29:38 habana_model_runner.py:1632] Graph/Decode captured:18 (100.0%) used_mem:11.23 MiB buckets:[(1, 128), (2, 128), (4, 128), (8, 128), (16, 128), (16, 256), (32, 256), (32, 384), (64, 384), (64, 512), (64, 640), (96, 768), (96, 896), (96, 1024), (128, 1024), (128, 1152), (128, 1280), (128, 1408)]
INFO 09-30 14:29:38 habana_model_runner.py:1789] Warmup finished in 25 secs, allocated 75.27 MiB of device memory
...

Default behavior:

Input:

vllm serve ...

Output log:

INFO 09-30 14:31:52 habana_executor.py:87] # HPU blocks: 3184, # CPU blocks: 256
INFO 09-30 14:31:53 habana_worker.py:212] Initializing cache engine took 49.75 GiB of device memory (65.4 GiB/94.62 GiB used) and 1.978 GiB of host memory (538.9 GiB/1007 GiB used)
INFO 09-30 14:31:53 habana_model_runner.py:1665] Generated 31 prompt buckets [bs, seq]: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (8, 128), (8, 256), (8, 384), (8, 512), (16, 128), (16, 256), (32, 128)]
INFO 09-30 14:31:53 habana_model_runner.py:1670] Omitted 25 prompt buckets due to exceeded token budget (max_num_batched_tokens=4096)
INFO 09-30 14:31:53 habana_model_runner.py:1684] Generated 112 decode buckets [bs, total_blocks]: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (8, 1280), (8, 1408), (8, 1536), (8, 1664), (8, 1792), (8, 1920), (8, 2048), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (16, 1280), (16, 1408), (16, 1536), (16, 1664), (16, 1792), (16, 1920), (16, 2048), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (32, 1280), (32, 1408), (32, 1536), (32, 1664), (32, 1792), (32, 1920), (32, 2048), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (64, 1280), (64, 1408), (64, 1536), (64, 1664), (64, 1792), (64, 1920), (64, 2048)]
INFO 09-30 14:31:53 habana_model_runner.py:1568] [Warmup][Prompt][1/31] batch_size:4 seq_len:1024 free_mem:29.23 GiB
...
INFO 09-30 14:32:12 habana_model_runner.py:1568] [Warmup][Prompt][31/31] batch_size:1 seq_len:128 free_mem:29.23 GiB
INFO 09-30 14:32:12 habana_model_runner.py:1568] [Warmup][Decode][1/112] batch_size:64 num_blocks:2048 free_mem:29.23 GiB
...
INFO 09-30 14:32:56 habana_model_runner.py:1568] [Warmup][Decode][112/112] batch_size:1 num_blocks:128 free_mem:29.23 GiB
INFO 09-30 14:32:57 habana_model_runner.py:1741] Using 21.33 GiB/29.23 GiB of free device memory for HPUGraphs, 6.399 GiB for prompt and 14.93 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
INFO 09-30 14:32:57 habana_model_runner.py:1568] [Warmup][Graph/Prompt][1/31] batch_size:1 num_blocks:128 free_mem:29.23 GiB
...
INFO 09-30 14:33:09 habana_model_runner.py:1568] [Warmup][Graph/Prompt][31/31] batch_size:4 num_blocks:1024 free_mem:29.16 GiB
INFO 09-30 14:33:10 habana_model_runner.py:1568] [Warmup][Graph/Decode][1/112] batch_size:64 num_blocks:128 free_mem:29.14 GiB
...
INFO 09-30 14:33:57 habana_model_runner.py:1568] [Warmup][Graph/Decode][112/112] batch_size:1 num_blocks:2048 free_mem:29.1 GiB
INFO 09-30 14:33:57 habana_model_runner.py:1632] Graph/Prompt captured:31 (100.0%) used_mem:85.03 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (8, 128), (8, 256), (8, 384), (8, 512), (16, 128), (16, 256), (32, 128)]
INFO 09-30 14:33:57 habana_model_runner.py:1632] Graph/Decode captured:112 (100.0%) used_mem:45.65 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (8, 1280), (8, 1408), (8, 1536), (8, 1664), (8, 1792), (8, 1920), (8, 2048), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (16, 1280), (16, 1408), (16, 1536), (16, 1664), (16, 1792), (16, 1920), (16, 2048), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (32, 1280), (32, 1408), (32, 1536), (32, 1664), (32, 1792), (32, 1920), (32, 2048), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (64, 1280), (64, 1408), (64, 1536), (64, 1664), (64, 1792), (64, 1920), (64, 2048)]
INFO 09-30 14:33:57 habana_model_runner.py:1789] Warmup finished in 124 secs, allocated 130.7 MiB of device memory

Warmup time and the number of buckets have decreased drastically (25 s vs. 124 s, and 7 prompt + 18 decode buckets vs. 31 + 112 in this example), as only buckets used at least once by the given workload are kept, and the remaining ones are discarded.

@kzawora-intel marked this pull request as draft September 27, 2024 19:05
@kzawora-intel changed the title from "Draft: Add bucket calibration, allow reading/writing bucketing configs to file" to "Add bucket calibration, allow reading/writing bucketing configs to file" Sep 30, 2024
@kzawora-intel marked this pull request as ready for review September 30, 2024 11:36
import pandas as pd

def yaml_serializer(df, bucket_cfg_file):
    import yaml


Do we need to add pyyaml to requirements?

        return {'min': cfg[0], 'step': cfg[1], 'max': cfg[2]}

    data: Dict[str, Any] = {}  # type: ignore
    # data['buckets'] = df.to_dict(orient='records')


commented code

logger.warning("Configuration: (%s, %s, %s) was not warmed-up!",
phase, batch_size, seq_len)
if not self.calibrate_buckets:
logger.warning(


Do we need to divide the code into 3 lines? We now have wide displays and it does not make it more readable.

@kzawora-intel (Author) replied:
Do we need to divide code into 3 lines?

I wish we didn't, but format.sh made this into such an abomination.

@szutenberg left a comment:

Hi @kzawora-intel,

Could you please add documentation to README_GAUDI.md?
I don't understand the motivation behind this feature or how it can be used in production.
Do you plan to upstream it?

Let's have a look at your example:

  • (decode, [128, 1024]) => fine, it will be warmed up
  • (decode, [128, 1152]) => fine, it will be warmed up
  • (decode, [128, 896]) => it will not be warmed up

What will happen in the third case:

  • vllm will use the warmed-up (decode, [128, 1024])?
  • vllm will compile (decode, [128, 896])?

IMHO the second option would be non-intuitive and would make this feature unusable in production.

@kzawora-intel added the habana (Issues or PRs submitted by Habana Labs) label Nov 8, 2024

github-actions bot commented Feb 7, 2025

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the stale label Feb 7, 2025
github-actions bot commented:

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

github-actions bot closed this Mar 10, 2025