
Commit 022e3b5

Misc fixes for release scripts to make them easier to use
Summary:
* removed the requirement of setting the VLLM_DIR environment variable, since the benchmark is now a CLI command
* reordered the evals and the summarization of results to better match the order of the model card

Test Plan: local manual runs achieving the desired results

Reviewers:
Subscribers:
Tasks:
Tags:
1 parent 18dbe87 commit 022e3b5
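For context on the first summary bullet: with the benchmark exposed as a CLI command, the latency eval no longer needs a source checkout of vllm pointed to by `VLLM_DIR`. A minimal sketch of the kind of invocation this enables, assuming a recent vllm where the benchmark ships as the `vllm bench latency` subcommand (the flag names mirror `benchmarks/benchmark_latency.py` and may differ in your installed version; the model id is just an example):

```
# hypothetical direct invocation; eval.sh wraps this for you
MODEL=pytorch/Qwen3-8B-FP8
VLLM_DISABLE_COMPILE_CACHE=1 vllm bench latency --model $MODEL --input-len 256 --output-len 256 --batch-size 1
```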

7 files changed, +105 −102 lines changed


.github/scripts/torchao_model_releases/README.md

Lines changed: 57 additions & 14 deletions
@@ -1,8 +1,51 @@
-# Scripts for torchao model release and eval
+# Scripts for torchao Model Release and Eval
 
-Note: all commands below are run in directory: `.github/scripts/torchao_model_releases/`
+Note: all commands below should be run in directory: `.github/scripts/torchao_model_releases/`
 
-## Release
+## Frequently Used Commands
+### Release and Eval Scripts for New Model Releases
+```
+MODEL=Qwen/Qwen3-8B
+# Releasing all default quant options: FP8, INT4, INT8-INT4
+sh release.sh --model_id $MODEL --push_to_hub --populate_model_card_template
+
+# INT8-INT4 requires additional steps to export and run, so it's skipped from
+# the general eval here
+# Need to set QMODEL_PREFIX properly before running eval
+# QMODEL_PREFIX=pytorch/Qwen3-8B
+sh eval.sh --model_ids $MODEL "$QMODEL_PREFIX-FP8" "$QMODEL_PREFIX-INT4"
+
+# Some follow-up evals
+sh eval.sh --eval_type latency --batch_sizes 256 --model_ids "$QMODEL_PREFIX-FP8"
+sh eval.sh --eval_type quality --batch_sizes 256 --model_ids "$QMODEL_PREFIX-INT8-INT4"
+
+# Summarize all results
+sh summarize_results.sh --model_ids $MODEL "$QMODEL_PREFIX-FP8" "$QMODEL_PREFIX-INT4" "$QMODEL_PREFIX-INT8-INT4" "$QMODEL_PREFIX-AWQ-INT4"
+```
+
+### AWQ Release and Eval
+```
+MODEL=Qwen/Qwen3-8B
+TASK=mmlu_abstract_algebra
+python quantize_and_upload.py --model_id $MODEL --quant AWQ-INT4 --push_to_hub --task $TASK --calibration_limit 10 --populate_model_card_template
+sh eval.sh --model_ids $MODEL "$QMODEL_PREFIX-AWQ-INT4"
+```
+
+### Update Released Checkpoints in PyTorch
+Sometimes we may have to update the checkpoints under a different user name (organization) without changing the model card, e.g. for INT4:
+```
+MODEL=Qwen/Qwen3-8B
+sh release.sh --model_id $MODEL --quants INT4 --push_to_hub --push_to_user_id pytorch
+```
+
+Or for an AWQ checkpoint:
+```
+MODEL=Qwen/Qwen3-8B
+TASK=mmlu_abstract_algebra
+python quantize_and_upload.py --model_id $MODEL --quant AWQ-INT4 --task $TASK --calibration_limit 10 --push_to_hub --push_to_user_id pytorch
+```
+
+## Release Scripts
 ### default options
 By default, we release FP8, INT4, and INT8-INT4 checkpoints, with the model card pre-filled with template content that can be modified later, after we have eval results.
 
@@ -12,10 +55,10 @@ Examples:
 # the logged in user
 
 # release with default quant options (FP8, INT4, INT8-INT4)
-./release.sh --model_id Qwen/Qwen3-8B
+./release.sh --model_id Qwen/Qwen3-8B --push_to_hub
 
 # release a custom set of quant options
-./release.sh --model_id Qwen/Qwen3-8B --quants INT4 FP8
+./release.sh --model_id Qwen/Qwen3-8B --quants INT4 FP8 --push_to_hub
 ```
 
 Note: for the initial release, please include `--populate_model_card_template` to populate the model card template.
@@ -41,7 +84,7 @@ sh release.sh --model_id microsoft/Phi-4-mini-instruct --quants FP8 --push_to_hu
 
 This will update `pytorch/Phi-4-mini-instruct-FP8` without changing the model card.
 
-## Eval
+## Eval Scripts
 After we run the release script for a model, we can find the new models on the Hugging Face Hub page for the user (e.g. https://huggingface.co/torchao-testing). The models will have a model card filled in with template content, such as information about the model and eval instructions. There are a few things we need to fill in: 1. peak memory usage, 2. latency when running the model with vllm, and 3. quality measurement using lm-eval.
 
 ### Single Script
@@ -64,15 +107,15 @@ sh eval.sh --eval_type memory --model_ids Qwen/Qwen3-8B
 ```
 
 #### Latency Eval
-For latency eval, make sure vllm is cloned and installed from source,
-and `VLLM_DIR` should be set to the source directory of the cloned vllm repo.
+For latency eval, make sure vllm is installed.
+```
+uv pip install vllm
+```
+
+Or install vllm nightly:
 ```
-git clone https://github.com/vllm-project/vllm.git
-cd vllm
-VLLM_USE_PRECOMPILED=1 uv pip install --editable .
-export VLLM_DIR=path_to_vllm
+uv pip install vllm --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu126
 ```
-see https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation for more details.
 
 After the environment is set up, we can run eval:
 ```
@@ -82,7 +125,7 @@ sh eval.sh --eval_type latency --model_ids Qwen/Qwen3-8B --batch_sizes 1,256
 #### Model Quality Eval
 For model quality eval, we need to install lm-eval:
 ```
-pip install lm-eval
+uv pip install lm-eval
 ```
 After the environment is set up, we can run eval:
 ```
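The quality eval described in the README wraps lm-eval. For reference, a minimal standalone invocation of the lm-eval CLI of the kind the script presumably issues (the model id, task, and batch size here are placeholders; the exact arguments eval.sh passes may differ):

```
lm_eval --model hf --model_args pretrained=pytorch/Qwen3-8B-FP8 --tasks mmlu --batch_size 8
```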

.github/scripts/torchao_model_releases/eval.sh

Lines changed: 5 additions & 5 deletions
@@ -9,14 +9,14 @@ set -e
 source eval_env_checks.sh
 
 usage() {
-echo "Usage: $0 --eval_type <all|memory|latency|quality> --model_ids <model1> <model2> ... [--batch_sizes <batch_sizes>] [--tasks <tasks>]"
+echo "Usage: $0 --model_ids <model1> <model2> ... [--eval_type <all|memory|latency|quality>] [--batch_sizes <batch_sizes>] [--tasks <tasks>]"
 echo "Defaults:"
 echo " batch_sizes: 1 256"
 echo " tasks: mmlu"
 exit 1
 }
-EVAL_TYPE=""
 MODEL_ID_ARRAY=()
+EVAL_TYPE="all"
 # these will be parsed in the other scripts
 BATCH_SIZES="1 256" # Default for latency eval
 TASKS="mmlu" # Default for quality eval
@@ -64,8 +64,8 @@ while [[ $# -gt 0 ]]; do
 ;;
 esac
 done
-if [[ -z "$EVAL_TYPE" || ${#MODEL_ID_ARRAY[@]} -eq 0 ]]; then
-echo "Error: --eval_type and --model_ids are required"
+if [[ ${#MODEL_ID_ARRAY[@]} -eq 0 ]]; then
+echo "Error: --model_ids is required"
 usage
 fi
 
@@ -96,9 +96,9 @@ for MODEL_ID in "${MODEL_ID_ARRAY[@]}"; do
 run_quality "$MODEL_ID"
 ;;
 all)
+run_quality "$MODEL_ID"
 run_memory "$MODEL_ID"
 run_latency "$MODEL_ID"
-run_quality "$MODEL_ID"
 ;;
 *)
 echo "Unknown eval_type: $EVAL_TYPE"

.github/scripts/torchao_model_releases/eval_env_checks.sh

Lines changed: 1 addition & 6 deletions
@@ -12,13 +12,8 @@ check_torch() {
 }
 
 check_vllm() {
-# Check if VLLM_DIR is set
-if [ -z "$VLLM_DIR" ]; then
-echo "Error: VLLM_DIR environment variable is not set. Please set it before running this script."
-exit 1
-fi
 if ! pip show vllm > /dev/null 2>&1; then
-echo "Error: vllm package is NOT installed. please install from source: https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation" >&2
+echo "Error: vllm package is NOT installed. Please install it with 'pip install vllm'" >&2
 exit 1
 fi
 }
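A quick manual spot check of the simplified requirement before kicking off evals (eval.sh and eval_latency.sh already source this file and call `check_vllm` themselves, so this is only for convenience):

```
source eval_env_checks.sh
check_vllm   # prints an error and exits if vllm is not pip-installed
```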

.github/scripts/torchao_model_releases/eval_latency.sh

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ source eval_env_checks.sh
 check_vllm
 
 MODEL_ID_ARRAY=()
-BATCH_SIZE_ARRAY=(1 256) # default can be overwritten by user input
+BATCH_SIZE_ARRAY=(1) # default can be overwritten by user input
 INPUT_LEN="256" # default input length
 OUTPUT_LEN="256" # default output length
 # Parse arguments
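Note that eval_latency.sh's own default is now a single batch size of 1 (eval.sh still passes `1 256` by default). To request specific batch sizes explicitly, use the form documented in the README:

```
sh eval.sh --eval_type latency --model_ids Qwen/Qwen3-8B --batch_sizes 1,256
```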

.github/scripts/torchao_model_releases/eval_peak_memory_usage.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 
 def eval_peak_memory_usage(model_id: str):
     model = AutoModelForCausalLM.from_pretrained(
-        model_id, device_map="auto", torch_dtype=torch.bfloat16
+        model_id, device_map="cuda:0", torch_dtype=torch.bfloat16
     )
     tokenizer = AutoTokenizer.from_pretrained(model_id)
 
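The only change here is loading the model onto a single device (`cuda:0`), presumably so the reported peak reflects one GPU rather than being spread across devices by `device_map="auto"`. To regenerate the memory numbers after this change (command pattern from the README; the second model id is an example):

```
sh eval.sh --eval_type memory --model_ids Qwen/Qwen3-8B pytorch/Qwen3-8B-FP8
```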

.github/scripts/torchao_model_releases/quantize_and_upload.py

Lines changed: 20 additions & 57 deletions
@@ -36,7 +36,7 @@ def _get_username():
 
 def _untie_weights_and_save_locally(model_id):
     untied_model = AutoModelForCausalLM.from_pretrained(
-        model_id, torch_dtype="auto", device_map="auto"
+        model_id, torch_dtype="auto", device_map="cuda:0"
     )
 
     tokenizer = AutoTokenizer.from_pretrained(model_id)
@@ -209,15 +209,15 @@ def _untie_weights_and_save_locally(model_id):
 from torchao.quantization import Int4WeightOnlyConfig
 quant_config = Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")
 quantization_config = TorchAoConfig(quant_type=quant_config)
-quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
+quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="cuda:0", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 """
 
 _fp8_quant_code = """
 from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
 quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
 quantization_config = TorchAoConfig(quant_type=quant_config)
-quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
+quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="cuda:0", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 """
 
@@ -238,7 +238,7 @@ def _untie_weights_and_save_locally(model_id):
 )
 quant_config = ModuleFqnToConfig({{"_default": linear_config, "model.embed_tokens": embedding_config}})
 quantization_config = TorchAoConfig(quant_type=quant_config, include_input_output_embeddings=True, modules_to_not_convert=[])
-quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
+quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="cuda:0", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 """
 
@@ -250,12 +250,12 @@ def _untie_weights_and_save_locally(model_id):
 from torchao._models._eval import TransformerEvalWrapper
 model = AutoModelForCausalLM.from_pretrained(
     model_to_quantize,
-    device_map="auto",
+    device_map="cuda:0",
     torch_dtype=torch.bfloat16,
 )
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-base_config = Int4WeightOnlyConfig(group_size=128)
+base_config = Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")
 quant_config = AWQConfig(base_config, step="prepare")
 quantize_(
     model,
@@ -333,7 +333,7 @@ def _untie_weights_and_save_locally(model_id):
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
     torch_dtype="auto",
-    device_map="auto"
+    device_map="cuda:0"
 )
 
 # prepare the model input
@@ -394,7 +394,7 @@ def _untie_weights_and_save_locally(model_id):
 
 # use "{base_model}" or "{quantized_model}"
 model_id = "{quantized_model}"
-quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
+quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", torch_dtype=torch.bfloat16)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 torch.cuda.reset_peak_memory_stats()
@@ -438,7 +438,8 @@ def _untie_weights_and_save_locally(model_id):
 | Benchmark (Latency)              |                |                          |
 |----------------------------------|----------------|--------------------------|
 |                                  | {base_model}   | {quantized_model}        |
-| latency (batch_size=1)           | ?s             | ?s (?x speedup)          |
+| latency (batch_size=1)           | ?s             | ?s (?x speedup)          |
+| latency (batch_size=256)         | ?s             | ?s (?x speedup)          |
 
 <details>
 <summary> Reproduce Model Performance Results </summary>
@@ -470,48 +471,6 @@ def _untie_weights_and_save_locally(model_id):
 export MODEL={quantized_model}
 VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
 ```
-
-## benchmark_serving
-
-We benchmarked the throughput in a serving environment.
-
-Download sharegpt dataset:
-
-```Shell
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-```
-
-
-
-Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
-
-Note: you can change the number of prompts to be benchmarked with `--num-prompts` argument for `benchmark_serving` script.
-
-### baseline
-Server:
-```Shell
-export MODEL={base_model}
-vllm serve $MODEL --tokenizer $MODEL -O3
-```
-
-Client:
-```Shell
-export MODEL={base_model}
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer $MODEL --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
-```
-
-### {quant}
-Server:
-```Shell
-export MODEL={quantized_model}
-VLLM_DISABLE_COMPILE_CACHE=1 vllm serve $MODEL --tokenizer $MODEL -O3 --pt-load-map-location cuda:0
-
-```
-Client:
-```Shell
-export MODEL={quantized_model}
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer $MODEL --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
-```
 </details>
 """
 
@@ -538,7 +497,7 @@ def _untie_weights_and_save_locally(model_id):
 import torch
 
 model_id = "{base_model}"
-untied_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+untied_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda:0")
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 print(untied_model)
@@ -592,7 +551,7 @@ def _untie_weights_and_save_locally(model_id):
 python -m executorch.examples.models.qwen3.convert_weights $(hf download {quantized_model}) pytorch_model_converted.bin
 ```
 
-Once we have the checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024 to the XNNPACK backend as follows. 
+Once we have the checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024 to the XNNPACK backend as follows.
 
 [TODO: fix config path in note where necessary]
 (Note: ExecuTorch LLM export script requires config.json have certain key names. The correct config to use for the LLM export script is located at examples/models/qwen3/config/4b_config.json within the ExecuTorch repo.)
@@ -611,7 +570,7 @@ def _untie_weights_and_save_locally(model_id):
 --max_context_length 1024 \
 --max_seq_length 1024 \
 --dtype fp32 \
---metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'
+--metadata '{{"get_bos_id":199999, "get_eos_ids":[200020,199999]}}'
 ```
 
 After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).
@@ -672,12 +631,16 @@ def quantize_and_upload(
 assert quant == "AWQ-INT4", "Only support AWQ-INT4 for now"
 model = AutoModelForCausalLM.from_pretrained(
     model_to_quantize,
-    device_map="auto",
+    device_map="cuda:0",
     torch_dtype=torch.bfloat16,
 )
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-base_config = Int4WeightOnlyConfig(group_size=128)
+base_config = Int4WeightOnlyConfig(
+    group_size=128,
+    int4_packing_format="tile_packed_to_4d",
+    int4_choose_qparams_algorithm="hqq",
+)
 quant_config = AWQConfig(base_config, step="prepare")
 quantize_(
     model,
@@ -712,7 +675,7 @@ def quantize_and_upload(
 )
 quantized_model = AutoModelForCausalLM.from_pretrained(
     model_to_quantize,
-    device_map="auto",
+    device_map="cuda:0",
     torch_dtype=torch.bfloat16,
     quantization_config=quantization_config,
 )
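These hunks switch model loading to a single CUDA device, drop the benchmark_serving section from the model card template, and make the AWQ path reuse the same Int4WeightOnlyConfig settings (`tile_packed_to_4d` packing, HQQ qparams) as the plain INT4 release. An end-to-end AWQ release using this script, as documented in the README above:

```
MODEL=Qwen/Qwen3-8B
TASK=mmlu_abstract_algebra
python quantize_and_upload.py --model_id $MODEL --quant AWQ-INT4 --push_to_hub --task $TASK --calibration_limit 10 --populate_model_card_template
```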
