
Commit 022e3b5

Misc fixes for release scripts to make them easier to use
Summary:
* removed the requirement of setting the VLLM_DIR environment variable, since the benchmark is now a CLI command
* reordered the evals and the summarization of results to better match the order of the model card

Test Plan: local manual runs achieving the desired results

Reviewers:
Subscribers:
Tasks:
Tags:
1 parent 18dbe87 commit 022e3b5
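For context on the first summary bullet: with the benchmark exposed as a CLI command, the latency eval no longer needs a source checkout of vllm pointed to by `VLLM_DIR`. A minimal sketch of the kind of invocation this enables, assuming a recent vllm where the benchmark ships as the `vllm bench latency` subcommand (the flag names mirror `benchmarks/benchmark_latency.py` and may differ in your installed version; the model id is just an example):

```
# hypothetical direct invocation; eval.sh wraps this for you
MODEL=pytorch/Qwen3-8B-FP8
VLLM_DISABLE_COMPILE_CACHE=1 vllm bench latency --model $MODEL --input-len 256 --output-len 256 --batch-size 1
```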

7 files changed, +105 −102 lines changed


.github/scripts/torchao_model_releases/README.md

Lines changed: 57 additions & 14 deletions
@@ -1,8 +1,51 @@
-# Scripts for torchao model release and eval
+# Scripts for torchao Model Release and Eval
 
-Note: all commands below are run in directory: `.github/scripts/torchao_model_releases/`
+Note: all commands below should be run in directory: `.github/scripts/torchao_model_releases/`
 
-## Release
+## Frequently Used Commands
+### Release and Eval Scripts for New Model Releases
+```
+MODEL=Qwen/Qwen3-8B
+# Releasing all default quant options: FP8, INT4, INT8-INT4
+sh release.sh --model_id $MODEL --push_to_hub --populate_model_card_template
+
+# INT8-INT4 requires additional steps to export and run, so it's skipped from
+# the general eval here
+# Need to set QMODEL_PREFIX properly before running eval
+# QMODEL_PREFIX=pytorch/Qwen3-8B
+sh eval.sh --model_ids $MODEL "$QMODEL_PREFIX-FP8" "$QMODEL_PREFIX-INT4"
+
+# Some follow-up evals
+sh eval.sh --eval_type latency --batch_sizes 256 --model_ids "$QMODEL_PREFIX-FP8"
+sh eval.sh --eval_type quality --batch_sizes 256 --model_ids "$QMODEL_PREFIX-INT8-INT4"
+
+# Summarize all results
+sh summarize_results.sh --model_ids $MODEL "$QMODEL_PREFIX-FP8" "$QMODEL_PREFIX-INT4" "$QMODEL_PREFIX-INT8-INT4" "$QMODEL_PREFIX-AWQ-INT4"
+```
+
+### AWQ Release and Eval
+```
+MODEL=Qwen/Qwen3-8B
+TASK=mmlu_abstract_algebra
+python quantize_and_upload.py --model_id $MODEL --quant AWQ-INT4 --push_to_hub --task $TASK --calibration_limit 10 --populate_model_card_template
+sh eval.sh --model_ids $MODEL "$QMODEL_PREFIX-AWQ-INT4"
+```
+
+### Update Released Checkpoints in PyTorch
+Sometimes we may have to update the checkpoints under a different user name (organization) without changing the model card, e.g. for INT4:
+```
+MODEL=Qwen/Qwen3-8B
+sh release.sh --model_id $MODEL --quants INT4 --push_to_hub --push_to_user_id pytorch
+```
+
+Or for an AWQ checkpoint:
+```
+MODEL=Qwen/Qwen3-8B
+TASK=mmlu_abstract_algebra
+python quantize_and_upload.py --model_id $MODEL --quant AWQ-INT4 --task $TASK --calibration_limit 10 --push_to_hub --push_to_user_id pytorch
+```
+
+## Release Scripts
 ### default options
 By default, we release FP8, INT4, and INT8-INT4 checkpoints, with the model card pre-filled with template content that can be modified later, after we have eval results.
 
@@ -12,10 +55,10 @@ Examples:
 # the logged in user
 
 # release with default quant options (FP8, INT4, INT8-INT4)
-./release.sh --model_id Qwen/Qwen3-8B
+./release.sh --model_id Qwen/Qwen3-8B --push_to_hub
 
 # release a custom set of quant options
-./release.sh --model_id Qwen/Qwen3-8B --quants INT4 FP8
+./release.sh --model_id Qwen/Qwen3-8B --quants INT4 FP8 --push_to_hub
 ```
 
 Note: for the initial release, please include `--populate_model_card_template` to populate the model card template.
@@ -41,7 +84,7 @@ sh release.sh --model_id microsoft/Phi-4-mini-instruct --quants FP8 --push_to_hu
 
 This will update `pytorch/Phi-4-mini-instruct-FP8` without changing the model card.
 
-## Eval
+## Eval Scripts
 After we run the release script for a model, we can find the new models on the Hugging Face Hub page for the user (e.g. https://huggingface.co/torchao-testing). The models will have a model card filled in with template content, such as information about the model and eval instructions. There are a few things we need to fill in: 1. peak memory usage, 2. latency when running the model with vllm, and 3. quality measurement using lm-eval.
 
 ### Single Script
@@ -64,15 +107,15 @@ sh eval.sh --eval_type memory --model_ids Qwen/Qwen3-8B
 ```
 
 #### Latency Eval
-For latency eval, make sure vllm is cloned and installed from source,
-and `VLLM_DIR` should be set to the source directory of the cloned vllm repo.
+For latency eval, make sure vllm is installed.
+```
+uv pip install vllm
+```
+
+Or install vllm nightly:
 ```
-git clone https://github.com/vllm-project/vllm.git
-cd vllm
-VLLM_USE_PRECOMPILED=1 uv pip install --editable .
-export VLLM_DIR=path_to_vllm
+uv pip install vllm --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu126
 ```
-see https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation for more details.
 
 After the environment is set up, we can run eval:
 ```
@@ -82,7 +125,7 @@ sh eval.sh --eval_type latency --model_ids Qwen/Qwen3-8B --batch_sizes 1,256
 #### Model Quality Eval
 For model quality eval, we need to install lm-eval:
 ```
-pip install lm-eval
+uv pip install lm-eval
 ```
 After the environment is set up, we can run eval:
 ```
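The quality eval described in the README wraps lm-eval. For reference, a minimal standalone invocation of the lm-eval CLI of the kind the script presumably issues (the model id, task, and batch size here are placeholders; the exact arguments eval.sh passes may differ):

```
lm_eval --model hf --model_args pretrained=pytorch/Qwen3-8B-FP8 --tasks mmlu --batch_size 8
```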

.github/scripts/torchao_model_releases/eval.sh

Lines changed: 5 additions & 5 deletions
@@ -9,14 +9,14 @@ set -e
 source eval_env_checks.sh
 
 usage() {
-echo "Usage: $0 --eval_type <all|memory|latency|quality> --model_ids <model1> <model2> ... [--batch_sizes <batch_sizes>] [--tasks <tasks>]"
+echo "Usage: $0 --model_ids <model1> <model2> ... [--eval_type <all|memory|latency|quality>] [--batch_sizes <batch_sizes>] [--tasks <tasks>]"
 echo "Defaults:"
 echo " batch_sizes: 1 256"
 echo " tasks: mmlu"
 exit 1
 }
-EVAL_TYPE=""
 MODEL_ID_ARRAY=()
+EVAL_TYPE="all"
 # these will be parsed in the other scripts
 BATCH_SIZES="1 256" # Default for latency eval
 TASKS="mmlu" # Default for quality eval
@@ -64,8 +64,8 @@ while [[ $# -gt 0 ]]; do
 ;;
 esac
 done
-if [[ -z "$EVAL_TYPE" || ${#MODEL_ID_ARRAY[@]} -eq 0 ]]; then
-echo "Error: --eval_type and --model_ids are required"
+if [[ ${#MODEL_ID_ARRAY[@]} -eq 0 ]]; then
+echo "Error: --model_ids is required"
 usage
 fi
 
@@ -96,9 +96,9 @@ for MODEL_ID in "${MODEL_ID_ARRAY[@]}"; do
 run_quality "$MODEL_ID"
 ;;
 all)
+run_quality "$MODEL_ID"
 run_memory "$MODEL_ID"
 run_latency "$MODEL_ID"
-run_quality "$MODEL_ID"
 ;;
 *)
 echo "Unknown eval_type: $EVAL_TYPE"

.github/scripts/torchao_model_releases/eval_env_checks.sh

Lines changed: 1 addition & 6 deletions
@@ -12,13 +12,8 @@ check_torch() {
 }
 
 check_vllm() {
-# Check if VLLM_DIR is set
-if [ -z "$VLLM_DIR" ]; then
-echo "Error: VLLM_DIR environment variable is not set. Please set it before running this script."
-exit 1
-fi
 if ! pip show vllm > /dev/null 2>&1; then
-echo "Error: vllm package is NOT installed. please install from source: https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation" >&2
+echo "Error: vllm package is NOT installed. Please install it with 'pip install vllm'" >&2
 exit 1
 fi
 }
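A quick manual spot check of the simplified requirement before kicking off evals (eval.sh and eval_latency.sh already source this file and call `check_vllm` themselves, so this is only for convenience):

```
source eval_env_checks.sh
check_vllm   # prints an error and exits if vllm is not pip-installed
```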

.github/scripts/torchao_model_releases/eval_latency.sh

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ source eval_env_checks.sh
 check_vllm
 
 MODEL_ID_ARRAY=()
-BATCH_SIZE_ARRAY=(1 256) # default can be overwritten by user input
+BATCH_SIZE_ARRAY=(1) # default can be overwritten by user input
 INPUT_LEN="256" # default input length
 OUTPUT_LEN="256" # default output length
 # Parse arguments
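Note that eval_latency.sh's own default is now a single batch size of 1 (eval.sh still passes `1 256` by default). To request specific batch sizes explicitly, use the form documented in the README:

```
sh eval.sh --eval_type latency --model_ids Qwen/Qwen3-8B --batch_sizes 1,256
```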

.github/scripts/torchao_model_releases/eval_peak_memory_usage.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 
 def eval_peak_memory_usage(model_id: str):
     model = AutoModelForCausalLM.from_pretrained(
-        model_id, device_map="auto", torch_dtype=torch.bfloat16
+        model_id, device_map="cuda:0", torch_dtype=torch.bfloat16
     )
     tokenizer = AutoTokenizer.from_pretrained(model_id)
 
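The only change here is loading the model onto a single device (`cuda:0`), presumably so the reported peak reflects one GPU rather than being spread across devices by `device_map="auto"`. To regenerate the memory numbers after this change (command pattern from the README; the second model id is an example):

```
sh eval.sh --eval_type memory --model_ids Qwen/Qwen3-8B pytorch/Qwen3-8B-FP8
```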

.github/scripts/torchao_model_releases/quantize_and_upload.py

Lines changed: 20 additions & 57 deletions
@@ -36,7 +36,7 @@ def _get_username():
 
 def _untie_weights_and_save_locally(model_id):
     untied_model = AutoModelForCausalLM.from_pretrained(
-        model_id, torch_dtype="auto", device_map="auto"
+        model_id, torch_dtype="auto", device_map="cuda:0"
     )
 
     tokenizer = AutoTokenizer.from_pretrained(model_id)
@@ -209,15 +209,15 @@ def _untie_weights_and_save_locally(model_id):
 from torchao.quantization import Int4WeightOnlyConfig
 quant_config = Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")
 quantization_config = TorchAoConfig(quant_type=quant_config)
-quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
+quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="cuda:0", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 """
 
 _fp8_quant_code = """
 from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
 quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
 quantization_config = TorchAoConfig(quant_type=quant_config)
-quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
+quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="cuda:0", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 """
 
@@ -238,7 +238,7 @@ def _untie_weights_and_save_locally(model_id):
 )
 quant_config = ModuleFqnToConfig({{"_default": linear_config, "model.embed_tokens": embedding_config}})
 quantization_config = TorchAoConfig(quant_type=quant_config, include_input_output_embeddings=True, modules_to_not_convert=[])
-quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
+quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="cuda:0", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 """
 
@@ -250,12 +250,12 @@ def _untie_weights_and_save_locally(model_id):
 from torchao._models._eval import TransformerEvalWrapper
 model = AutoModelForCausalLM.from_pretrained(
     model_to_quantize,
-    device_map="auto",
+    device_map="cuda:0",
     torch_dtype=torch.bfloat16,
 )
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-base_config = Int4WeightOnlyConfig(group_size=128)
+base_config = Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")
 quant_config = AWQConfig(base_config, step="prepare")
 quantize_(
     model,
@@ -333,7 +333,7 @@ def _untie_weights_and_save_locally(model_id):
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
     torch_dtype="auto",
-    device_map="auto"
+    device_map="cuda:0"
 )
 
 # prepare the model input
@@ -394,7 +394,7 @@ def _untie_weights_and_save_locally(model_id):
 
 # use "{base_model}" or "{quantized_model}"
 model_id = "{quantized_model}"
-quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
+quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", torch_dtype=torch.bfloat16)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 torch.cuda.reset_peak_memory_stats()
@@ -438,7 +438,8 @@ def _untie_weights_and_save_locally(model_id):
 | Benchmark (Latency)              |                |                          |
 |----------------------------------|----------------|--------------------------|
 |                                  | {base_model}   | {quantized_model}        |
-| latency (batch_size=1)           | ?s             | ?s (?x speedup)          |
+| latency (batch_size=1)           | ?s             | ?s (?x speedup)          |
+| latency (batch_size=256)         | ?s             | ?s (?x speedup)          |
 
 <details>
 <summary> Reproduce Model Performance Results </summary>
@@ -470,48 +471,6 @@ def _untie_weights_and_save_locally(model_id):
 export MODEL={quantized_model}
 VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
 ```
-
-## benchmark_serving
-
-We benchmarked the throughput in a serving environment.
-
-Download sharegpt dataset:
-
-```Shell
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-```
-
-
-
-Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
-
-Note: you can change the number of prompts to be benchmarked with `--num-prompts` argument for `benchmark_serving` script.
-
-### baseline
-Server:
-```Shell
-export MODEL={base_model}
-vllm serve $MODEL --tokenizer $MODEL -O3
-```
-
-Client:
-```Shell
-export MODEL={base_model}
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer $MODEL --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
-```
-
-### {quant}
-Server:
-```Shell
-export MODEL={quantized_model}
-VLLM_DISABLE_COMPILE_CACHE=1 vllm serve $MODEL --tokenizer $MODEL -O3 --pt-load-map-location cuda:0
-
-```
-Client:
-```Shell
-export MODEL={quantized_model}
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer $MODEL --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
-```
 </details>
 """
 
@@ -538,7 +497,7 @@ def _untie_weights_and_save_locally(model_id):
 import torch
 
 model_id = "{base_model}"
-untied_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+untied_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda:0")
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 print(untied_model)
@@ -592,7 +551,7 @@ def _untie_weights_and_save_locally(model_id):
 python -m executorch.examples.models.qwen3.convert_weights $(hf download {quantized_model}) pytorch_model_converted.bin
 ```
 
-Once we have the checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024 to the XNNPACK backend as follows. 
+Once we have the checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024 to the XNNPACK backend as follows.
 
 [TODO: fix config path in note where necessary]
 (Note: ExecuTorch LLM export script requires config.json have certain key names. The correct config to use for the LLM export script is located at examples/models/qwen3/config/4b_config.json within the ExecuTorch repo.)
@@ -611,7 +570,7 @@ def _untie_weights_and_save_locally(model_id):
 --max_context_length 1024 \
 --max_seq_length 1024 \
 --dtype fp32 \
---metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'
+--metadata '{{"get_bos_id":199999, "get_eos_ids":[200020,199999]}}'
 ```
 
 After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).
@@ -672,12 +631,16 @@ def quantize_and_upload(
 assert quant == "AWQ-INT4", "Only support AWQ-INT4 for now"
 model = AutoModelForCausalLM.from_pretrained(
     model_to_quantize,
-    device_map="auto",
+    device_map="cuda:0",
     torch_dtype=torch.bfloat16,
 )
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-base_config = Int4WeightOnlyConfig(group_size=128)
+base_config = Int4WeightOnlyConfig(
+    group_size=128,
+    int4_packing_format="tile_packed_to_4d",
+    int4_choose_qparams_algorithm="hqq",
+)
 quant_config = AWQConfig(base_config, step="prepare")
 quantize_(
     model,
@@ -712,7 +675,7 @@ def quantize_and_upload(
 )
 quantized_model = AutoModelForCausalLM.from_pretrained(
     model_to_quantize,
-    device_map="auto",
+    device_map="cuda:0",
     torch_dtype=torch.bfloat16,
     quantization_config=quantization_config,
 )
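These hunks switch model loading to a single CUDA device, drop the benchmark_serving section from the model card template, and make the AWQ path reuse the same Int4WeightOnlyConfig settings (`tile_packed_to_4d` packing, HQQ qparams) as the plain INT4 release. An end-to-end AWQ release using this script, as documented in the README above:

```
MODEL=Qwen/Qwen3-8B
TASK=mmlu_abstract_algebra
python quantize_and_upload.py --model_id $MODEL --quant AWQ-INT4 --push_to_hub --task $TASK --calibration_limit 10 --populate_model_card_template
```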
