
Commit 85614a4

Merge remote-tracking branch 'origin/main' into wengshiy/register
2 parents: d6d63f0 + 18dbe87

152 files changed: +5848 / −4213 lines

.github/scripts/torchao_model_releases/README.md

Lines changed: 12 additions & 1 deletion
@@ -18,6 +18,8 @@ Examples:
 ./release.sh --model_id Qwen/Qwen3-8B --quants INT4 FP8
 ```

+Note: for the initial release, please include `--populate_model_card_template` to populate the model card template.
+
 ### AWQ-INT4
 [AWQ](https://arxiv.org/abs/2306.00978) is a technique to improve accuracy for weight-only quantization. It preserves "salient" weight channels that have a high impact on the output: each such weight channel is multiplied by a scale, and the corresponding activation is divided by the same scale. Since the activation is not quantized, this introduces no additional activation error, while the quantization error from the weight is reduced.

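To make the scale-and-inverse-scale idea above concrete, here is a minimal sketch of the algebraic identity AWQ relies on (illustration only, not the torchao implementation; shapes and the scale value are made up):

```python
import torch

# Scaling the "salient" weight input channels up and the matching activation
# channels down leaves the linear output unchanged, but the scaled weight
# quantizes with less error.
x = torch.randn(4, 8)           # activations, kept in high precision
W = torch.randn(16, 8)          # linear weight (out_features x in_features)
s = torch.full((8,), 2.0)       # per-input-channel scales

y_ref = x @ W.t()
y_awq = (x / s) @ (W * s).t()   # scale weight channels, inverse-scale activations
print(torch.allclose(y_ref, y_awq, atol=1e-5))  # True
```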
@@ -30,6 +32,15 @@ Examples:
 python quantize_and_upload.py --model_id Qwen/Qwen3-8B --quant AWQ-INT4 --push_to_hub --task bbh --calibration_limit 2
 ```

+### Update checkpoints for a different user_id (e.g. pytorch)
+Sometimes we may want to update the checkpoints for a different user_id without changing the model card. For this we can use `--push_to_user_id`, e.g.
+
+```
+sh release.sh --model_id microsoft/Phi-4-mini-instruct --quants FP8 --push_to_hub --push_to_user_id pytorch
+```
+
+This will update `pytorch/Phi-4-mini-instruct-FP8` without changing the model card.
+
 ## Eval
 After we run the release script for a model, the new models appear on the Hugging Face Hub page for the user, e.g. https://huggingface.co/torchao-testing. Each model has a model card filled in with template content, such as information about the model and eval instructions. There are a few things we still need to fill in: 1. peak memory usage, 2. latency when running the model with vLLM, and 3. quality measured with lm-eval.

@@ -78,7 +89,7 @@ After environment is setup, we can run eval:
 sh eval.sh --eval_type quality --model_ids Qwen/Qwen3-8B --tasks hellaswag,mmlu
 ```

-# ### Summarize results
+#### Summarize results
 After we have finished all evals for each model, we can summarize the results with:
 ```
 sh summarize_results.sh --model_ids Qwen/Qwen3-8B pytorch/Qwen3-8B-INT4
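The Eval section above asks for peak memory usage, vLLM latency, and lm-eval quality numbers to be filled into the model card. As a minimal sketch, peak memory could be measured along these lines (assumes a CUDA machine with transformers and torchao installed; `pytorch/Qwen3-8B-INT4` is just the example checkpoint id used above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Qwen3-8B-INT4"  # example quantized checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda:0", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("What is the capital of France?", return_tensors="pt").to("cuda:0")
model.generate(**inputs, max_new_tokens=64)
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```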

.github/scripts/torchao_model_releases/eval.sh

Lines changed: 1 addition & 1 deletion
@@ -110,5 +110,5 @@ done

 # Run summarize_results.sh with MODEL_IDS if eval_type is "all"
 if [[ "$EVAL_TYPE" == "all" ]]; then
-    sh summarize_results.sh --model_id "${MODEL_ID_ARRAY[@]}"
+    sh summarize_results.sh --model_ids "${MODEL_ID_ARRAY[@]}"
 fi

.github/scripts/torchao_model_releases/eval_latency.sh

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ for MODEL_ID in "${MODEL_ID_ARRAY[@]}"; do
     for BATCH_SIZE in "${BATCH_SIZE_ARRAY[@]}"; do
         OUTPUT_FILE="$ORIG_DIR/${SAFE_MODEL_ID}_latency_batch${BATCH_SIZE}_in${INPUT_LEN}_out${OUTPUT_LEN}.log"
         echo "Running latency eval for model $MODEL_ID with batch size $BATCH_SIZE with input length: $INPUT_LEN and output length: $OUTPUT_LEN"
-        VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len $INPUT_LEN --output-len $OUTPUT_LEN --model $MODEL_ID --batch-size $BATCH_SIZE > "$OUTPUT_FILE" 2>&1
+        VLLM_DISABLE_COMPILE_CACHE=1 vllm bench latency --input-len $INPUT_LEN --output-len $OUTPUT_LEN --model $MODEL_ID --batch-size $BATCH_SIZE > "$OUTPUT_FILE" 2>&1
         echo "Latency eval result saved to $OUTPUT_FILE"
     done
     echo "======================== Eval Latency $MODEL_ID End ========================="

.github/scripts/torchao_model_releases/quantize_and_upload.py

Lines changed: 52 additions & 27 deletions
@@ -5,6 +5,7 @@
 # LICENSE file in the root directory of this source tree.

 import argparse
+from typing import List

 import torch
 from huggingface_hub import ModelCard, get_token, whoami
@@ -206,7 +207,7 @@ def _untie_weights_and_save_locally(model_id):

 _int4_quant_code = """
 from torchao.quantization import Int4WeightOnlyConfig
-quant_config = Int4WeightOnlyConfig(group_size=128, packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq", version=2)
+quant_config = Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")
 quantization_config = TorchAoConfig(quant_type=quant_config)
 quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
@@ -230,12 +231,10 @@ def _untie_weights_and_save_locally(model_id):
 embedding_config = IntxWeightOnlyConfig(
     weight_dtype=torch.int8,
     granularity=PerAxis(0),
-    version=2,
 )
 linear_config = Int8DynamicActivationIntxWeightConfig(
     weight_dtype=torch.int4,
     weight_granularity=PerGroup(32),
-    version=2,
 )
 quant_config = ModuleFqnToConfig({{"_default": linear_config, "model.embed_tokens": embedding_config}})
 quantization_config = TorchAoConfig(quant_type=quant_config, include_input_output_embeddings=True, modules_to_not_convert=[])
@@ -256,7 +255,7 @@ def _untie_weights_and_save_locally(model_id):
 )
 tokenizer = AutoTokenizer.from_pretrained(model_id)

-base_config = Int4WeightOnlyConfig(group_size=128, version=2)
+base_config = Int4WeightOnlyConfig(group_size=128)
 quant_config = AWQConfig(base_config, step="prepare")
 quantize_(
     model,
585584
Once ExecuTorch is [set-up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.
586585
587586
ExecuTorch's LLM export scripts require the checkpoint keys and parameters have certain names, which differ from those used in Hugging Face.
588-
So we first use a conversion script that converts the Hugging Face checkpoint key names to ones that ExecuTorch expects:
587+
So we first use a script that converts the Hugging Face checkpoint key names to ones that ExecuTorch expects:
588+
The following script does this for you.
589589
590590
[TODO: fix command below where necessary]
591591
```Shell
592592
python -m executorch.examples.models.qwen3.convert_weights $(hf download {quantized_model}) pytorch_model_converted.bin
593593
```
594594
595-
Once we have the checkpoint, we export it to ExecuTorch with the XNNPACK backend as follows.
596-
(ExecuTorch LLM export script requires config.json have certain key names. The correct config to use for the LLM export script is located at [TODO: fill in, e.g., examples/models/qwen3/config/4b_config.json] within the ExecuTorch repo.)
595+
Once we have the checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024 to the XNNPACK backend as follows.
596+
597+
[TODO: fix config path in note where necessary]
598+
(Note: ExecuTorch LLM export script requires config.json have certain key names. The correct config to use for the LLM export script is located at examples/models/qwen3/config/4b_config.json within the ExecuTorch repo.)
597599
598600
[TODO: fix command below where necessary]
599601
```Shell
600602
python -m executorch.examples.models.llama.export_llama \
601-
--model "qwen3_4b" \
602-
--checkpoint pytorch_model_converted.bin \
603-
--params examples/models/qwen3/config/4b_config.json \
604-
--output_name="model.pte" \
605-
-kv \
606-
--use_sdpa_with_kv_cache \
607-
-X \
608-
--xnnpack-extended-ops \
609-
--max_context_length 1024 \
610-
--max_seq_length 1024 \
611-
--dtype fp32 \
612-
--metadata '{{"get_bos_id":199999, "get_eos_ids":[200020,199999]}}'
603+
--model "qwen3_4b" \
604+
--checkpoint pytorch_model_converted.bin \
605+
--params examples/models/qwen3/config/4b_config.json \
606+
--output_name model.pte \
607+
-kv \
608+
--use_sdpa_with_kv_cache \
609+
-X \
610+
--xnnpack-extended-ops \
611+
--max_context_length 1024 \
612+
--max_seq_length 1024 \
613+
--dtype fp32 \
614+
--metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'
613615
```
614616
615617
After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).
618+
619+
(We try to keep these instructions up-to-date, but if you find they do not work, check out our [CI test in ExecuTorch](https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_torchao_huggingface_checkpoints.sh) for the latest source of truth, and let us know we need to update our model card.)
616620
"""
617621

618622

619623
def quantize_and_upload(
620-
model_id, quant, tasks, calibration_limit, max_seq_length, push_to_hub
624+
model_id: str,
625+
quant: str,
626+
tasks: List[str],
627+
calibration_limit: int,
628+
max_seq_length: int,
629+
push_to_hub: bool,
630+
push_to_user_id: str,
631+
populate_model_card_template: bool,
621632
):
622633
_int8_int4_linear_config = Int8DynamicActivationIntxWeightConfig(
623634
weight_dtype=torch.int4,
624635
weight_granularity=PerGroup(32),
625-
version=2,
626636
)
627637
_int8_int4_embedding_config = IntxWeightOnlyConfig(
628638
weight_dtype=torch.int8,
629639
granularity=PerAxis(0),
630-
version=2,
631640
)
632641
quant_to_config = {
633642
"FP8": Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()),
634643
"INT4": Int4WeightOnlyConfig(
635644
group_size=128,
636-
packing_format="tile_packed_to_4d",
645+
int4_packing_format="tile_packed_to_4d",
637646
int4_choose_qparams_algorithm="hqq",
638-
version=2,
639647
),
640648
"INT8-INT4": ModuleFqnToConfig(
641649
{
@@ -669,7 +677,7 @@ def quantize_and_upload(
         )
         tokenizer = AutoTokenizer.from_pretrained(model_id)

-        base_config = Int4WeightOnlyConfig(group_size=128, version=2)
+        base_config = Int4WeightOnlyConfig(group_size=128)
         quant_config = AWQConfig(base_config, step="prepare")
         quantize_(
             model,
@@ -713,7 +721,9 @@ def quantize_and_upload(
     username = _get_username()

     MODEL_NAME = model_id.split("/")[-1]
-    save_to = f"{username}/{MODEL_NAME}-{quant}"
+
+    save_to_user_id = username if push_to_user_id is None else push_to_user_id
+    save_to = f"{save_to_user_id}/{MODEL_NAME}-{quant}"
     untied_model_path = 'f"{{MODEL_NAME}}-untied-weights"'
     is_mobile = quant == "INT8-INT4"
     quantized_model_id = save_to
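As a concrete illustration of the new `save_to` logic above (values taken from the README example in this commit; the username is hypothetical):

```python
model_id = "microsoft/Phi-4-mini-instruct"
quant = "FP8"
username = "some-user"        # whoever is authenticated with the Hub
push_to_user_id = "pytorch"   # value passed via --push_to_user_id

MODEL_NAME = model_id.split("/")[-1]
save_to_user_id = username if push_to_user_id is None else push_to_user_id
save_to = f"{save_to_user_id}/{MODEL_NAME}-{quant}"
print(save_to)  # pytorch/Phi-4-mini-instruct-FP8
```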
@@ -759,7 +769,8 @@ def quantize_and_upload(
     if push_to_hub:
         quantized_model.push_to_hub(quantized_model_id, safe_serialization=False)
         tokenizer.push_to_hub(quantized_model_id)
-        card.push_to_hub(quantized_model_id)
+        if populate_model_card_template:
+            card.push_to_hub(quantized_model_id)
     else:
         quantized_model.save_pretrained(quantized_model_id, safe_serialization=False)
         tokenizer.save_pretrained(quantized_model_id)
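For context, `card` is a `huggingface_hub.ModelCard` (imported at the top of the script); a rough sketch of the calls involved, simplified from what the script actually does with its template:

```python
from huggingface_hub import ModelCard

# Load an existing card, tweak it, and push it to a (hypothetical) repo.
card = ModelCard.load("Qwen/Qwen3-8B")
card.text += "\n\nQuantized with torchao."  # placeholder edit, not the real template
card.push_to_hub("pytorch/Qwen3-8B-INT4")   # only runs when --populate_model_card_template is set
```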
@@ -828,6 +839,18 @@ def quantize_and_upload(
         default=False,
         help="Flag to indicate whether push to huggingface hub or not",
     )
+    parser.add_argument(
+        "--push_to_user_id",
+        type=str,
+        default=None,
+        help="The user_id to use for pushing the quantized model, only used when --push_to_hub is set",
+    )
+    parser.add_argument(
+        "--populate_model_card_template",
+        action="store_true",
+        default=False,
+        help="Flag to indicate whether push model card to huggingface hub or not",
+    )
     args = parser.parse_args()
     quantize_and_upload(
         args.model_id,
@@ -836,4 +859,6 @@ def quantize_and_upload(
         args.calibration_limit,
         args.max_seq_length,
         args.push_to_hub,
+        args.push_to_user_id,
+        args.populate_model_card_template,
     )
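Pulling the changes in this file together, a sketch of an equivalent direct call with the expanded signature (the script normally builds these arguments via argparse; the values below are illustrative, not the script's defaults):

```python
quantize_and_upload(
    model_id="Qwen/Qwen3-8B",
    quant="FP8",
    tasks=["bbh"],
    calibration_limit=2,
    max_seq_length=2048,
    push_to_hub=False,
    push_to_user_id=None,
    populate_model_card_template=False,
)
```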

.github/scripts/torchao_model_releases/release.sh

Lines changed: 14 additions & 8 deletions
@@ -6,15 +6,13 @@

 #!/bin/bash

-# Example uses
-# release with default quant options (FP8, INT4, INT8-INT4)
-# ./release.sh --model_id Qwen/Qwen3-8B
-# release a custom set of quant options
-# ./release.sh --model_id Qwen/Qwen3-8B --quants INT4 FP8
+# see README.md for instructions

 # Default quantization options
 default_quants=("FP8" "INT4" "INT8-INT4")
 push_to_hub=""
+push_to_user_id=""
+populate_model_card_template=""
 # Parse arguments
 while [[ $# -gt 0 ]]; do
   case "$1" in
@@ -34,6 +32,14 @@ while [[ $# -gt 0 ]]; do
       push_to_hub="--push_to_hub"
       shift
       ;;
+    --push_to_user_id)
+      push_to_user_id=("--push_to_user_id $2")
+      shift 2
+      ;;
+    --populate_model_card_template)
+      populate_model_card_template="--populate_model_card_template"
+      shift
+      ;;
     *)
       echo "Unknown option: $1"
       exit 1
@@ -43,14 +49,14 @@ done
 # Use default quants if none specified
 if [[ -z "$model_id" ]]; then
   echo "Error: --model_id is required"
-  echo "Usage: $0 --model_id <model_id> [--quants <quant1> [quant2 ...]] [--push_to_hub]"
+  echo "Usage: $0 --model_id <model_id> [--quants <quant1> [quant2 ...]] [--push_to_hub] [--push_to_user_id <push_to_user_id>] [--populate_model_card_template]"
   exit 1
 fi
 if [[ ${#quants[@]} -eq 0 ]]; then
   quants=("${default_quants[@]}")
 fi
 # Run the python command for each quantization option
 for quant in "${quants[@]}"; do
-  echo "Running: python quantize_and_upload.py --model_id $model_id --quant $quant $push_to_hub"
-  python quantize_and_upload.py --model_id "$model_id" --quant "$quant" $push_to_hub
+  echo "Running: python quantize_and_upload.py --model_id $model_id --quant $quant $push_to_hub $push_to_user_id $populate_model_card_template"
+  python quantize_and_upload.py --model_id "$model_id" --quant "$quant" $push_to_hub $push_to_user_id $populate_model_card_template
 done

.github/workflows/torchao_experimental_test.yml renamed to .github/workflows/regression_test_aarch64.yml

Lines changed: 7 additions & 4 deletions
@@ -1,4 +1,4 @@
-name: Run TorchAO Experimental Tests
+name: Run Regression Tests (aarch64)

 on:
   push:
@@ -44,17 +44,20 @@ jobs:
         if: runner.os == 'Linux'
         run: |
           conda activate venv
+          pip install coremltools
           pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cpu --force-reinstall
           pip install -r dev-requirements.txt
           BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP pip install .
       - name: Run python tests
         run: |
           conda activate venv
           pytest -s test/quantization/test_int8_dynamic_activation_intx_weight_config_v1.py
-          pytest -s torchao/experimental/tests/test_embedding_xbit_quantizer.py
-          pytest -s torchao/experimental/tests/test_quant_passes.py
-          pytest -s test/prototype/test_dynamic_activation_lut.py
           pytest -s test/quantization/quantize_/workflows/intx/test_intx_opaque_tensor.py
+          pytest -s test/prototype/test_embedding.py
+          pytest -s test/prototype/test_int8_lut_tensor.py
+          pytest -s test/prototype/test_tensor_conversion.py
+          pytest -s test/prototype/test_groupwise_lowbit_weight_lut_quantizer.py
+          pytest -s test/prototype/test_parq.py
       - name: torchao/csrc/cpu - build and run C++ tests
         if: runner.os == 'macOS'
         run: |

benchmarks/benchmark_aq.py

Lines changed: 6 additions & 6 deletions
@@ -10,10 +10,10 @@
 import torch

 from torchao.quantization.quant_api import (
+    Int4WeightOnlyConfig,
+    Int8DynamicActivationInt8WeightConfig,
+    Int8WeightOnlyConfig,
     _replace_with_custom_fn_if_matches_filter,
-    int4_weight_only,
-    int8_dynamic_activation_int8_weight,
-    int8_weight_only,
     quantize_,
 )
 from torchao.quantization.subclass import (
@@ -23,13 +23,13 @@


 def _int8wo_api(mod, **kwargs):
-    quantize_(mod, int8_weight_only(**kwargs), set_inductor_config=False)
+    quantize_(mod, Int8WeightOnlyConfig(**kwargs), set_inductor_config=False)


 def _int8da_int8w_api(mod, **kwargs):
     quantize_(
         mod,
-        int8_dynamic_activation_int8_weight(**kwargs),
+        Int8DynamicActivationInt8WeightConfig(**kwargs),
         set_inductor_config=False,
     )

@@ -39,7 +39,7 @@ def _int4wo_api(mod, **kwargs):
     if "groupsize" in kwargs_copy:
         kwargs_copy["group_size"] = kwargs_copy["groupsize"]
         del kwargs_copy["groupsize"]
-    quantize_(mod, int4_weight_only(**kwargs_copy), set_inductor_config=False)
+    quantize_(mod, Int4WeightOnlyConfig(**kwargs_copy), set_inductor_config=False)


 class ToyLinearModel(torch.nn.Module):
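The renames in this file swap the deprecated lower-case helper functions for the corresponding config classes. A minimal standalone sketch of the config-class API (toy module and settings, not taken from the benchmark):

```python
import torch
from torchao.quantization.quant_api import Int8WeightOnlyConfig, quantize_

# Quantize the weights of a toy model in place using the config-object API.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).to("cuda")
quantize_(model, Int8WeightOnlyConfig(), set_inductor_config=False)
print(type(model[0].weight))  # a torchao quantized tensor subclass
```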

benchmarks/float8/training/llama3.sh

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@ cd ${TORCHTITAN_ROOT}
 echo "float8 args: ${FLOAT8_ARGS}"

 # run the command with the specified arguments
-CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ${TORCHTITAN_ROOT}/run_train.sh --training.steps=${STEPS} --training.local-batch-size=${LOCAL_BATCH_SIZE} --training.compile ${FLOAT8_ARGS} ${EXTRA_ARGS} 2>&1 | tee ${LOG_FILE}
+CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ${TORCHTITAN_ROOT}/run_train.sh --training.steps=${STEPS} --training.local-batch-size=${LOCAL_BATCH_SIZE} --compile.enable ${FLOAT8_ARGS} ${EXTRA_ARGS} 2>&1 | tee ${LOG_FILE}

 # return to original working directory
 cd $original_dir

benchmarks/microbenchmarks/utils.py

Lines changed: 1 addition & 3 deletions
@@ -260,7 +260,6 @@ def string_to_config(
             "int8_dynamic_activation_intx_weight requires using high_precision_dtype=torch.float32"
         )

-        from torchao.dtypes import PackedLinearInt8DynamicActivationIntxWeightLayout
         from torchao.quantization.granularity import PerAxis, PerGroup
         from torchao.quantization.quant_api import (
             Int8DynamicActivationIntxWeightConfig,
@@ -278,8 +277,7 @@ def string_to_config(
             weight_mapping_type=MappingType.ASYMMETRIC
             if is_asymmetric
             else MappingType.SYMMETRIC,
-            weight_scale_dtype=torch.bfloat16,
-            layout=PackedLinearInt8DynamicActivationIntxWeightLayout(),
+            intx_packing_format="opaque_torchao_auto",
         )
     elif "float8wo" in quantization:
         return Float8WeightOnlyConfig()

benchmarks/prototype/moe_training/benchmark_2d_3d_grouped_gemms.py renamed to benchmarks/prototype/moe_training/bench_2d_3d_grouped_gemm.py

Lines changed: 3 additions & 3 deletions
@@ -18,7 +18,7 @@
 from torchao.float8.config import ScalingGranularity
 from torchao.float8.float8_utils import tensor_to_scale, to_fp8_saturated
 from torchao.prototype.moe_training.kernels.mxfp8_blocked_scales import (
-    torch_to_blocked_per_group_2d,
+    torch_to_blocked_2d_M_groups,
     torch_to_blocked_per_group_3d,
 )
 from torchao.prototype.moe_training.utils import generate_jagged_offs
@@ -230,8 +230,8 @@ def bench_mxfp8_grouped_mm(A, B_t, offs, block_size=32) -> float:

     # Convert scales for each group to blocked format.
     Mg, K = A_fp8.shape
-    A_scales_blocked, starting_row_after_padding = torch_to_blocked_per_group_2d(
-        A_scales, offs, Mg, K
+    A_scales_blocked, starting_row_after_padding = torch_to_blocked_2d_M_groups(
+        A_scales, offs, K
     )
     B_scales_blocked = torch_to_blocked_per_group_3d(B_scales)
