
Commit 38b5e8f

Update base for Update on "Remove internal usage of all config functions like int4_weight_only"
**Summary:** These are now deprecated as of #2994. We should stop using them internally as well.

**Test Plan:** CI

[ghstack-poisoned]
2 parents: 0c85173 + ea8c00f
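For context, the deprecation referenced above moves call sites from the old function-style quantization API to the equivalent config classes. A minimal sketch of the kind of change, assuming the standard `torchao.quantization` entry points (group size chosen for illustration):

```python
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# Toy model; int4 weight-only quantization expects bf16 weights on CUDA.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).to("cuda")

# Old, now-deprecated function-style API (per #2994):
#   from torchao.quantization import int4_weight_only
#   quantize_(model, int4_weight_only(group_size=128))

# Config-class style that internal call sites move to:
quantize_(model, Int4WeightOnlyConfig(group_size=128))
```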


94 files changed: +4136 −3497 lines


.github/scripts/torchao_model_releases/README.md

Lines changed: 12 additions & 1 deletion
@@ -18,6 +18,8 @@ Examples:
 ./release.sh --model_id Qwen/Qwen3-8B --quants INT4 FP8
 ```
 
+Note: for the initial release, please include `--populate_model_card_template` to populate the model card template.
+
 ### AWQ-INT4
 [AWQ](https://arxiv.org/abs/2306.00978) is a technique to improve accuracy for weight-only quantization. It improves accuracy by preserving "salient" weight channels that have a high impact on the output: each such weight channel is multiplied by a scale, and the corresponding activation is divided by the same scale. Since activations are not quantized, this adds no activation error, while the quantization error on the weights is reduced.
 
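The AWQ description above can be summarized in a few lines; this is a minimal sketch of the scaling identity it relies on, not torchao's implementation:

```python
import torch

# Scale up a "salient" weight input channel and scale down the matching
# activation column; the float product is unchanged: (x / s) @ (s * W) == x @ W.
# Only the weights are quantized afterwards, so the boosted channel loses
# less precision while activations stay exact.
x = torch.randn(4, 8)            # activations (not quantized)
W = torch.randn(8, 16)           # weights to be quantized
s = torch.ones(8)
s[3] = 4.0                       # boost one salient input channel

y_ref = x @ W
y_awq = (x / s) @ (W * s.unsqueeze(1))
assert torch.allclose(y_ref, y_awq, atol=1e-5)
```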

@@ -30,6 +32,15 @@ Examples:
 python quantize_and_upload.py --model_id Qwen/Qwen3-8B --quant AWQ-INT4 --push_to_hub --task bbh --calibration_limit 2
 ```
 
+### Update checkpoints for a different user_id (e.g. pytorch)
+Sometimes we may want to update the checkpoints for a different user_id without changing the model card. For this we can use `--push_to_user_id`, e.g.
+
+```
+sh release.sh --model_id microsoft/Phi-4-mini-instruct --quants FP8 --push_to_hub --push_to_user_id pytorch
+```
+
+This will update `pytorch/Phi-4-mini-instruct-FP8` without changing the model card.
+
 ## Eval
 After we run the release script for a model, we can find the new models on the Hugging Face Hub page for the user, e.g. https://huggingface.co/torchao-testing. The models will have a model card filled in with template content, such as information about the model and eval instructions. There are a few things we still need to fill in: 1. peak memory usage, 2. latency when running the model with vllm, and 3. quality measurement with lm-eval.
 
@@ -78,7 +89,7 @@ After environment is setup, we can run eval:
 sh eval.sh --eval_type quality --model_ids Qwen/Qwen3-8B --tasks hellaswag,mmlu
 ```
 
-# ### Summarize results
+#### Summarize results
 After we have finished all evals for each model, we can summarize the results with:
 ```
 sh summarize_results.sh --model_ids Qwen/Qwen3-8B pytorch/Qwen3-8B-INT4

.github/scripts/torchao_model_releases/eval.sh

Lines changed: 1 addition & 1 deletion
@@ -110,5 +110,5 @@ done
 
 
 # Run summarize_results.sh with MODEL_IDS if eval_type is "all"
 if [[ "$EVAL_TYPE" == "all" ]]; then
-    sh summarize_results.sh --model_id "${MODEL_ID_ARRAY[@]}"
+    sh summarize_results.sh --model_ids "${MODEL_ID_ARRAY[@]}"
 fi
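For reference, the corrected branch above is reached when the script is invoked with `--eval_type all`, which runs every eval and then summarizes. A sketch of such an invocation, with flags taken from the README (model ids illustrative):

```Shell
sh eval.sh --eval_type all --model_ids Qwen/Qwen3-8B pytorch/Qwen3-8B-INT4
```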

.github/scripts/torchao_model_releases/eval_latency.sh

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ for MODEL_ID in "${MODEL_ID_ARRAY[@]}"; do
   for BATCH_SIZE in "${BATCH_SIZE_ARRAY[@]}"; do
     OUTPUT_FILE="$ORIG_DIR/${SAFE_MODEL_ID}_latency_batch${BATCH_SIZE}_in${INPUT_LEN}_out${OUTPUT_LEN}.log"
     echo "Running latency eval for model $MODEL_ID with batch size $BATCH_SIZE with input length: $INPUT_LEN and output length: $OUTPUT_LEN"
-    VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len $INPUT_LEN --output-len $OUTPUT_LEN --model $MODEL_ID --batch-size $BATCH_SIZE > "$OUTPUT_FILE" 2>&1
+    VLLM_DISABLE_COMPILE_CACHE=1 vllm bench latency --input-len $INPUT_LEN --output-len $OUTPUT_LEN --model $MODEL_ID --batch-size $BATCH_SIZE > "$OUTPUT_FILE" 2>&1
     echo "Latency eval result saved to $OUTPUT_FILE"
   done
   echo "======================== Eval Latency $MODEL_ID End ========================="

.github/scripts/torchao_model_releases/quantize_and_upload.py

Lines changed: 48 additions & 22 deletions
@@ -5,6 +5,7 @@
 # LICENSE file in the root directory of this source tree.
 
 import argparse
+from typing import List
 
 import torch
 from huggingface_hub import ModelCard, get_token, whoami
@@ -230,12 +231,10 @@ def _untie_weights_and_save_locally(model_id):
 embedding_config = IntxWeightOnlyConfig(
     weight_dtype=torch.int8,
     granularity=PerAxis(0),
-    version=2,
 )
 linear_config = Int8DynamicActivationIntxWeightConfig(
     weight_dtype=torch.int4,
     weight_granularity=PerGroup(32),
-    version=2,
 )
 quant_config = ModuleFqnToConfig({{"_default": linear_config, "model.embed_tokens": embedding_config}})
 quantization_config = TorchAoConfig(quant_type=quant_config, include_input_output_embeddings=True, modules_to_not_convert=[])
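The snippet above lives in the generated model card template (hence the doubled braces). A minimal sketch of how a reader of that card would apply the same config when loading a model with transformers, assuming the torchao imports used elsewhere in this file (model id illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
    ModuleFqnToConfig,
)

# int8 per-axis weight-only for embeddings, int8-activation/int4-weight for linears.
embedding_config = IntxWeightOnlyConfig(weight_dtype=torch.int8, granularity=PerAxis(0))
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4, weight_granularity=PerGroup(32)
)
quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
quantization_config = TorchAoConfig(
    quant_type=quant_config,
    include_input_output_embeddings=True,
    modules_to_not_convert=[],
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
```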
@@ -585,49 +584,59 @@ def _untie_weights_and_save_locally(model_id):
 Once ExecuTorch is [set-up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.
 
 ExecuTorch's LLM export scripts require the checkpoint keys and parameters have certain names, which differ from those used in Hugging Face.
-So we first use a conversion script that converts the Hugging Face checkpoint key names to ones that ExecuTorch expects:
+So we first use a script that converts the Hugging Face checkpoint key names to ones that ExecuTorch expects:
+The following script does this for you.
 
 [TODO: fix command below where necessary]
 ```Shell
 python -m executorch.examples.models.qwen3.convert_weights $(hf download {quantized_model}) pytorch_model_converted.bin
 ```
 
-Once we have the checkpoint, we export it to ExecuTorch with the XNNPACK backend as follows.
-(ExecuTorch LLM export script requires config.json have certain key names. The correct config to use for the LLM export script is located at [TODO: fill in, e.g., examples/models/qwen3/config/4b_config.json] within the ExecuTorch repo.)
+Once we have the checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024 to the XNNPACK backend as follows.
+
+[TODO: fix config path in note where necessary]
+(Note: ExecuTorch LLM export script requires config.json have certain key names. The correct config to use for the LLM export script is located at examples/models/qwen3/config/4b_config.json within the ExecuTorch repo.)
 
 [TODO: fix command below where necessary]
 ```Shell
 python -m executorch.examples.models.llama.export_llama \
-  --model "qwen3_4b" \
-  --checkpoint pytorch_model_converted.bin \
-  --params examples/models/qwen3/config/4b_config.json \
-  --output_name="model.pte" \
-  -kv \
-  --use_sdpa_with_kv_cache \
-  -X \
-  --xnnpack-extended-ops \
-  --max_context_length 1024 \
-  --max_seq_length 1024 \
-  --dtype fp32 \
-  --metadata '{{"get_bos_id":199999, "get_eos_ids":[200020,199999]}}'
+  --model "qwen3_4b" \
+  --checkpoint pytorch_model_converted.bin \
+  --params examples/models/qwen3/config/4b_config.json \
+  --output_name model.pte \
+  -kv \
+  --use_sdpa_with_kv_cache \
+  -X \
+  --xnnpack-extended-ops \
+  --max_context_length 1024 \
+  --max_seq_length 1024 \
+  --dtype fp32 \
+  --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'
 ```
 
 After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).
+
+(We try to keep these instructions up-to-date, but if you find they do not work, check out our [CI test in ExecuTorch](https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_torchao_huggingface_checkpoints.sh) for the latest source of truth, and let us know we need to update our model card.)
 """
 
 
 def quantize_and_upload(
-    model_id, quant, tasks, calibration_limit, max_seq_length, push_to_hub
+    model_id: str,
+    quant: str,
+    tasks: List[str],
+    calibration_limit: int,
+    max_seq_length: int,
+    push_to_hub: bool,
+    push_to_user_id: str,
+    populate_model_card_template: bool,
 ):
     _int8_int4_linear_config = Int8DynamicActivationIntxWeightConfig(
         weight_dtype=torch.int4,
         weight_granularity=PerGroup(32),
-        version=2,
     )
     _int8_int4_embedding_config = IntxWeightOnlyConfig(
         weight_dtype=torch.int8,
         granularity=PerAxis(0),
-        version=2,
     )
     quant_to_config = {
         "FP8": Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()),
@@ -712,7 +721,9 @@ def quantize_and_upload(
     username = _get_username()
 
     MODEL_NAME = model_id.split("/")[-1]
-    save_to = f"{username}/{MODEL_NAME}-{quant}"
+
+    save_to_user_id = username if push_to_user_id is None else push_to_user_id
+    save_to = f"{save_to_user_id}/{MODEL_NAME}-{quant}"
     untied_model_path = 'f"{{MODEL_NAME}}-untied-weights"'
     is_mobile = quant == "INT8-INT4"
     quantized_model_id = save_to
@@ -758,7 +769,8 @@ def quantize_and_upload(
     if push_to_hub:
         quantized_model.push_to_hub(quantized_model_id, safe_serialization=False)
         tokenizer.push_to_hub(quantized_model_id)
-        card.push_to_hub(quantized_model_id)
+        if populate_model_card_template:
+            card.push_to_hub(quantized_model_id)
     else:
         quantized_model.save_pretrained(quantized_model_id, safe_serialization=False)
         tokenizer.save_pretrained(quantized_model_id)
@@ -827,6 +839,18 @@ def quantize_and_upload(
         default=False,
         help="Flag to indicate whether push to huggingface hub or not",
     )
+    parser.add_argument(
+        "--push_to_user_id",
+        type=str,
+        default=None,
+        help="The user_id to use for pushing the quantized model, only used when --push_to_hub is set",
+    )
+    parser.add_argument(
+        "--populate_model_card_template",
+        action="store_true",
+        default=False,
+        help="Flag to indicate whether push model card to huggingface hub or not",
+    )
     args = parser.parse_args()
     quantize_and_upload(
         args.model_id,
@@ -835,4 +859,6 @@ def quantize_and_upload(
         args.calibration_limit,
         args.max_seq_length,
         args.push_to_hub,
+        args.push_to_user_id,
+        args.populate_model_card_template,
     )
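Putting the new arguments together, a direct invocation of the script with both flags would look roughly like this (model id illustrative):

```Shell
python quantize_and_upload.py --model_id Qwen/Qwen3-8B --quant FP8 \
  --push_to_hub --push_to_user_id pytorch --populate_model_card_template
```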

.github/scripts/torchao_model_releases/release.sh

Lines changed: 14 additions & 8 deletions
@@ -6,15 +6,13 @@
 
 #!/bin/bash
 
-# Example uses
-# release with default quant options (FP8, INT4, INT8-INT4)
-# ./release.sh --model_id Qwen/Qwen3-8B
-# release a custom set of quant options
-# ./release.sh --model_id Qwen/Qwen3-8B --quants INT4 FP8
+# see README.md for instructions
 
 # Default quantization options
 default_quants=("FP8" "INT4" "INT8-INT4")
 push_to_hub=""
+push_to_user_id=""
+populate_model_card_template=""
 # Parse arguments
 while [[ $# -gt 0 ]]; do
   case "$1" in
@@ -34,6 +32,14 @@ while [[ $# -gt 0 ]]; do
       push_to_hub="--push_to_hub"
       shift
       ;;
+    --push_to_user_id)
+      push_to_user_id=("--push_to_user_id $2")
+      shift 2
+      ;;
+    --populate_model_card_template)
+      populate_model_card_template="--populate_model_card_template"
+      shift
+      ;;
     *)
       echo "Unknown option: $1"
       exit 1
@@ -43,14 +49,14 @@ done
 # Use default quants if none specified
 if [[ -z "$model_id" ]]; then
   echo "Error: --model_id is required"
-  echo "Usage: $0 --model_id <model_id> [--quants <quant1> [quant2 ...]] [--push_to_hub]"
+  echo "Usage: $0 --model_id <model_id> [--quants <quant1> [quant2 ...]] [--push_to_hub] [--push_to_user_id <push_to_user_id>] [--populate_model_card_template]"
   exit 1
 fi
 if [[ ${#quants[@]} -eq 0 ]]; then
   quants=("${default_quants[@]}")
 fi
 # Run the python command for each quantization option
 for quant in "${quants[@]}"; do
-  echo "Running: python quantize_and_upload.py --model_id $model_id --quant $quant $push_to_hub"
-  python quantize_and_upload.py --model_id "$model_id" --quant "$quant" $push_to_hub
+  echo "Running: python quantize_and_upload.py --model_id $model_id --quant $quant $push_to_hub $push_to_user_id $populate_model_card_template"
+  python quantize_and_upload.py --model_id "$model_id" --quant "$quant" $push_to_hub $push_to_user_id $populate_model_card_template
 done
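Per the updated usage string, an initial release that also populates the model card template would be invoked roughly as follows (model id illustrative):

```Shell
./release.sh --model_id Qwen/Qwen3-8B --quants FP8 INT4 --push_to_hub --populate_model_card_template
```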

.github/workflows/torchao_experimental_test.yml renamed to .github/workflows/regression_test_aarch64.yml

Lines changed: 6 additions & 4 deletions
@@ -1,4 +1,4 @@
-name: Run TorchAO Experimental Tests
+name: Run Regression Tests (aarch64)
 
 on:
   push:
@@ -44,17 +44,19 @@ jobs:
       if: runner.os == 'Linux'
       run: |
         conda activate venv
+        pip install coremltools
         pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cpu --force-reinstall
         pip install -r dev-requirements.txt
         BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP pip install .
     - name: Run python tests
       run: |
         conda activate venv
         pytest -s test/quantization/test_int8_dynamic_activation_intx_weight_config_v1.py
-        pytest -s torchao/experimental/tests/test_embedding_xbit_quantizer.py
-        pytest -s torchao/experimental/tests/test_quant_passes.py
-        pytest -s test/prototype/test_dynamic_activation_lut.py
         pytest -s test/quantization/quantize_/workflows/intx/test_intx_opaque_tensor.py
+        pytest -s test/prototype/test_embedding.py
+        pytest -s test/prototype/test_int8_lut_tensor.py
+        pytest -s test/prototype/test_groupwise_lowbit_weight_lut_quantizer.py
+        pytest -s test/prototype/test_parq.py
     - name: torchao/csrc/cpu - build and run C++ tests
       if: runner.os == 'macOS'
       run: |

benchmarks/microbenchmarks/utils.py

Lines changed: 1 addition & 3 deletions
@@ -260,7 +260,6 @@ def string_to_config(
             "int8_dynamic_activation_intx_weight requires using high_precision_dtype=torch.float32"
         )
 
-        from torchao.dtypes import PackedLinearInt8DynamicActivationIntxWeightLayout
         from torchao.quantization.granularity import PerAxis, PerGroup
         from torchao.quantization.quant_api import (
             Int8DynamicActivationIntxWeightConfig,
@@ -278,8 +277,7 @@ def string_to_config(
             weight_mapping_type=MappingType.ASYMMETRIC
             if is_asymmetric
             else MappingType.SYMMETRIC,
-            weight_scale_dtype=torch.bfloat16,
-            layout=PackedLinearInt8DynamicActivationIntxWeightLayout(),
+            intx_packing_format="opaque_torchao_auto",
        )
    elif "float8wo" in quantization:
        return Float8WeightOnlyConfig()
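For reference, a minimal sketch of constructing the updated config outside the benchmark harness, assuming the same torchao entry points the file already imports (granularity and layer sizes illustrative):

```python
import torch
from torchao.quantization import quantize_
from torchao.quantization.granularity import PerGroup
from torchao.quantization.quant_api import Int8DynamicActivationIntxWeightConfig

# The string-valued packing format replaces the old
# PackedLinearInt8DynamicActivationIntxWeightLayout layout argument.
config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    intx_packing_format="opaque_torchao_auto",
)

# The benchmark path requires high_precision_dtype=torch.float32, so quantize a float32 module.
model = torch.nn.Sequential(torch.nn.Linear(128, 256, dtype=torch.float32))
quantize_(model, config)
```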

benchmarks/prototype/moe_training/benchmark_2d_3d_grouped_gemms.py

Lines changed: 3 additions & 3 deletions
@@ -18,7 +18,7 @@
 from torchao.float8.config import ScalingGranularity
 from torchao.float8.float8_utils import tensor_to_scale, to_fp8_saturated
 from torchao.prototype.moe_training.kernels.mxfp8_blocked_scales import (
-    torch_to_blocked_per_group_2d,
+    torch_to_blocked_2d_M_groups,
     torch_to_blocked_per_group_3d,
 )
 from torchao.prototype.moe_training.utils import generate_jagged_offs
@@ -230,8 +230,8 @@ def bench_mxfp8_grouped_mm(A, B_t, offs, block_size=32) -> float:
 
     # Convert scales for each group to blocked format.
     Mg, K = A_fp8.shape
-    A_scales_blocked, starting_row_after_padding = torch_to_blocked_per_group_2d(
-        A_scales, offs, Mg, K
+    A_scales_blocked, starting_row_after_padding = torch_to_blocked_2d_M_groups(
+        A_scales, offs, K
     )
     B_scales_blocked = torch_to_blocked_per_group_3d(B_scales)

benchmarks/prototype/moe_training/benchmark_2d_blocked_swizzle_scale_kernels.py

Lines changed: 8 additions & 8 deletions
@@ -15,9 +15,9 @@
 
 from benchmarks.utils import benchmark_cuda_function_in_microseconds
 from torchao.prototype.moe_training.kernels.mxfp8_blocked_scales import (
-    compute_per_group_blocked_scale_offsets,
-    torch_to_blocked_per_group_2d,
-    triton_mx_block_rearrange_per_group_2d,
+    compute_blocked_scale_offsets_for_M_groups,
+    torch_to_blocked_2d_M_groups,
+    triton_mx_block_rearrange_2d_M_groups,
 )
 from torchao.prototype.moe_training.utils import generate_jagged_offs
 
@@ -82,9 +82,9 @@ def run_experiment(config: ExperimentConfig) -> ExperimentResult:
     input_group_offsets = generate_jagged_offs(num_groups, Mg, multiple_of=32)
 
     # bench torch
-    compiled_run_torch = torch.compile(torch_to_blocked_per_group_2d)
+    compiled_run_torch = torch.compile(torch_to_blocked_2d_M_groups)
     torch_out_scales, torch_group_offs = compiled_run_torch(
-        input_tensor, input_group_offsets, Mg, K
+        input_tensor, input_group_offsets, K
     )
     torch_time_us = benchmark_cuda_function_in_microseconds(
         compiled_run_torch,
@@ -95,16 +95,16 @@ def run_experiment(config: ExperimentConfig) -> ExperimentResult:
     )
 
     # bench triton
-    _, output_group_offsets = compute_per_group_blocked_scale_offsets(
+    _, output_group_offsets = compute_blocked_scale_offsets_for_M_groups(
         input_group_offsets
     )
-    triton_out_scales = triton_mx_block_rearrange_per_group_2d(
+    triton_out_scales = triton_mx_block_rearrange_2d_M_groups(
         input_tensor,
         input_group_offsets,
         output_group_offsets,
     )
     triton_time_us = benchmark_cuda_function_in_microseconds(
-        triton_mx_block_rearrange_per_group_2d,
+        triton_mx_block_rearrange_2d_M_groups,
         input_tensor,
         input_group_offsets,
         output_group_offsets,

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@ for an overall introduction to the library and recent highlight and updates.
    finetuning
    serving
    torchao_vllm_integration
+   torchao_hf_integration
    serialization
    static_quantization
    subclass_basic
