support eval of float8_a1x128_w128x128 #3269
Conversation
Summary:
Adds support for the new float8 scaling recipe to the official eval
scripts used to generate the accuracy numbers in the README.
For now, I am using this as a smoke test that the scaling works on a
real model - it does. We can add official benchmark results after we
hook up the cuBLAS binding on H100, which should make the user
experience of running evals a lot better.
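For context, the recipe name encodes the scaling granularity: `a1x128` gives activations one float8 scale per 1x128 block, and `w128x128` gives weights one scale per 128x128 block. A minimal PyTorch sketch of that blockwise quantization (illustrative only, not this PR's implementation):
```
import torch

def quantize_per_block(x: torch.Tensor, block_rows: int, block_cols: int):
    # One absmax-based scale per (block_rows x block_cols) tile.
    # Illustrative only; torchao's real kernels differ.
    M, K = x.shape
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    tiles = x.reshape(M // block_rows, block_rows, K // block_cols, block_cols)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True)
    scale = amax.clamp(min=1e-12) / fp8_max
    x_fp8 = (tiles / scale).to(torch.float8_e4m3fn)
    return x_fp8.reshape(M, K), scale.squeeze()

# activations: one scale per 1x128 block; weights: one scale per 128x128 block
a = torch.randn(256, 512)
w = torch.randn(512, 512)
a_fp8, a_scale = quantize_per_block(a, 1, 128)    # a_scale shape: (256, 4)
w_fp8, w_scale = quantize_per_block(w, 128, 128)  # w_scale shape: (4, 4)
```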
Test Plan:
Smoke test on Llama-3.1-8B; accuracy looks good:
```
# download checkpoint
with-proxy python scripts/download.py --hf_token {token} --repo_id meta-llama/Meta-Llama-3.1-8B
# prepare checkpoint
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/meta-llama/Meta-Llama-3.1-8B
# run bf16 eval on a single task
with-proxy time python torchao/_models/llama/eval.py --checkpoint_path checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth --tasks 'winogrande'
...
winogrande: {'alias': 'winogrande', 'acc,none': 0.7426992896606156, 'acc_stderr,none': 0.012285989618865697}
# run float8 eval on the same task
with-proxy time python torchao/_models/llama/eval.py --checkpoint_path checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth --tasks 'winogrande' --quantization float8_a1x128_w128x128 --compile
...
winogrande: {'alias': 'winogrande', 'acc,none': 0.7419100236779794, 'acc_stderr,none': 0.012298278833972477}
```
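Under the hood, the eval script maps the `--quantization` string to a torchao config (see the diff excerpt later in this thread). A hedged sketch of that dispatch follows; `quantize_` and `Float8DynamicActivationFloat8WeightConfig` are real torchao APIs, but the `PerBlock` granularity spelling below is an assumption, not confirmed by this PR:
```
# Sketch of how --quantization float8_a1x128_w128x128 could be dispatched.
# The PerBlock granularity spelling is an assumption.
from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,
    quantize_,
)
from torchao.quantization.granularity import PerBlock

def maybe_quantize(model, quantization: str):
    if quantization == "float8_a1x128_w128x128":
        quantize_(
            model,
            Float8DynamicActivationFloat8WeightConfig(
                # 1x128 activation blocks, 128x128 weight blocks
                granularity=[PerBlock((1, 128)), PerBlock((128, 128))],
            ),
        )
    return model
```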
Reviewers:
Subscribers:
Tasks:
Tags:
ghstack-source-id: e87609a
ghstack-comment-id: 3474380821
Pull-Request: #3269
Review context (the new quantization branch in torchao/_models/llama/eval.py):
```
            model,
            Float8DynamicActivationFloat8WeightConfig(granularity=granularity),
        )
    if quantization == "float8_a1x128_w128x128":
```
The evaluation framework for torchao has multiple scripts:
- torchao/_models/llama/eval.py
- benchmarks/_models/eval_hf_models.py, which will need to be cleaned up as part of BE #3289.

For now, I feel the quantization technique should also be added to the benchmarking framework, in string_to_config (ao/benchmarks/microbenchmarks/utils.py, lines 153 to 155 at 01374eb):
```
def string_to_config(
    quantization: Optional[str], sparsity: Optional[str], **kwargs
) -> AOBaseConfig:
```
This will enable float8_a1x128_w128x128 in the torchao benchmarking module and running it on HF models.
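A hedged sketch of that addition, reusing the signature from the excerpt above; the new branch mirrors the eval.py dispatch sketched earlier, and the `PerBlock` spelling remains an assumption:
```
from typing import Optional

from torchao.core.config import AOBaseConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig
from torchao.quantization.granularity import PerBlock

def string_to_config(
    quantization: Optional[str], sparsity: Optional[str], **kwargs
) -> AOBaseConfig:
    # ... existing quantization/sparsity branches ...
    if quantization == "float8_a1x128_w128x128":
        # same recipe as eval.py: 1x128 activation blocks, 128x128 weight blocks
        return Float8DynamicActivationFloat8WeightConfig(
            granularity=[PerBlock((1, 128)), PerBlock((128, 128))],
        )
    # ... remaining handlers ...
```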
Rest, LGTM!