support eval of float8_a1x128_w128x128 #3269
Conversation
Summary:
Adds support for the new float8 scaling recipe to the official eval
scripts used to generate the accuracy numbers in the README.
For now, I am using this as a smoke test that the scaling works on a
real model - it does. We can add official benchmark results after we
hook up the cuBLAS binding on H100, which should make the user
experience of running evals a lot better.
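For context, the recipe name encodes the scaling granularity: `a1x128` gives activations one float8 scale per 1x128 block, and `w128x128` gives weights one scale per 128x128 block. A minimal PyTorch sketch of that blockwise quantization (illustrative only, not this PR's implementation):
```
import torch

def quantize_per_block(x: torch.Tensor, block_rows: int, block_cols: int):
    # One absmax-based scale per (block_rows x block_cols) tile.
    # Illustrative only; torchao's real kernels differ.
    M, K = x.shape
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    tiles = x.reshape(M // block_rows, block_rows, K // block_cols, block_cols)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True)
    scale = amax.clamp(min=1e-12) / fp8_max
    x_fp8 = (tiles / scale).to(torch.float8_e4m3fn)
    return x_fp8.reshape(M, K), scale.squeeze()

# activations: one scale per 1x128 block; weights: one scale per 128x128 block
a = torch.randn(256, 512)
w = torch.randn(512, 512)
a_fp8, a_scale = quantize_per_block(a, 1, 128)    # a_scale shape: (256, 4)
w_fp8, w_scale = quantize_per_block(w, 128, 128)  # w_scale shape: (4, 4)
```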
Test Plan:
Smoke test on Llama-3.1-8B; accuracy looks good:
```
# download checkpoint
with-proxy python scripts/download.py --hf_token {token} --repo_id meta-llama/Meta-Llama-3.1-8B
# prepare checkpoint
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/meta-llama/Meta-Llama-3.1-8B
# run bf16 eval on a single task
with-proxy time python torchao/_models/llama/eval.py --checkpoint_path checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth --tasks 'winogrande'
...
winogrande: {'alias': 'winogrande', 'acc,none': 0.7426992896606156, 'acc_stderr,none': 0.012285989618865697}
# run float8 eval on the same task
with-proxy time python torchao/_models/llama/eval.py --checkpoint_path checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth --tasks 'winogrande' --quantization float8_a1x128_w128x128 --compile
...
winogrande: {'alias': 'winogrande', 'acc,none': 0.7419100236779794, 'acc_stderr,none': 0.012298278833972477}
```
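Under the hood, the eval script maps the `--quantization` string to a torchao config (see the diff excerpt later in this thread). A hedged sketch of that dispatch follows; `quantize_` and `Float8DynamicActivationFloat8WeightConfig` are real torchao APIs, but the `PerBlock` granularity spelling below is an assumption, not confirmed by this PR:
```
# Sketch of how --quantization float8_a1x128_w128x128 could be dispatched.
# The PerBlock granularity spelling is an assumption.
from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,
    quantize_,
)
from torchao.quantization.granularity import PerBlock

def maybe_quantize(model, quantization: str):
    if quantization == "float8_a1x128_w128x128":
        quantize_(
            model,
            Float8DynamicActivationFloat8WeightConfig(
                # 1x128 activation blocks, 128x128 weight blocks
                granularity=[PerBlock((1, 128)), PerBlock((128, 128))],
            ),
        )
    return model
```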
Reviewers:
Subscribers:
Tasks:
Tags:
ghstack-source-id: e87609a
ghstack-comment-id: 3474380821
Pull-Request: #3269
Review context (the new quantization branch in torchao/_models/llama/eval.py):
```
            model,
            Float8DynamicActivationFloat8WeightConfig(granularity=granularity),
        )
    if quantization == "float8_a1x128_w128x128":
```
The evaluation framework for torchao has multiple scripts:
- torchao/_models/llama/eval.py
- benchmarks/_models/eval_hf_models.py, which will need to be cleaned up as part of BE #3289.

For now, I feel the quantization technique should also be added to the benchmarking framework, in string_to_config (ao/benchmarks/microbenchmarks/utils.py, lines 153 to 155 at 01374eb):
```
def string_to_config(
    quantization: Optional[str], sparsity: Optional[str], **kwargs
) -> AOBaseConfig:
```
This will enable float8_a1x128_w128x128 in the torchao benchmarking module and running it on HF models.
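A hedged sketch of that addition, reusing the signature from the excerpt above; the new branch mirrors the eval.py dispatch sketched earlier, and the `PerBlock` spelling remains an assumption:
```
from typing import Optional

from torchao.core.config import AOBaseConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig
from torchao.quantization.granularity import PerBlock

def string_to_config(
    quantization: Optional[str], sparsity: Optional[str], **kwargs
) -> AOBaseConfig:
    # ... existing quantization/sparsity branches ...
    if quantization == "float8_a1x128_w128x128":
        # same recipe as eval.py: 1x128 activation blocks, 128x128 weight blocks
        return Float8DynamicActivationFloat8WeightConfig(
            granularity=[PerBlock((1, 128)), PerBlock((128, 128))],
        )
    # ... remaining handlers ...
```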
Rest, LGTM!