Extend Transformers Trainer Class to Enable CPU AMP and Integrate Intel Extension for PyTorch #17138
Conversation
The documentation is not available anymore as the PR was closed or merged.
cc @stas00
Thank you for the detailed proposal and the implementation, @jianan-gu!
Other than the few comments/suggestions I left in the code, the next step is adding tests that exercise this new feature.
The failing `run_tests_torch_and_flax` CI job has a flaky test - please ignore it.
Like #17153, I'm not entirely sure if this is best put here or in Optimum, so I would like to hear back from @mfuntowicz and @LysandreJik before taking a deeper dive in the PR and review :-)
Hi @mfuntowicz @LysandreJik @sgugger, any suggestions about this PR? We are happy to discuss any concerns here. Currently, BF16 AMP is already supported for GPU in Transformers, and with this PR we extend that to the CPU side for both inference and training. Furthermore, users can get performance benefits from IPEX with just one argument, `--use_ipex`. Thanks.
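For illustration, a minimal sketch of the intended usage (my own example, not code from the PR; the output directory is a placeholder):

```python
# Sketch: enable CPU AMP (bf16) plus IPEX through TrainingArguments.
# `use_ipex` is the new flag this PR adds; `bf16` and `no_cuda` already exist.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/tmp/out",  # placeholder
    no_cuda=True,           # run on CPU
    bf16=True,              # CPU AMP with bfloat16
    use_ipex=True,          # apply IPEX optimizations when it is installed
)
```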
Thanks for working on this. Before merging this we would need to have some tests of the feature. I also have a few comments on the PR.
src/transformers/trainer.py
```python
if is_ipex_available() and "--use_ipex" in sys.argv:
    import intel_extension_for_pytorch as ipex
```
We can't rely on `sys.argv` as the arguments may be passed to the `Trainer` class as `TrainingArguments`. This import should be done inside the `Trainer` as needed.
Got it - I have moved this import inside `ipex_optimize_model`.
```python
if version.parse(torch.__version__) >= version.parse("1.6"):
    _is_torch_generator_available = True
    _is_native_amp_available = True
    from torch.cuda.amp import autocast
```
Why is this import removed?
Because there are two kinds of autocast, one for CPU and one for CUDA. Since torch is already imported, we directly use `torch.cpu.amp.autocast` and `torch.cuda.amp.autocast` in the autocast context manager.
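For illustration, the dispatch could look roughly like this (a sketch assuming a PyTorch version that has both autocast variants; not the PR's exact code):

```python
import torch

def autocast_context(device_type: str, dtype: torch.dtype):
    # Select the autocast implementation that matches the device.
    if device_type == "cpu":
        return torch.cpu.amp.autocast(dtype=dtype)
    return torch.cuda.amp.autocast(dtype=dtype)

# Usage: run the forward pass under mixed precision on CPU.
# with autocast_context("cpu", torch.bfloat16):
#     outputs = model(**inputs)
```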
Thanks for explaining!
src/transformers/trainer.py
```python
else:
    if args.bf16:
        raise ValueError("Tried to use `bf16` but native amp is not available")
    if args.device == torch.device("cuda"):
```
This test is completely unreliable and will be false if `args.device` is set with `torch.device(0)` or when there are multiple GPUs. It should therefore be removed.
Got it - I have changed it to check whether the device is CPU or something else (single GPU, multiple GPUs, ...), thanks.
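For illustration, comparing `device.type` avoids both pitfalls (a sketch, not the exact diff):

```python
import torch

# torch.device("cuda:0") != torch.device("cuda") because the index differs,
# so comparing whole device objects is fragile; compare the .type field instead.
def is_cpu_device(device: torch.device) -> bool:
    return device.type == "cpu"

assert is_cpu_device(torch.device("cpu"))
assert not is_cpu_device(torch.device(0))  # device(type='cuda', index=0)
```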
```python
    return model

def ipex_optimize_model(self, model, training=False, dtype=torch.float32):
    if not training:
```
Looks like the `ipex` import can go here.
Have made changes accordingly, thanks.
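For reference, a sketch of how the resolved method could look with the local import (assumptions on my part: the real PR's signature and error handling may differ; `ipex.optimize` is IPEX's public entry point, and it returns the optimizer as well when one is passed):

```python
# Method of the Trainer class; assumes `import torch` at module level.
def ipex_optimize_model(self, model, training=False, dtype=torch.float32):
    # Import lazily so transformers keeps working without IPEX installed.
    import intel_extension_for_pytorch as ipex

    if not training:
        model.eval()
        model = ipex.optimize(model, dtype=dtype, inplace=True)
    else:
        if not model.training:
            model.train()
        model, self.optimizer = ipex.optimize(
            model, dtype=dtype, optimizer=self.optimizer, inplace=True
        )
    return model
```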
Co-authored-by: Sylvain Gugger <[email protected]>
The doc looks great - thank you, @jianan-gu!
Added one small grammar fix
Would you like to add a benchmark showing the speed with and without IPEX? Otherwise it's very hard to tell whether it's worth the effort of trying it.
As I mentioned earlier, you'd just run the trainer twice, w/ and w/o ipex, and make sure to add `--skip_memory_metrics 0`; it'll print the speed at which each run finished.
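As a concrete (hypothetical) sketch of that comparison driven from Python - the toy dataset below is made up and far too small for a real benchmark, but it shows the mechanics, and `eval_samples_per_second` is among the metrics the Trainer reports:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
ds = Dataset.from_dict({"text": ["a tiny toy example"] * 64, "label": [0] * 64})
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=128))

def eval_speed(use_ipex: bool) -> float:
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
    args = TrainingArguments(
        output_dir="/tmp/ipex_bench",  # placeholder
        no_cuda=True,                  # stay on CPU
        use_ipex=use_ipex,             # the new flag from this PR
        skip_memory_metrics=False,     # same effect as --skip_memory_metrics 0
        report_to=[],
    )
    trainer = Trainer(model=model, args=args, eval_dataset=ds)
    return trainer.evaluate()["eval_samples_per_second"]

print("w/o ipex:", eval_speed(False))
print("w/ ipex:", eval_speed(True))
```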
If it's not too much trouble, that is. I'm asking since you have probably done a lot of experiments already and have seen the power of this extension, while none of us have.
Co-authored-by: Stas Bekman <[email protected]>
Hi, thanks for your suggestions and reply.
I don't think so. Those performance numbers can't be reproduced by a user, so they are not very practical. As I suggested, there is no need for any major benchmarks; here is a simple eval one:
So 5.296 vs 4.423 eval_samples_per_second on my machine. ~16% speedup - that's a good start.
It would probably be better to test with a slightly larger model, as it is likely to get a better speedup, but this helps to see that it actually works! This is good enough for a doc, and down the road we can surely boost it with better, more in-depth benchmarks. What do you think?
Given your description - is there a way to check that ipex is actually doing anything, other than being installed and import'able? i.e., currently it's just the availability check. But if the user's CPU is not Intel, or not the right kind of Intel, should it tell the user it's not supported and assert? Or is that already the case? I guess mine is the right CPU, so it doesn't fail.
Hi @stas00, the optimization in IPEX is a superset of BF16. It optimizes for CPUs with AVX-512 or above, and it can also functionally work for CPUs with AVX2. So I expect it also functionally works for AMD CPUs (even though we do not benchmark perf on AMD CPUs). IPEX pre-compiles multiple kernels for various CPU ISAs for an op ahead of time and does runtime kernel dispatch according to the underlying CPU ISA capability. IPEX doesn't explicitly check CPU types internally. With this, we are open to discussing whether, and which, additional checks should be added in this PR.
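As a crude, Linux-only sanity check (my own sketch - nothing like this is in the PR), one can confirm that IPEX imports and see which ISA the CPU advertises:

```python
# Check that IPEX is importable and report its version, then look at the
# CPU flags: avx512f is the ISA IPEX optimizes for most aggressively,
# while avx2 is the functional fallback mentioned above.
import intel_extension_for_pytorch as ipex

print("IPEX version:", ipex.__version__)

with open("/proc/cpuinfo") as f:
    flags = f.read()
print("AVX-512F:", "avx512f" in flags)
print("AVX2:", "avx2" in flags)
```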
Sure, got it - we will benchmark example models to collect performance numbers (like the ones you showed above) and go through the internal review process. After that, we will update the doc with those performance numbers.
Perhaps mention it in the doc? Just briefly - as in, AMD CPUs and older Intel CPUs are likely to see better performance under IPEX as well.
I'd be perfectly happy to merge this sooner so as not to press you, if you kindly commit to adding at least some benchmarks later in a new PR - would that be a good arrangement for you?
Hi @stas00
Thanks for the suggestions. We revised the doc accordingly. Could you please review?
Sure, sounds good to us. We will update the perf numbers in a separate PR as soon as they are ready.
Looks good now, @jianan-gu
@sgugger - could you please have another look - just the docs and then it's good to merge. Thank you!
LGTM, thanks a lot for iterating on this!
…el Extension for PyTorch (huggingface#17138)

* init PR
* fix import ipex
* minor fix on bf16
* refine optimizer
* refine args notes
* refine code
* refine ipex optimize args
* refine half_precision_backend
* black format
* isort format
* isort format files
* flake8 format
* doc builder format
* refine codes
* remove jit and optim bits
* black preview format
* Update src/transformers/trainer.py (Co-authored-by: Sylvain Gugger <[email protected]>)
* refine code
* refine notes
* Update src/transformers/trainer.py (Co-authored-by: Sylvain Gugger <[email protected]>)
* Update src/transformers/trainer.py (Co-authored-by: Sylvain Gugger <[email protected]>)
* code refine
* add ipex ut
* add performance cpu doc
* link to the cpu doc from main perf doc
* install ipex into CI's docker
* Update perf_train_cpu.mdx
* Update docs/source/en/perf_train_cpu.mdx (Co-authored-by: Stas Bekman <[email protected]>)
* Update perf_train_cpu.mdx
* Update perf_train_cpu.mdx

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
@jianan-gu, fyi, ipex fails with pt-1.12
we are disabling the tests for now; please let us know when this is fixed, so that we can re-enable them. Thank you!
Hi @stas00. Also, regarding the performance data we discussed in this PR: we were mostly working on the IPEX 1.12 release over the past weeks, and we would like to prepare the data based on the new release. It will take some time, thanks.
I do not understand - where is this implementation added?
@Oxi84, please see: https://huggingface.co/docs/transformers/perf_train_cpu#mixed-precision-with-ipex It's integrated into the HF Trainer.
What does this PR do?
This PR implements feature request #17137.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.