IFU-master-2023-03-01 #1194


Merged
merged 1,352 commits into from Mar 21, 2023

Conversation

jithunnair-amd (Collaborator)

janeyx99 and others added 30 commits February 22, 2023 04:47
Rolling back the default change for Adam and rectifying the docs to reflect that AdamW never defaulted to fused.

Since our fused implementations are relatively new, let's give them a longer bake-in time before flipping the switch for every user.
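For context, a minimal sketch of what opting in looks like with the rolled-back default (values are illustrative; `fused` is the flag this change stops enabling by default for Adam):

```python
import torch

# With the default rolled back, the fused implementation is opt-in.
model = torch.nn.Linear(8, 8).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)  # explicit opt-in
# AdamW takes the same flag and, per the doc fix, never defaulted to fused.
```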

Pull Request resolved: pytorch#95241
Approved by: https://github.com/ngimel
Running an operator registered in Python that returns a SymInt results in the following error:
```
RuntimeError: Unable to cast Python instance of type <class 'torch.SymInt'> to C++ type 'long'
```

Two things interact to trigger the issue:
- We use a boxed kernel here. For boxed kernels, we need to convert the py::object to an IValue in torch/csrc/autograd/python_variable.cpp pushPyOutToStack.
- In the schema parsing code in torch/csrc/jit/frontend/schema_type_parser.cpp SchemaTypeParser::parseFakeAndRealType, if a SymInt is found, we register an Int type instead (not sure why we do this) and register SymInt as the real type.

The result is that we convert a SymInt to an int in pushPyOutToStack, which causes the issue.

The fix is to use the real type when converting the py::object to an IValue.

By the way, registering the same op using the C++ API does not trigger the issue.
```
TORCH_LIBRARY(clib, m) {
  m.def("sqsum(SymInt a, SymInt b) -> SymInt", [](SymInt a, SymInt b) -> SymInt {
    return a * a + b * b;
  });
}
```
The reason is that the kernel registered in C++ is an unboxed kernel, so it does not hit the code path above that converts a py::object to an IValue.
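For comparison, a rough sketch of registering the equivalent op from Python via `torch.library` (the namespace and dispatch key here are illustrative, not taken from the PR); registrations like this go through the boxed py::object-to-IValue path described above:

```python
import torch
from torch.library import Library

lib = Library("clib_py", "DEF")  # hypothetical namespace, for illustration only
lib.define("sqsum(SymInt a, SymInt b) -> SymInt")

def sqsum(a, b):
    # Under dynamic shapes, a and b may be torch.SymInt; the boxed kernel
    # must push the result back as a SymInt rather than coercing it to int.
    return a * a + b * b

lib.impl("sqsum", sqsum, "CompositeExplicitAutograd")
```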

Pull Request resolved: pytorch#95240
Approved by: https://github.com/larryliu0820, https://github.com/ezyang
Simply pipes the arg to the existing torch.cuda API by the same name.

Useful for locally debugging OOMs that happened on a smaller GPU.

Pull Request resolved: pytorch#95260
Approved by: https://github.com/davidberard98
Summary: Attempt two at enabling search of the global/local cache by default, regardless of `max_autotune`. The main problem is that Triton template generation seems to be broken in some cases for CI tests (maybe dynamic shapes), but this is going to take more time to figure out. For now, we can just cancel template generation instead of raising an assertion error, and filter out those failed templates.

Test Plan: sandcastle + CI

Differential Revision: D43424922

Pull Request resolved: pytorch#95134
Approved by: https://github.com/jansel
…4970)"

This reverts commit 5d2eb6d.

Reverted pytorch#94970 on behalf of https://github.com/jeanschmidt due to Requires codev to land internal test changes
- Give warnings when converting int64 for reduction ops
- Use the cast tensor for reduction sum on trace
- Unblock trace from running
Pull Request resolved: pytorch#95231
Approved by: https://github.com/razarmehr
…ch#95078)

- Fixes convolution crashes in backward with weights
- Removes unnecessary contiguous calls
Pull Request resolved: pytorch#95078
Approved by: https://github.com/kulinseth
This fixes the issue with `__rdiv__` for float16.
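A hedged illustration of the reflected-division path involved (the exact failing case in the original issue may differ):

```python
import torch

# A Python scalar on the left dispatches to Tensor.__rtruediv__
# (the legacy __rdiv__ name), here exercised with float16.
x = torch.tensor([2.0, 4.0], dtype=torch.float16)
y = 1.0 / x
print(y.dtype)  # torch.float16
```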
Pull Request resolved: pytorch#94952
Approved by: https://github.com/kulinseth
Fixes formatting so that the merge rule shows up on a different line than the "Raised by" text

Follow up to pytorch#94932

New version
<img width="433" alt="image" src="https://user-images.githubusercontent.com/4468967/220441349-ac99096d-590a-42c1-b995-4a23b2d9b810.png">
Pull Request resolved: pytorch#95234
Approved by: https://github.com/huydhn
Remove the MPS-specialized path in BCE backward, as the `logit` op has been implemented for MPS.

Pull Request resolved: pytorch#95220
Approved by: https://github.com/soulitzer
Summary: The NCCL backend does not support `tag`, as mentioned in pytorch#94819. Adding a note about it in the documentation.

Example:

<img width="888" alt="image" src="https://user-images.githubusercontent.com/14858254/220464900-094c8063-797a-4bdc-8e25-657f17593fe9.png">

Differential Revision: D43475756

Pull Request resolved: pytorch#95236
Approved by: https://github.com/awgu, https://github.com/rohan-varma
…ch#95245)

Currently, the transformer creates proxy objects directly for the get_attr method, and node.meta is lost in that step. To preserve it, we invoke tracer.create_proxy; metadata is copied over in tracer.create_proxy and tracer.create_node.
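A minimal sketch of the direction of the fix (names follow `torch.fx`, but this is illustrative rather than the exact diff):

```python
import torch.fx as fx

class MetaPreservingTransformer(fx.Transformer):
    def get_attr(self, target, args, kwargs):
        # Route get_attr through the tracer instead of constructing a Proxy
        # directly, so node.meta can be copied onto the new node inside
        # tracer.create_proxy / tracer.create_node.
        return self.tracer.create_proxy("get_attr", target, args, kwargs)
```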

Pull Request resolved: pytorch#95245
Approved by: https://github.com/SherlockNoMad, https://github.com/tugsbayasgalan
I am still reading Dynamo source code...

This is an easy PR to simplify `Source.is_nn_module()` to reuse `GuardSource.is_nn_module()` instead of having the `in (...)` check implemented twice. While simplifying that, I thought I might as well add some type annotations for `Source` methods.
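A hedged sketch of the simplification (names mirror `torch._dynamo.source`, but this is illustrative, not the actual diff):

```python
class GuardSource:
    def is_nn_module(self) -> bool:
        # Single place that owns the membership check.
        ...

class Source:
    def guard_source(self) -> GuardSource:
        raise NotImplementedError

    def is_nn_module(self) -> bool:
        # Delegate instead of duplicating the `in (...)` check here.
        return self.guard_source().is_nn_module()
```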
Pull Request resolved: pytorch#95292
Approved by: https://github.com/ezyang
This handles disabling masks when numel is a multiple of BLOCK.
It currently introduces a performance regression, but the Triton code it
generates does not seem to have any issues: all the change does is remove
xmask from loads/stores in cases where it can safely be removed. The
regression therefore seems to come from some issue in the Triton optimizer.
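A tiny sketch of the condition being keyed on (my paraphrase, not code from the PR):

```python
def needs_xmask(xnumel: int, xblock: int) -> bool:
    # When the iteration space is an exact multiple of the block size,
    # every program instance is full, so the load/store mask can be dropped.
    return xnumel % xblock != 0

assert needs_xmask(1000, 128) is True   # partial last block: keep the mask
assert needs_xmask(1024, 128) is False  # exact multiple: mask removable
```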

FWIW, if you try this change with current Triton master (instead of the
pinned version) it does _not_ cause a performance regression.
However, upgrading to Triton master by itself already causes significant
performance regressions, so it's not an option to just bump up the pin.

I'm going to leave this PR open until we manage to move the Triton pin
past the big refactoring. Once we do that, I will check whether it still
causes a performance regression.

UPDATE:

The Triton pin has been moved and I retried this PR. As expected, there's no longer a performance regression for hf_Bert:

```
tspin python benchmarks/dynamo/torchbench.py  --performance  --backend inductor --float16 --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt) --only hf_Bert -n 5 --diff-branch viable/strict 2> err
batch size: 16
cuda train hf_Bert                             numel_BLOCK                1.175x p=0.00
batch size: 16
cuda train hf_Bert                             viable/strict              1.161x p=0.00
```
Re-opening this; it should be okay to merge now, I expect.

Pull Request resolved: pytorch#92749
Approved by: https://github.com/jansel
Summary:
bypass-github-export-checks

Use `dinfo.name` instead of `repr(dinfo)`, as initial results have shown that `dinfo.total_memory` may unexpectedly fluctuate.
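Assuming `dinfo` is the CUDA device properties object (my assumption, not stated in the PR), the change amounts to something like:

```python
import torch

props = torch.cuda.get_device_properties(0)
cache_key = props.name  # stable device name, e.g. "NVIDIA A100-SXM4-40GB"
# avoid: repr(props), whose total_memory component was seen to fluctuate
```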

Test Plan: sandcastle + CI

Differential Revision: D43503558

Pull Request resolved: pytorch#95302
Approved by: https://github.com/bertmaher
…ing (pytorch#95249)

Summary: This change adds the input shape when CoreML throws an error.

Test Plan: testMCSModelInvalidInputShape tests that the assert throws when invalid input shapes are provided.

Differential Revision: D43449112

Pull Request resolved: pytorch#95249
Approved by: https://github.com/mcr229
)

This PR adds back some explanation for why we have the heuristic to only register the post-backward hook on the first forward in the case of multiple forwards.
Pull Request resolved: pytorch#95326
Approved by: https://github.com/fegin
Temporary fix for pytorch#95312.
In Triton, 1 warp computes a 16x16 tile of output, so for a 32x32 block we only need 4 warps. 8 warps causes an IMA (illegal memory access), which is a bug, but it's not a good config anyway.
Triton main is supposed to handle these pathological configs better, but we are not on main yet.
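The arithmetic behind the config choice, as a quick sketch (my paraphrase of the explanation above):

```python
def warps_for_block(block_m: int, block_n: int, tile: int = 16) -> int:
    # One warp covers a 16x16 output tile.
    return (block_m // tile) * (block_n // tile)

assert warps_for_block(32, 32) == 4  # 8 warps over-subscribes a 32x32 block
```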

Pull Request resolved: pytorch#95339
Approved by: https://github.com/ezyang, https://github.com/Chillee
jjsjann123 and others added 8 commits March 1, 2023 19:01
… CI node (pytorch#95402)

Fixes pytorch#95155, which breaks CI such that no nvfuser Python tests are run on CI nodes.

Thanks to @davidberard98 for noticing this.

Pull Request resolved: pytorch#95402
Approved by: https://github.com/davidberard98
…torch#95200)

Changes:

- => this PR: pytorch#95200

1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.

- pytorch#95267

3. Use `NamedTuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:

    `namedtuple + __annotations__`:

    ```python
    PackedSequence_ = namedtuple('PackedSequence_',
                                 ['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])

    # type annotation for PackedSequence_ to make it compatible with TorchScript
    PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
                                       'sorted_indices': Optional[torch.Tensor],
                                       'unsorted_indices': Optional[torch.Tensor]}
    ```

    `NamedTuple` (Python 3.6+):

    ```python
    class PackedSequence_(NamedTuple):
        data: torch.Tensor
        batch_sizes: torch.Tensor
        sorted_indices: Optional[torch.Tensor]
        unsorted_indices: Optional[torch.Tensor]
    ```

- pytorch#95268

4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: pytorch#95200
Approved by: https://github.com/janeyx99
…e annotated `NamedTuple` (pytorch#95267)

Changes:

- pytorch#95200

1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.

- => this PR: pytorch#95267

3. Use `NamedTuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:

    `namedtuple + __annotations__`:

    ```python
    PackedSequence_ = namedtuple('PackedSequence_',
                                 ['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])

    # type annotation for PackedSequence_ to make it compatible with TorchScript
    PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
                                       'sorted_indices': Optional[torch.Tensor],
                                       'unsorted_indices': Optional[torch.Tensor]}
    ```

    `NamedTuple` (Python 3.6+):

    ```python
    class PackedSequence_(NamedTuple):
        data: torch.Tensor
        batch_sizes: torch.Tensor
        sorted_indices: Optional[torch.Tensor]
        unsorted_indices: Optional[torch.Tensor]
    ```

- pytorch#95268

4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: pytorch#95267
Approved by: https://github.com/janeyx99
Summary: xcit_large_24_p8_224 occasionally hits TIMEOUT on CI. Bump up
the limit to reduce flakiness.

Pull Request resolved: pytorch#95787
Approved by: https://github.com/ezyang, https://github.com/ZainRizvi
Continuation of PR pytorch#93153, where I implemented logaddexp for complex but didn't expose it through `torch.logaddexp`. This PR exposes the complex logaddexp via `torch.logaddexp`.
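A quick hedged check of the exposed behavior (assumes a build that includes this change):

```python
import torch

# logaddexp(a, b) == log(exp(a) + exp(b)), now also for complex inputs.
a = torch.tensor([1.0 + 2.0j], dtype=torch.complex64)
b = torch.tensor([0.5 - 1.0j], dtype=torch.complex64)
torch.testing.assert_close(torch.logaddexp(a, b),
                           torch.log(torch.exp(a) + torch.exp(b)))
```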

Pull Request resolved: pytorch#95717
Approved by: https://github.com/lezcano
Fixes pytorch#88098

This is the rebased branch used to retry merging the reverted PR pytorch#94597.

Pull Request resolved: pytorch#94899
Approved by: https://github.com/kit1980
@jithunnair-amd (Collaborator Author)

jenkins retest this please

3 similar comments

@jithunnair-amd (Collaborator Author) commented Mar 7, 2023

http://rocmhead:8080/job/pytorch/job/pytorch-ci/625/:

  • Used .ci scripts instead of .jenkins.
  • distributed-1 failed with timeout: bad node ixt-hq-15?
  • test-1 failed with test_ops and test_cuda failures:

ERROR [0.008s]: test_torch_manual_seed_seeds_cuda_devices (__main__.TestCuda)
FAILED test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_float_power_executor_aten_cuda_float32 - AssertionError: tensor(False, device='cuda:0') is not true : Reference result was farther (3708.725173985593) from the precise computation than the torch result was (0.0)!
FAILED test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_float_power_executor_aten_cuda_float64 - AssertionError: Tensor-likes are not close!

@jithunnair-amd (Collaborator Author)

jenkins retest this please

@jithunnair-amd (Collaborator Author) commented Mar 7, 2023

http://rocmhead:8080/job/pytorch/job/pytorch-ci/626/:

  • distributed-2 failed with timeout for distributed/algorithms/quantization/test_quantization.py: bad node ixt-hq-15?
  • test-1 failed with timeout for test_decomp: bad node rocm-framework-test-0.amd.com?
  • test-2 failed with profiler/test_memory_profiler:

FAIL [0.156s]: test_memory_timeline (__main__.TestMemoryProfilerE2E) -
Mar 07 08:38:24 AssertionError: ' [1186 chars] destroy TEMPORARY [4673 chars]4 kB' != ' [1186 chars] create TEMPORARY [4673 chars]8 kB'
Mar 07 08:38:24 Diff is 11636 characters long. Set self.maxDiff to None to see it. : To accept the new output, re-run test with envvar EXPECTTEST_ACCEPT=1 (we recommend staging/committing your changes before doing this)

@jithunnair-amd (Collaborator Author)

jenkins retest this please (with rocAutomation scripts updated to pip install requirements-ci.txt)

@jithunnair-amd (Collaborator Author) commented Mar 18, 2023

arrgh: http://rocmhead:8080/job/pytorch/job/pytorch-ci/641/:

  • test-1 failed with timeout due to 5 hrs spent doing just `git fetch` 🙄
03:20:28  > git fetch --no-tags --force --progress -- https://github.com/ROCmSoftwarePlatform/pytorch +refs/heads/master:refs/remotes/origin/master +refs/pull/1194/*:refs/remotes/origin/pr/1194/* # timeout=600
08:19:48  > git config remote.origin.url https://github.com/ROCmSoftwarePlatform/pytorch # timeout=600
  • test-2 failed with profiler/test_memory_profiler. Should be fixed via cherry-pick: b17af81
  • distributed-2 failed with timeout for distributed/test_distributed_spawn.py: bad node ixt-sjc2-47? Taken offline

@jithunnair-amd (Collaborator Author) commented Mar 18, 2023

http://rocmhead:8080/job/pytorch/job/pytorch-ci-multibranch/job/PR-1194/1/display/redirect:

  • test-1: test_torch_manual_seed_seeds_cuda_devices (__main__.TestCuda) failure seen before. However, since the test itself didn't fail on local reproduction attempts, I cannot just skip it. Leaving it unskipped, to be investigated via issue https://github.com/ROCmSoftwarePlatform/frameworks-internal/issues/3869
  • test-2: test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_det_singular_cuda_complex128 failed with:
Mar 18 02:57:39 torch.autograd.gradcheck.GradcheckError: While considering the imaginary part of complex inputs only, Jacobian computed with forward mode mismatch for output 0 with respect to input 0,
Mar 18 02:57:39 numerical:tensor([0.0969+0.3037j], device='cuda:0', dtype=torch.complex128)
Mar 18 02:57:39 analytical:tensor([0.+0.j], device='cuda:0', dtype=torch.complex128,
Mar 18 02:57:39        grad_fn=<CopyBackwards>)

Re-disabled this test by reopening issue pytorch#93045, which got closed automatically by a bot since no failures were seen upstream in 200 runs. However, local testing reproduces these failures consistently, per @jaglinux's and my observations.

  • distributed-1 and -2 passed (hallelujah!), indicating that earlier failures are most likely due to node issues

@jithunnair-amd (Collaborator Author)

jenkins notest pytorch
retest apex/torchvision/deepspeed please

1 similar comment

@jithunnair-amd (Collaborator Author) commented Mar 21, 2023

http://rocmhead:8080/job/pytorch/job/pytorch-ci/653/
4 torchvision tests failed:

FAILED test/test_transforms.py::test_functional_deprecation_warning[True-from torchvision.transforms import functional_pil]
FAILED test/test_transforms.py::test_functional_deprecation_warning[True-from torchvision.transforms import functional_tensor]
FAILED test/test_transforms.py::test_functional_deprecation_warning[True-from torchvision.transforms.functional_tensor import resize]
FAILED test/test_transforms.py::test_functional_deprecation_warning[True-from torchvision.transforms.functional_pil import resize]

These tests are skipped upstream though, e.g. https://ossci-raw-job-status.s3.amazonaws.com/log/pytorch/vision/12126909908

2023-03-20T11:50:16.2076760Z test/test_transforms.py::test_functional_deprecation_warning[True-from torchvision.transforms import functional_pil] SKIPPED [ 17%]
2023-03-20T11:50:16.2083026Z test/test_transforms.py::test_functional_deprecation_warning[True-from torchvision.transforms import functional_tensor] SKIPPED [ 17%]
2023-03-20T11:50:16.2090372Z test/test_transforms.py::test_functional_deprecation_warning[True-from torchvision.transforms.functional_tensor import resize] SKIPPED [ 17%]
2023-03-20T11:50:16.2095966Z test/test_transforms.py::test_functional_deprecation_warning[True-from torchvision.transforms.functional_pil import resize] SKIPPED [ 17%]

Need to figure out why these tests are not being skipped in our runs, but this doesn't seem to be a blocker for the IFU. cc @lcskrishna, any ideas?

jithunnair-amd merged commit de542d1 into master Mar 21, 2023
@lcskrishna

@jithunnair-amd In the latest release of torchvision, the new transforms v2 API was introduced, and a few of the old APIs like torchvision.transforms.functional_pil are deprecated and will become private from 0.17.
These tests are probably coming from those. Will look into the code and see why they are not being skipped for us.

@jithunnair-amd (Collaborator Author)

> @jithunnair-amd In the latest release of torchvision, the new transforms v2 API was introduced, and a few of the old APIs like torchvision.transforms.functional_pil are deprecated and will become private from 0.17. These tests are probably coming from those. Will look into the code and see why they are not being skipped for us.

pytorch/vision#7501
