support delayed scaling of weight in float8 all-gather #312
Conversation
Summary:

Adds support for delayed scaling in FSDP2 float8 all-gather. In detail:

1. add `WeightWithDelayedFloat8CastTensor`; note that we don't reuse code with the dynamic version because I'd rather not deal with plumbing optional tensors through dynamo. We can try that in a separate PR later.
2. wire `Float8Linear` to use (1)
3. add weight amax syncing back, since we need it for float8 all-gather
4. add test coverage for eager mode numerics

Next up (in separate PRs) will be training run validation for numerics, and taking a look at performance.

Test Plan:

```
./test/test_everything.sh
```

ghstack-source-id: f1707c1
Pull Request resolved: #312
what are the optional tensors?
```
 all_amax_tensors = torch.cat(
-    fp8_amax_x_tensor_list + fp8_amax_dL_dY_tensor_list
+    fp8_amax_x_tensor_list
+    + fp8_amax_w_tensor_list
+    + fp8_amax_dL_dY_tensor_list
```
should we only do this if we are using fp8 all-gather?
that could make sense, I'd love to see the data to see if this is going to matter for performance. Focusing on numerics for now; I was hoping for performance to be tackled in future PRs.
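For illustration, here is a minimal sketch of the gating idea from this thread; the function and flag names are assumptions for the example, not this PR's actual code.

```python
import torch

# Hypothetical sketch of the suggestion above: only fold the weight amax
# buffers into the batched sync when float8 all-gather (and thus delayed
# scaling of the weight) actually needs them. Names are illustrative.
def build_all_amax_tensor(
    fp8_amax_x_tensor_list,
    fp8_amax_w_tensor_list,
    fp8_amax_dL_dY_tensor_list,
    sync_weight_amax: bool,
):
    tensors = fp8_amax_x_tensor_list + fp8_amax_dL_dY_tensor_list
    if sync_weight_amax:
        # weight amaxes only matter when the weight is cast with delayed
        # scaling, e.g. for float8 all-gather
        tensors = tensors + fp8_amax_w_tensor_list
    return torch.cat(tensors)
```

Concatenating all per-tensor amax values into one tensor is presumably what keeps the cross-rank sync to a single collective instead of one per tensor, which is why gating only changes what goes into the `torch.cat`.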
```
@@ -110,3 +112,181 @@ def fsdp_post_all_gather(
        out._scale = scale
        return
    return Float8Tensor(data, scale, param_dtype, self._mm_config), (data,)
```
```
class WeightWithDelayedFloat8CastTensor(torch.Tensor):
```
[no change needed] I wish there was a way to share some more code with the dynamic version
yeah, me too. Looking at the code below, really the only code which would be shared is `fsdp_post_all_gather`; everything else would have to have if/else branches for delayed vs dynamic.
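To make the sharing question concrete, here is a hedged skeleton of the state the delayed-scaling wrapper carries, based only on the attribute names visible in this diff (`_tensor`, `_amax_buffer`, `_amax_history_buffer`, `_scale_buffer`, `_mm_config`); it is a sketch, not this PR's implementation.

```python
import torch

class _DelayedCastWeightSketch(torch.Tensor):
    """Illustrative skeleton only; the real class also implements
    __torch_dispatch__, tensor flatten/unflatten, and the
    fsdp_pre/post_all_gather extension points."""

    @staticmethod
    def __new__(cls, tensor, amax_buffer, amax_history_buffer, scale_buffer, mm_config):
        return torch.Tensor._make_wrapper_subclass(
            cls,
            tensor.shape,
            dtype=tensor.dtype,
            device=tensor.device,
            requires_grad=tensor.requires_grad,
        )

    def __init__(self, tensor, amax_buffer, amax_history_buffer, scale_buffer, mm_config):
        self._tensor = tensor                            # high-precision weight shard
        self._amax_buffer = amax_buffer                  # current-iteration weight amax
        self._amax_history_buffer = amax_history_buffer  # history the delayed scale is computed from
        self._scale_buffer = scale_buffer                # scale consumed by the float8 all-gather cast
        self._mm_config = mm_config                      # forwarded to Float8Tensor after all-gather
```

The extra amax/history/scale state is exactly what the dynamic version does not carry, which is why, as noted above, mainly `fsdp_post_all_gather` would be shareable.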
```
    def __repr__(self):
        return f"WeightWithDelayedFloat8CastTensor(tensor={self._tensor}, amax_buffer={self._amax_buffer}, scale_buffer={self._scale_buffer}, mm_config={self._mm_config})"

    def fsdp_pre_all_gather(self, mesh):
```
I'll let @weifengpy confirm this portion.
confirming that the FSDP part looks good
```
    @classmethod
    def __torch_dispatch__(cls, func, types, args, kwargs=None):
        if func == torch.ops.aten.detach.default:
```
mostly just a nit, but any reason to special-case detach here? Alternatively, you could set it up so that every view op automatically propagates subclass-ness in the same way.
If this is something I wrote, I think it was just something I saw in some other subclasses. Having every view op propagate subclass-ness in the same way sounds good to me.
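As a hedged illustration of the "propagate subclass-ness for every view op" idea (not this PR's code): a wrapper subclass can unwrap all tensor arguments, run the op, and re-wrap every tensor output, so `detach`, `view`, `t`, etc. all behave the same way.

```python
import torch
import torch.utils._pytree as pytree

# Illustrative sketch only; class and attribute names are assumptions.
class WrappedWeight(torch.Tensor):
    @staticmethod
    def __new__(cls, tensor: torch.Tensor):
        return torch.Tensor._make_wrapper_subclass(
            cls, tensor.shape, dtype=tensor.dtype, device=tensor.device,
            requires_grad=tensor.requires_grad,
        )

    def __init__(self, tensor: torch.Tensor):
        self._tensor = tensor

    def __repr__(self):
        return f"WrappedWeight({self._tensor})"

    @classmethod
    def __torch_dispatch__(cls, func, types, args, kwargs=None):
        kwargs = kwargs or {}

        def unwrap(t):
            return t._tensor

        out = func(
            *pytree.tree_map_only(cls, unwrap, args),
            **pytree.tree_map_only(cls, unwrap, kwargs),
        )
        # Re-wrap every tensor output instead of special-casing aten.detach,
        # so detach/view/t/etc. all preserve the subclass the same way.
        # (A production version would also use
        # torch.utils._python_dispatch.return_and_correct_aliasing.)
        return pytree.tree_map_only(torch.Tensor, cls, out)

w = WrappedWeight(torch.randn(4, 4))
print(type(w.detach()), type(w.t()))  # both remain WrappedWeight
```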
stamping for the FSDP part

documenting 2 open questions (not blockers for this PR):

- should we merge `WeightWithDelayedFloat8CastTensor` and `WeightWithDynamicFloat8CastTensor` into one class and add if/else to unify the logic around `__torch_dispatch__` and `fsdp_pre_all_gather` / `fsdp_post_all_gather`? We unified `Float8Linear` already.
- compare perf between `sync_float8_amax_and_scale_history` and `precompute_float8_dynamic_scale_for_fsdp`. If they are similar, people would not need to worry about numeric problems from delayed scaling.
I'm open to it if someone is interested in doing that in a follow-up PR. I'm not sure it will be better than what we have now, though.
yes, that would be great! I think we can do this in follow-up PRs. Note that delayed scaling is theoretically faster than dynamic scaling (fewer memory reads), but performance is not optimized across the stack yet. I think it's good to have options and allow people to optimize different settings in parallel. Eventually, if there is clear data that only one of these is needed, we can delete the one that isn't.
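To spell out the "fewer memory reads" point, here is a minimal, hedged sketch of the two scale computations (not the library's exact functions): dynamic scaling has to read the full weight to get its amax every time, while delayed scaling only reads the small amax history recorded earlier.

```python
import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max

def dynamic_scale(w: torch.Tensor) -> torch.Tensor:
    # reads all of w on every call to compute the current amax
    amax = w.abs().max()
    return E4M3_MAX / torch.clamp(amax, min=1e-12)

def delayed_scale(amax_history: torch.Tensor) -> torch.Tensor:
    # only reads the (tiny) recorded history; w itself is not touched
    amax = amax_history.max()
    return E4M3_MAX / torch.clamp(amax, min=1e-12)

w = torch.randn(2048, 2048)
history = torch.stack([w.abs().max()])  # filled by the amax syncing added in this PR
print(dynamic_scale(w), delayed_scale(history))
```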
@vkuzo has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request has been merged in de93990.
```
    def fsdp_pre_all_gather(self, mesh):
        # initialize if needed
        # TODO(before land): ensure settings are consistent between Float8Linear and here
```
do we still need to resolve this?
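A hedged sketch of what resolving that TODO might look like; the field names are assumptions for illustration, not the repo's actual config API.

```python
# Hypothetical consistency check for the TODO above; `scale_fn_name` and
# `amax_history_len` are assumed field names, not the library's real ones.
def assert_delayed_cast_settings_match(linear_cfg, wrapper_cfg) -> None:
    for field in ("scale_fn_name", "amax_history_len"):
        lhs, rhs = getattr(linear_cfg, field), getattr(wrapper_cfg, field)
        assert lhs == rhs, (
            f"Float8Linear and weight wrapper disagree on {field}: {lhs} vs {rhs}"
        )
```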
```
            self._amax_buffer,
            self._amax_history_buffer,
            self._scale_buffer,
            "max",  # TODO(before land): read this from parent
```
ditto
Stack from ghstack (oldest at bottom):

- #312 support delayed scaling of weight in float8 all-gather (this PR)
- `swap_linear_with_dynamic` from fsdp2 eager test case (#311)

Summary:

Adds support for delayed scaling in FSDP2 float8 all-gather. In detail:

1. add `WeightWithDelayedFloat8CastTensor`; note that we don't reuse code with the dynamic version because I'd rather not deal with plumbing optional tensors through dynamo. We can try that in a separate PR later.
2. wire `Float8Linear` to use (1)
3. add weight amax syncing back, since we need it for float8 all-gather
4. add test coverage for eager mode numerics

Next up (in separate PRs) will be training run validation for numerics, and taking a look at performance.

Test Plan: `./test/test_everything.sh`

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D59685258
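For context, a hedged sketch of where the re-added amax/scale sync sits in a training step with FSDP2 float8 all-gather; the sync helper is passed in as a callable and assumed to take the model as its only argument, so treat this as pseudocode for the step ordering rather than the repo's documented API.

```python
# Hedged sketch of the step ordering only; sync_float8_amax_and_scale_history
# refers to the helper named in this PR's discussion, with an assumed
# model-only signature.
def train_step(model, optimizer, batch, sync_float8_amax_and_scale_history):
    out = model(batch)      # forward; FSDP2 all-gathers the (float8-cast) weights
    out.sum().backward()    # backward records fresh amax values
    # with delayed scaling, amax/scale syncing must run every iteration so the
    # next float8 all-gather uses an up-to-date weight scale
    sync_float8_amax_and_scale_history(model)
    optimizer.step()
    optimizer.zero_grad()
```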