
Conversation

@wanchaol (Collaborator) commented May 7, 2024

As titled, we can directly specify the rowwise parallel embedding's output layouts to be sharded on the sequence dim, so that we don't need the first-layer prepare-input step.

Switching to output_layouts = Shard(1) also triggers a reduce_scatter instead of an allreduce for the embedding layer, which could give some small perf wins.
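
For reference, a minimal sketch of the kind of plan the description implies (not the exact torchtitan code), assuming a Llama-style model with a `tok_embeddings` module and a 1-D tensor-parallel mesh `tp_mesh`:

```python
# Sketch only: `model`, `tok_embeddings`, and `tp_mesh` are assumed names for
# illustration, not the exact torchtitan parallelize plan.
from torch.distributed._tensor import Replicate, Shard
from torch.distributed.tensor.parallel import RowwiseParallel, parallelize_module

parallelize_module(
    model,
    tp_mesh,
    {
        # Row-wise parallel embedding: each TP rank holds a slice of the vocab.
        # Emitting the output as Shard(1) (the sequence dim) means the partial
        # embedding results are combined with a reduce_scatter rather than an
        # allreduce, and the first transformer block no longer needs a separate
        # PrepareModuleInput to shard its input on the sequence dim.
        "tok_embeddings": RowwiseParallel(
            input_layouts=Replicate(),  # token ids replicated across TP ranks
            output_layouts=Shard(1),    # output sharded on the sequence dim
        ),
    },
)
```

The perf intuition: a row-wise sharded embedding produces partial sums that must be combined across ranks; when the desired output is already sharded on the sequence dim, that combine can be a reduce_scatter, which moves roughly half the data of an allreduce followed by a local chunk.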

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on May 7, 2024
@wanchaol requested review from awgu and tianyu-l on May 7, 2024
@wanchaol requested a review from bdhirsh on May 7, 2024
@tianyu-l (Contributor) left a comment


nice!

@wanchaol merged commit f5a3ad7 into main on May 8, 2024
@bdhirsh commented May 8, 2024

cool :)

tianyu-l added a commit to pytorch/examples that referenced this pull request May 16, 2024
…first transformer block"


Following changes in pytorch/torchtitan#314, to apply a reduce-scatter instead of the more expensive all-reduce + local chunk.

cross PR with pytorch/tutorials#2871

[ghstack-poisoned]
@awgu (Collaborator) commented Jun 29, 2024

Reminder to self: we should update the comment and remove the `# 3. Shard the first transformer block's inputs` step.

tianyu-l pushed a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024