Reduce unnecessary GreedyPerfPartitioner calls from MemoryBalancedPartitioner #2914

Closed
wants to merge 1 commit into from

@micrain micrain commented Apr 24, 2025

Summary:
MemoryBalancedPartitioner works by repeatedly calling GreedyPerfPartitioner while adjusting the max memory allowed on each device. A binary search over that memory limit looks for a plan that is more memory-efficient than the one GreedyPerfPartitioner produces by default.

The bounds used for the binary search were looser than necessary; this diff tightens them.

  • Upper bound:
    • Before: Max device HBM (e.g. 80 GB)
    • After: Max HBM usage of the default plan. There is no point in searching for plans whose max memory exceeds that of the default plan.
  • Lower bound:
    • Before: [Avg. HBM per Device] = [Total HBM Needed Across All Shards] / [World Size]
    • After: max([Avg. HBM per Device], [Max HBM Needed by Any Single Shard]). A feasible plan needs at least as much HBM on one device as the biggest shard requires, so there is no point in searching below that.
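
As a rough illustration of the new bounds, here is a minimal sketch. It is not the TorchRec implementation: `search_bounds`, `shard_hbm`, and `default_plan_max_hbm` are hypothetical names, and shard sizes are plain numbers in arbitrary units.

```python
from typing import List, Tuple


def search_bounds(
    shard_hbm: List[float],
    world_size: int,
    default_plan_max_hbm: float,
) -> Tuple[float, float]:
    """Return (lower, upper) bounds for the max-memory binary search.

    Hypothetical helper that only mirrors the bounds described above.
    """
    # Old lower bound: total HBM spread evenly over all devices.
    avg_hbm_per_device = sum(shard_hbm) / world_size
    # New lower bound: some device must hold the biggest shard in full.
    lower = max(avg_hbm_per_device, max(shard_hbm))
    # New upper bound: the max HBM usage of the default plan,
    # instead of the full device HBM (e.g. 80 GB).
    upper = default_plan_max_hbm
    return lower, upper
```

For example, with shards needing 10, 30, 20, and 20 units on 4 devices, the average is 20 but the biggest shard needs 30, so the search starts at 30. If the default plan peaks at 50 units, `search_bounds([10, 30, 20, 20], 4, 50)` returns `(30, 50)`.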

These changes can help in two ways:

  1. The search procedure is more efficient, which leads to plans with lower max memory.
  2. We can reduce search_count and still get plans comparable to before, while calling GreedyPerfPartitioner fewer times from MemoryBalancedPartitioner.

Without further changes, the default effect comes from item 1 above and should be a marginal improvement in max memory.
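
To make the interaction between the bounds and search_count concrete, here is a toy sketch of the search loop. It is an assumption-laden stand-in, not the actual partitioner code: `greedy_perf_partition` is a hypothetical simplification that places each shard on the device with the lowest perf load that still has room, and `memory_balanced_search` is a hypothetical driver that counts how many times it is called.

```python
from typing import List, Optional, Tuple

Shard = Tuple[float, float]  # (hbm_needed, perf_cost), arbitrary units


def greedy_perf_partition(
    shards: List[Shard], world_size: int, hbm_cap: float
) -> Optional[List[float]]:
    """Toy greedy placement; returns per-device HBM usage or None if infeasible."""
    hbm = [0.0] * world_size
    perf = [0.0] * world_size
    # Place the most expensive shards (by perf) first.
    for s_hbm, s_perf in sorted(shards, key=lambda s: s[1], reverse=True):
        fits = [d for d in range(world_size) if hbm[d] + s_hbm <= hbm_cap]
        if not fits:
            return None
        d = min(fits, key=lambda i: perf[i])  # least perf-loaded device
        hbm[d] += s_hbm
        perf[d] += s_perf
    return hbm


def memory_balanced_search(
    shards: List[Shard], world_size: int, device_hbm: float, search_count: int
) -> Tuple[float, int]:
    """Return (best max HBM found, number of greedy partitioner calls)."""
    default_plan = greedy_perf_partition(shards, world_size, device_hbm)
    if default_plan is None:
        raise ValueError("no feasible default plan")
    calls = 1
    best = max(default_plan)
    total_hbm = sum(h for h, _ in shards)
    # Tightened bounds as described in the summary.
    lo = max(total_hbm / world_size, max(h for h, _ in shards))
    hi = best
    for _ in range(search_count):
        mid = (lo + hi) / 2
        plan = greedy_perf_partition(shards, world_size, mid)
        calls += 1
        if plan is None:
            lo = mid  # cap too tight, relax it
        else:
            best = min(best, max(plan))
            hi = mid  # feasible, try a tighter cap
    return best, calls
```

With shards `[(40, 5), (40, 4), (10, 10), (10, 1)]` on 2 devices of 80 units each, the default plan in this toy peaks at 80 units, and a single search step at a cap of 65 already yields a plan peaking at 50, so `search_count=1` returns `(50.0, 2)`. Real workloads will behave differently; the point is only that narrower bounds let each search step do more useful work.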

Differential Revision: D73598477

@facebook-github-bot added the CLA Signed label Apr 24, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73598477

Labels: CLA Signed, fb-exported