Conversation

@liuzqt liuzqt commented Jul 3, 2024

What changes were proposed in this pull request?

Add task on/off-heap execution memory metrics in TaskMetrics, tracked in TaskMemoryManager, assuming acquireExecutionMemory is the only narrow waist for acquiring execution memory.

Why are the changes needed?

Currently there are no task-level on/off-heap execution memory metrics.

There is a peakExecutionMemory metric; however, its semantics are confusing: it only covers the execution memory used by shuffle/join/aggregate/sort, which is accumulated in specific operators and thus does not really reflect the real execution memory.

Therefore it's necessary to add these two metrics.

Also, I created two follow-up sub-tickets:

- https://issues.apache.org/jira/browse/SPARK-48788: accumulate task metrics at the stage level, and display them in the Spark UI
- https://issues.apache.org/jira/browse/SPARK-48789: deprecate peakExecutionMemory once we have a replacement for it

The ultimate goal is to have these two metrics ready (as accumulated stage metrics in Spark UI as well) and deprecate peakExecutionMemory.

Does this PR introduce any user-facing change?

Supposedly no. But the two follow-up sub-tickets will introduce user-facing changes: new metrics exposed in the Spark UI, and deprecation of the old metric.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

NO
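
To illustrate the idea, here is a minimal standalone sketch (toy names only, not the actual patch): every execution-memory acquisition flows through a single narrow waist, where the per-mode peak is recorded and later read into TaskMetrics.

```java
// Toy model only: class, field, and method names here are illustrative, not Spark's API.
public class PeakExecutionMemorySketch {
  enum MemoryMode { ON_HEAP, OFF_HEAP }

  private long peakOnHeapMemory = 0L;
  private long peakOffHeapMemory = 0L;

  // Called from the single narrow waist (acquireExecutionMemory in the real code),
  // with the task's current execution memory usage for this mode after the grant.
  public synchronized void recordAcquisition(MemoryMode mode, long usageAfterGrant) {
    if (mode == MemoryMode.OFF_HEAP) {
      peakOffHeapMemory = Math.max(peakOffHeapMemory, usageAfterGrant);
    } else {
      peakOnHeapMemory = Math.max(peakOnHeapMemory, usageAfterGrant);
    }
  }

  // Read once when the task finishes and copied into the task's metrics.
  public synchronized long getPeakOnHeapExecutionMemory() { return peakOnHeapMemory; }
  public synchronized long getPeakOffHeapExecutionMemory() { return peakOffHeapMemory; }
}
```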

@github-actions github-actions bot added the CORE label Jul 3, 2024
@liuzqt liuzqt changed the title [SPARK-48628] Add task peak on/off heap memory metrics [SPARK-48628][CORE] Add task peak on/off heap memory metrics Jul 3, 2024
@JoshRosen JoshRosen requested a review from mridulm July 5, 2024 20:07
Contributor

@JoshRosen JoshRosen left a comment

I am broadly supportive of this change:

The existing peakExecutionMemory metric is fairly inconsistent in its coverage and misses many important sources of allocation. It was originally added while the MemoryManager abstractions were being developed and was never fully updated in light of that new abstraction. It also predated support for off-heap memory. For all of these reasons, I'm supportive of deprecating and replacing it.

I left a couple of minor nit suggestions, including a suggestion on how we can more explicitly call out the distinction between the old and new metrics in the Scaladocs. I am supportive of deprecating and removing the old metric in favor of these new ones.

@liuzqt liuzqt requested a review from JoshRosen July 8, 2024 18:16
Contributor

@jiangxb1987 jiangxb1987 left a comment

LGTM


if (mode == MemoryMode.OFF_HEAP) {
  peakOffHeapMemory = Math.max(peakOffHeapMemory,
    memoryManager.getOffHeapExecutionMemoryUsageForTask(taskAttemptId));
Member

This introduces extra lock synchronization on the underlying memory pool. I wonder if we could instead compute the latest peak memory arithmetically, like this: peakMemory - releasedMemory + gotMemory.

Contributor Author

Theoretically we could maintain both currentMem and peakMem within TaskMemoryManager so that we don't need to ask memoryManager. On the other hand, memoryManager is already designed to track per-task memory usage, so we would end up maintaining this in two places. @JoshRosen WDYT?

Contributor

@JoshRosen JoshRosen Jul 16, 2024

Although it's true that there might be a bit of redundancy in counting in both places, it seems like there may be reasonable performance justifications for introducing such redundancy.

I don't think it will end up being that much additional code:

Within TaskMemoryManager, I think we'd just need to add a pair of long counter fields, one for on-heap and another for off-heap, then increment them in acquireExecutionMemory and decrement them in releaseExecutionMemory (since those are narrow waists).

Maybe we should give that a try and see how much net code it ends up adding?
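
As a rough sketch of that shape (illustrative names, not the final code; thread-safety is set aside here and discussed below):

```java
// Illustrative only (not the final patch): per-mode counters maintained at the
// acquire/release narrow waists, so the shared MemoryManager never has to be queried.
// Thread-safety is set aside here; see the synchronization discussion below.
class ExecutionMemoryCounters {
  private long currentOnHeap = 0L, currentOffHeap = 0L;
  private long peakOnHeap = 0L, peakOffHeap = 0L;

  // Would be called from acquireExecutionMemory with the bytes actually granted.
  void acquired(boolean offHeap, long got) {
    if (offHeap) {
      currentOffHeap += got;
      peakOffHeap = Math.max(peakOffHeap, currentOffHeap);
    } else {
      currentOnHeap += got;
      peakOnHeap = Math.max(peakOnHeap, currentOnHeap);
    }
  }

  // Would be called from releaseExecutionMemory.
  void released(boolean offHeap, long size) {
    if (offHeap) {
      currentOffHeap -= size;
    } else {
      currentOnHeap -= size;
    }
  }

  long getPeakOnHeap() { return peakOnHeap; }
  long getPeakOffHeap() { return peakOffHeap; }
}
```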

Contributor Author

I noticed that releaseExecutionMemory is not locked, so we need to synchronize on these counters. But I suppose that would still be better than synchronizing at MemoryManager granularity.

@liuzqt liuzqt requested a review from Ngone51 July 16, 2024 03:50
}

if (mode == MemoryMode.OFF_HEAP) {
  synchronized (offHeapMemoryLock) {
Member

The whole acquireExecutionMemory function is under the protection of synchronized (this), and the released memory can be obtained from trySpillAndAcquire():

long released = consumerToSpill.spill(requested, requestingConsumer);

So I don't think we need an extra lock here.

Contributor Author

acquireExecutionMemory is synchronized, but releaseExecutionMemory is not.

Since we'd now be maintaining the current memory here as well, we can either:

  • synchronized (this) on releaseExecutionMemory, or
  • add another lock, for smaller lock granularity (see the sketch below).
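
Roughly, the second option could look like this (toy code; only offHeapMemoryLock mirrors a name already visible in the diff, the rest is illustrative):

```java
// Toy sketch of option 2, not the actual patch. The dedicated locks guard just the
// counters, so releaseExecutionMemory does not have to wait on the TaskMemoryManager-wide
// lock that is held for the whole duration of acquireExecutionMemory.
class LockedExecutionMemoryCounters {
  private final Object onHeapMemoryLock = new Object();
  private final Object offHeapMemoryLock = new Object();
  private long currentOnHeap = 0L;
  private long currentOffHeap = 0L;

  // Option 1 would instead be: synchronized (this) { currentOffHeap -= size; },
  // reusing the lock that acquireExecutionMemory already takes in the real code.
  void released(boolean offHeap, long size) {
    if (offHeap) {
      synchronized (offHeapMemoryLock) {
        currentOffHeap -= size;
      }
    } else {
      synchronized (onHeapMemoryLock) {
        currentOnHeap -= size;
      }
    }
  }
}
```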

Member

Got it.

Actually, could we calculate consumers.map(_.used).sum + got as the peak memory at the end of acquireExecutionMemory?

Contributor Author

I think that's doable. That way we don't even need to maintain the current memory; instead, we update the peak memory after each acquireExecutionMemory call.

Updated the code, please take another look
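
Concretely, the update at the end of acquireExecutionMemory (inside the existing synchronized (this) block) would look roughly like the following sketch (assuming MemoryConsumer exposes getMode() and getUsed(); exact names may differ from the actual code):

```java
// Sketch only: aggregate current usage across all registered consumers of this mode,
// plus the bytes just granted, and fold the result into the per-mode peak.
long current = got;
for (MemoryConsumer c : consumers) {   // O(n) in the number of registered consumers
  if (c.getMode() == mode) {
    current += c.getUsed();
  }
}
if (mode == MemoryMode.OFF_HEAP) {
  peakOffHeapMemory = Math.max(peakOffHeapMemory, current);
} else {
  peakOnHeapMemory = Math.max(peakOnHeapMemory, current);
}
```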

@liuzqt liuzqt requested a review from Ngone51 July 18, 2024 00:52
@mridulm
Contributor

mridulm commented Jul 20, 2024

I have not looked at this PR in detail, but we already have OnHeapExecutionMemory and OffHeapExecutionMemory.
They should be capturing this, no?

@liuzqt
Contributor Author

liuzqt commented Jul 20, 2024

OnHeapExecutionMemory

Hi @mridulm, I think we have executor-level on/off-heap execution memory metrics, but not task/stage-level ones. (I might be wrong... feel free to point me to the relevant code path.)

@liuzqt
Contributor Author

liuzqt commented Jul 24, 2024

Seeing this test failure:

Failure: "connector/connect/common/src/main" had no .proto files
Error: Failure: "connector/connect/common/src/main" had no .proto files

Doesn't seem to be relevant...

Contributor

@mridulm mridulm left a comment

Take a look at peakExecutionMemory within spark-core.
We should be exposing the new metrics as part of the API - both at the task level and at the stage level (distributions, for example).

/**
 * Peak off heap execution memory as tracked by TaskMemoryManager.
 */
def peakOffHeapExecutionMemory: Long = _peakOffHeapExecutionMemory.sum
Contributor

Discuss:
Is it required that peakExecutionMemory <= peakOnHeapExecutionMemory + peakOffHeapExecutionMemory?
Any cases where this might get violated?

I am trying to reason about completeness of these metrics (given we want to eventually deprecate the existing one).
I expect the above to hold, but want to make sure I am not missing anything.
+CC @JoshRosen

Contributor Author

peakExecutionMemory <= peakOnHeapExecutionMemory + peakOffHeapExecutionMemory?

I think yes, because TaskMemoryManager.acquireExecutionMemory is the only narrow waist for any execution memory acquisition, and we maintain the memory here.

In contrast, the legacy peakExecutionMemory is maintained in specific operators (join, agg, sort), so its coverage is entirely up to each operator's implementation.

Contributor

+1, I agree that peakExecutionMemory <= peakOnHeapExecutionMemory + peakOffHeapExecutionMemory should hold:

If we trace through the existing callers of incPeakExecutionMemory, it looks like all of the usages flow from counts that correspond to the acquireExecutionMemory waist.

@liuzqt
Contributor Author

liuzqt commented Jul 25, 2024

Take a look at peakExecutionMemory within spark-core. We should be exposing the new metrics as part of the api - both at task level, and at stage level (distributions for ex).

We should definitely expose this in the API. But can we land this core change first and then make the API/UI changes in follow-up PRs?

Actually, I've created a sub-task https://issues.apache.org/jira/browse/SPARK-48788 for that.

@liuzqt liuzqt requested a review from mridulm August 1, 2024 04:52
@liuzqt
Contributor Author

liuzqt commented Aug 1, 2024

Hi @JoshRosen @mridulm, do you mind taking another look at this PR?

@jiangxb1987
Contributor

LGTM

@jiangxb1987
Contributor

Merged to master, thanks @mridulm @JoshRosen @Ngone51 for review!

*/
// TODO: SPARK-48789: the naming is confusing since this does not really reflect the whole
// execution memory. We'd better deprecate this once we have a replacement.
def peakExecutionMemory: Long = _peakExecutionMemory.sum
Contributor

How about we change its implementation to be peakOnHeapExecutionMemory + peakOffHeapExecutionMemory? The current implementation doesn't make much sense due to https://github.com/apache/spark/pull/47192/files#r1692144786

Contributor Author

Yes, I think we can plan this breaking change for Spark 4.0.

Contributor

@mridulm mridulm Aug 8, 2024

peakOnHeapExecutionMemory and peakOffHeapExecutionMemory can peak at different times, so we can't replace it with their sum.
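
For example (illustrative numbers): if on-heap usage peaks at 100 MB while off-heap sits at 10 MB, and off-heap later peaks at 200 MB after on-heap has dropped to 20 MB, the two per-mode peaks sum to 300 MB even though total execution memory never exceeded 220 MB at any single point in time.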

Contributor

Oh I see, makes sense. Let's leave it then.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @liuzqt, @JoshRosen, @cloud-fan, @jiangxb1987, @Ngone51, @mridulm.

This commit seems to cause a regression in some cases.

Specifically, ExternalAppendOnlyUnsafeRowArrayBenchmark is severely affected by this commit and has been failing in CI ever since because it almost hangs.

I also verified the same behavior locally.

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 14, 2024

Although I tried to write a follow-up PR to provide a quick fix in TaskMemoryManager.java, it was a little tricky.

I created a reverting PR for now. It would be great if this PR could land again properly, with the relevant micro-benchmark results.

Otherwise, any follow-up PR with ExternalAppendOnlyUnsafeRowArrayBenchmark results is also welcome if you can provide one, @liuzqt.

@liuzqt
Contributor Author

liuzqt commented Aug 14, 2024

@dongjoon-hyun thanks for the catch, I'll investigate the regression and try to fix it.

dongjoon-hyun added a commit that referenced this pull request Aug 14, 2024
### What changes were proposed in this pull request?

This reverts commit 717a6da.

### Why are the changes needed?

To fix a performance regression.

During the regular performance audit,
- #47743

`ExternalAppendOnlyUnsafeRowArrayBenchmark` detected a performance regression caused by SPARK-48628.
- #47192

### Does this PR introduce _any_ user-facing change?

No. This is not released yet.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47747 from dongjoon-hyun/SPARK-48628.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member

Thank you so much, @liuzqt!

@mridulm
Contributor

mridulm commented Aug 14, 2024

Thanks for the details, @dongjoon-hyun!
We can revisit this PR with the benchmark fixed as well.

liuzqt added a commit to liuzqt/spark that referenced this pull request Aug 15, 2024
### What changes were proposed in this pull request?

Add task on/off-heap execution memory metrics in `TaskMetrics`, tracked in `TaskMemoryManager`, **assuming `acquireExecutionMemory` is the only narrow waist for acquiring execution memory.**

### Why are the changes needed?

Currently there are no task-level on/off-heap execution memory metrics.

There is a [peakExecutionMemory](https://github.com/apache/spark/blob/3cd35f8cb6462051c621cf49de54b9c5692aae1d/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala#L114) metric; however, its semantics are confusing: it only covers the execution memory used by shuffle/join/aggregate/sort, which is accumulated in specific operators and thus does not really reflect the real execution memory.

Therefore it's necessary to add these two metrics.

Also, I created two follow-up sub-tickets:

- https://issues.apache.org/jira/browse/SPARK-48788: accumulate task metrics at the stage level, and display them in the Spark UI
- https://issues.apache.org/jira/browse/SPARK-48789: deprecate `peakExecutionMemory` once we have a replacement for it.

The ultimate goal is to have these two metrics ready (as accumulated stage metrics in Spark UI as well) and deprecate `peakExecutionMemory`.

### Does this PR introduce _any_ user-facing change?

Supposedly no. But the two follow-up sub-tickets will introduce user-facing changes: new metrics exposed in the Spark UI, and deprecation of the old metric.

### How was this patch tested?
New tests.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes apache#47192 from liuzqt/SPARK-48628.

Authored-by: Ziqi Liu <[email protected]>
Signed-off-by: Xingbo Jiang <[email protected]>
JoshRosen pushed a commit that referenced this pull request Aug 21, 2024
### What changes were proposed in this pull request?

This PR is trying to revive #47192, which was [reverted](#47747) due to a regression in `ExternalAppendOnlyUnsafeRowArrayBenchmark`.

**Root cause**
We eventually decided to aggregate peak memory usage from all consumers on each `acquireExecutionMemory` invocation (see [this discussion](#47192 (comment))), which is O(n) where `n` is the number of consumers.

`ExternalAppendOnlyUnsafeRowArrayBenchmark` is implemented in a way that all iterations run in a single task context, so the number of consumers explodes.

Notice that `TaskMemoryManager.consumers` is never cleaned up during the task's whole lifecycle, and `TaskMemoryManager.acquireExecutionMemory` is a very frequent operation, so doing a linear-complexity operation (in terms of the number of consumers) there might not be a good choice. This benchmark might be a corner case, but it's still possible to have a large number of consumers in a large query plan.

I fell back to the previous implementation: maintain the current execution memory with an extra lock. cc @Ngone51

#### Benchmark result
[ExternalAppendOnlyUnsafeRowArrayBenchmark-results](https://github.com/liuzqt/spark/actions/runs/10415213026)
[ExternalAppendOnlyUnsafeRowArrayBenchmark-jdk21-results](https://github.com/liuzqt/spark/actions/runs/10414246805)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #47776 from liuzqt/SPARK-48628.

Authored-by: Ziqi Liu <[email protected]>
Signed-off-by: Josh Rosen <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Aug 21, 2024