[SPARK-48628][CORE] Add task peak on/off heap memory metrics #47192
Conversation
core/src/test/scala/org/apache/spark/memory/MemoryManagerSuite.scala (outdated review thread, resolved)
JoshRosen left a comment:
I am broadly supportive of this change:
The existing peakExecutionMemory metric is fairly inconsistent in its coverage and misses many important sources of allocation. It was originally added while the MemoryManager abstractions were being developed and was never fully updated in light of that new abstraction. It also predated support for off-heap memory. For all of these reasons, I'm supportive of deprecating and replacing it.
I left a couple of minor nit suggestions, including a suggestion on how we can more explicitly call out the distinction between the old and new metrics in the Scaladocs. I am supportive of deprecating and removing the old metric in favor of these new ones.
jiangxb1987 left a comment:
LGTM
```java
if (mode == MemoryMode.OFF_HEAP) {
  peakOffHeapMemory = Math.max(peakOffHeapMemory,
    memoryManager.getOffHeapExecutionMemoryUsageForTask(taskAttemptId));
```
This introduces extra lock synchronization on the underlying memory pool. I wonder if we could instead compute the latest peak memory with a simple calculation, like peakMemory - releasedMemory + gotMemory.
Theoretically we can maintain both currentMem and peakMem within TaskMemoryManager so that we don’t need to ask memoryManager, but on the other hand memoryManager is designed to maintain per-task mem usage so by doing this we kinda maintain this in two places. @JoshRosen WDYT
Although it's true that there might be a bit of redundancy in counting in both places, it seems like there may be reasonable performance justifications for introducing such redundancy.
I don't think it will end up being that much additional code:
Within TaskMemoryManager, I think we'd just need to add a pair of long counter fields, one for on-heap and another for off-heap, then increment them in acquireExecutionMemory and decrement them in releaseExecutionMemory (since those are narrow waists).
Maybe we should give that a try and see how much net code it ends up adding?
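To make the proposal above concrete, here is a rough standalone sketch (not Spark's actual TaskMemoryManager code; all names are illustrative) of the counter-pair idea: one current/peak pair per memory mode, updated only at the acquire/release narrow waists. Thread safety is deliberately left out here, since that is exactly what the following comments discuss.

```java
// Illustrative sketch only: a pair of current/peak counters per memory mode,
// updated at the acquire/release narrow waists. Not the actual Spark code.
enum MemoryMode { ON_HEAP, OFF_HEAP }

class PeakCounterSketch {
  private long currentOnHeap = 0L, peakOnHeap = 0L;
  private long currentOffHeap = 0L, peakOffHeap = 0L;

  // Would be called from acquireExecutionMemory with the bytes actually granted.
  void recordAcquire(MemoryMode mode, long got) {
    if (mode == MemoryMode.OFF_HEAP) {
      currentOffHeap += got;
      peakOffHeap = Math.max(peakOffHeap, currentOffHeap);
    } else {
      currentOnHeap += got;
      peakOnHeap = Math.max(peakOnHeap, currentOnHeap);
    }
  }

  // Would be called from releaseExecutionMemory with the bytes given back.
  void recordRelease(MemoryMode mode, long released) {
    if (mode == MemoryMode.OFF_HEAP) {
      currentOffHeap -= released;
    } else {
      currentOnHeap -= released;
    }
  }

  long peakOnHeapMemory() { return peakOnHeap; }
  long peakOffHeapMemory() { return peakOffHeap; }
}
```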
I noticed that releaseExecutionMemory is not locked, so we need to synchronize on these counters. But I suppose that would be better than synchronizing at the MemoryManager granularity.
```java
}

if (mode == MemoryMode.OFF_HEAP) {
  synchronized (offHeapMemoryLock) {
```
The whole function acquireExecutionMemory is under the protection of synchronized (this), and the released memory can be obtained from trySpillAndAcquire(): `long released = consumerToSpill.spill(requested, requestingConsumer);`. So I don't think we need an extra lock here.
acquireExecutionMemory is synchronized but releaseExecutionMemory is not synchronized.
While we maintain the current memory in both places, we can either:
- synchronized (this) on releaseExecutionMemory, or
- add another lock, for smaller lock granularity (a rough sketch of this option follows)
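A rough sketch of the second option (illustrative names, not the actual patch, although the diff quoted above does show a synchronized (offHeapMemoryLock) block): a dedicated lock object guards the counters, so the release path does not have to take the TaskMemoryManager's own monitor.

```java
// Illustrative sketch of the "separate lock" option: a dedicated lock object
// guards the off-heap counters so releaseExecutionMemory does not need to
// synchronize on the whole TaskMemoryManager.
class OffHeapPeakSketch {
  private final Object offHeapMemoryLock = new Object();
  private long currentOffHeapMemory = 0L;
  private long peakOffHeapMemory = 0L;

  // Hook for the acquire path (itself synchronized on the manager).
  void onOffHeapAcquire(long got) {
    synchronized (offHeapMemoryLock) {
      currentOffHeapMemory += got;
      peakOffHeapMemory = Math.max(peakOffHeapMemory, currentOffHeapMemory);
    }
  }

  // Hook for the release path, which is not otherwise synchronized.
  void onOffHeapRelease(long released) {
    synchronized (offHeapMemoryLock) {
      currentOffHeapMemory -= released;
    }
  }

  long peakOffHeapExecutionMemory() {
    synchronized (offHeapMemoryLock) {
      return peakOffHeapMemory;
    }
  }
}
```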
Got it.
Actually, could we calculate `consumers.map(_.used).sum + got` as the peak memory at the end of acquireExecutionMemory?
I think that's doable. That way we don't even need to maintain the current memory; instead we update the peak memory after each acquireExecutionMemory call.
Updated the code, please take another look
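For context, a rough sketch of what the "sum the consumers at the end of each acquire" version looks like (illustrative names, not the actual patch). Note that, as the revert discussion further down explains, this is O(n) in the number of consumers per acquireExecutionMemory call.

```java
// Illustrative sketch: no running counter; recompute the task's current usage
// from all consumers at the end of each acquire and fold it into the peak.
import java.util.ArrayList;
import java.util.List;

class ConsumerSumSketch {
  static final class Consumer {
    long used; // bytes currently held by this consumer
  }

  private final List<Consumer> consumers = new ArrayList<>();
  private long peakMemory = 0L;

  synchronized long acquire(Consumer consumer, long got) {
    if (!consumers.contains(consumer)) {
      consumers.add(consumer);
    }
    consumer.used += got;
    // Equivalent of consumers.map(_.used).sum: a linear scan over consumers.
    long current = 0L;
    for (Consumer c : consumers) {
      current += c.used;
    }
    peakMemory = Math.max(peakMemory, current);
    return got;
  }

  synchronized long peakExecutionMemory() { return peakMemory; }
}
```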
I have not looked at this PR in detail, but we already have
Hi @mridulm I think we have executor-level on/off heap execution memory metrics, but not task/stage-level ones. (I might be wrong... feel free to point me to the relevant code path)
Seeing this test failure; doesn't seem to be relevant...
Take a look at peakExecutionMemory within spark-core.
We should be exposing the new metrics as part of the API - both at the task level and at the stage level (distributions, for example).
```scala
/**
 * Peak off heap execution memory as tracked by TaskMemoryManager.
 */
def peakOffHeapExecutionMemory: Long = _peakOffHeapExecutionMemory.sum
```
Discuss:
Is it required that peakExecutionMemory <= peakOnHeapExecutionMemory + peakOffHeapExecutionMemory?
Any cases where this might get violated?
I am trying to reason about completeness of these metrics (given we want to eventually deprecate the existing one).
I expect the above to hold, but want to make sure I am not missing anything.
+CC @JoshRosen
peakExecutionMemory <= peakOnHeapExecutionMemory + peakOffHeapExecutionMemory?
I think yes, because TaskMemoryManager.acquireExecutionMemory is the only narrow waist for any execution memory acquisition, and we maintain the memory there.
In contrast, the legacy peakExecutionMemory is maintained in some operators (join, agg, sort), which is entirely up to the operator implementation.
+1, I agree that the peakExecutionMemory <= peakOnHeapExecutionMemory + peakOffHeapExecutionMemory should hold:
If we trace through the existing callers of incPeakExecutionMemory it looks like all of the usages flow from counts that correspond to the acquireExecutionMemory waist.
We should definitely expose this to the API. But can we land this core change first and then make the API/UI changes in follow-up PRs? Actually I've created a sub-task https://issues.apache.org/jira/browse/SPARK-48788 for that.
Hi @JoshRosen @mridulm do you mind taking another look at this PR?
LGTM
Merged to master, thanks @mridulm @JoshRosen @Ngone51 for review!
```scala
// TODO: SPARK-48789: the naming is confusing since this does not really reflect the whole
// execution memory. We'd better deprecate this once we have a replacement.
def peakExecutionMemory: Long = _peakExecutionMemory.sum
```
How about we change its implementation to be peakOnHeapExecutionMemory + peakOffHeapExecutionMemory? The current implementation doesn't make much sense due to https://github.com/apache/spark/pull/47192/files#r1692144786
Yes, I think we can plan this breaking change for Spark 4.0.
peakOnHeapExecutionMemory and peakOffHeapExecutionMemory can peak at different times, so we can't replace it with their sum.
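A small made-up numerical example of the point above: if the two modes peak at different times, the sum of the individual peaks overstates the true combined peak.

```java
// Made-up numbers: on-heap peaks first, off-heap peaks later.
public class PeakSumExample {
  public static void main(String[] args) {
    long[] onHeap  = {100, 10};  // MB at two points in time
    long[] offHeap = {10, 100};
    long peakOn = 0, peakOff = 0, peakTotal = 0;
    for (int t = 0; t < onHeap.length; t++) {
      peakOn = Math.max(peakOn, onHeap[t]);
      peakOff = Math.max(peakOff, offHeap[t]);
      peakTotal = Math.max(peakTotal, onHeap[t] + offHeap[t]);
    }
    System.out.println(peakOn + peakOff); // 200: sum of the individual peaks
    System.out.println(peakTotal);        // 110: actual peak of the total
  }
}
```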
Oh I see, makes sense. Let's leave it then.
dongjoon-hyun left a comment:
Hi @liuzqt, @JoshRosen, @cloud-fan, @jiangxb1987, @Ngone51, @mridulm.
This commit seems to cause a regression in some cases.
Specifically, ExternalAppendOnlyUnsafeRowArrayBenchmark is severely affected by this commit and has been failing in CIs since then because it almost hangs.
I also verified that it's the same locally.
Although I tried to do a follow-up PR to provide a quick fix, I created a reverting PR for now. It would be great if this PR lands again properly with the relevant micro-benchmark results. Otherwise, any follow-up PR with
@dongjoon-hyun thanks for the catch, I'll investigate the regression and try to fix it.
### What changes were proposed in this pull request?
This reverts commit 717a6da.

### Why are the changes needed?
To fix a performance regression. During the regular performance audit,
- #47743

`ExternalAppendOnlyUnsafeRowArrayBenchmark` detected a performance regression caused by SPARK-48628.
- #47192

### Does this PR introduce _any_ user-facing change?
No. This is not released yet.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47747 from dongjoon-hyun/SPARK-48628.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Thank you so much, @liuzqt!
Thanks for the details @dongjoon-hyun!
### What changes were proposed in this pull request?
Add task on/off heap execution memory in `TaskMetrics`, tracked in `TaskMemoryManager`, **assuming `acquireExecutionMemory` is the only narrow waist for acquiring execution memory.**

### Why are the changes needed?
Currently there are no task on/off heap execution memory metrics. There is a [peakExecutionMemory](https://github.com/apache/spark/blob/3cd35f8cb6462051c621cf49de54b9c5692aae1d/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala#L114) metric, however, its semantic is confusing: it only covers the execution memory used by shuffle/join/aggregate/sort, which is accumulated in specific operators and thus does not really reflect the real execution memory. Therefore it's necessary to add these two metrics.

Also I created two follow-up sub-tickets:
- https://issues.apache.org/jira/browse/SPARK-48788 : accumulate task metrics in stage, and display in Spark UI
- https://issues.apache.org/jira/browse/SPARK-48789 : deprecate `peakExecutionMemory` once we have a replacement for it.

The ultimate goal is to have these two metrics ready (as accumulated stage metrics in the Spark UI as well) and deprecate `peakExecutionMemory`.

### Does this PR introduce _any_ user-facing change?
Supposedly no. But the two follow-up sub-tickets will have user-facing changes: new metrics exposed to the Spark UI, and old metric deprecation.

### How was this patch tested?
New test.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes apache#47192 from liuzqt/SPARK-48628.

Authored-by: Ziqi Liu <[email protected]>
Signed-off-by: Xingbo Jiang <[email protected]>
### What changes were proposed in this pull request?
This PR is trying to revive #47192, which was [reverted](#47747) due to a regression in `ExternalAppendOnlyUnsafeRowArrayBenchmark`.

**Root cause**

We eventually decided to aggregate peak memory usage from all consumers on each `acquireExecutionMemory` invocation (see [this discussion](#47192 (comment))), which is O(n) complexity where `n` is the number of consumers. `ExternalAppendOnlyUnsafeRowArrayBenchmark` is implemented in a way that all iterations are run in a single task context, so the number of consumers explodes. Notice that `TaskMemoryManager.consumers` is never cleaned up during its whole lifecycle, and `TaskMemoryManager.acquireExecutionMemory` is a very frequent operation, so doing a linear-complexity (in terms of the number of consumers) operation there might not be a good choice. This benchmark might be a corner case, but it's still possible to have a large number of consumers in a large query plan.

I fall back to the previous implementation: maintain current execution memory with an extra lock. cc Ngone51

#### Benchmark result
- [ExternalAppendOnlyUnsafeRowArrayBenchmark-results](https://github.com/liuzqt/spark/actions/runs/10415213026)
- [ExternalAppendOnlyUnsafeRowArrayBenchmark-jdk21-results](https://github.com/liuzqt/spark/actions/runs/10414246805)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #47776 from liuzqt/SPARK-48628.

Authored-by: Ziqi Liu <[email protected]>
Signed-off-by: Josh Rosen <[email protected]>
What changes were proposed in this pull request?
Add task on/off heap execution memory in TaskMetrics, tracked in TaskMemoryManager, assuming acquireExecutionMemory is the only narrow waist for acquiring execution memory.

Why are the changes needed?
Currently there are no task on/off heap execution memory metrics.
There is a peakExecutionMemory metric, however, its semantic is confusing: it only covers the execution memory used by shuffle/join/aggregate/sort, which is accumulated in specific operators and thus does not really reflect the real execution memory.
Therefore it's necessary to add these two metrics.
Also I created two follow-up sub-tickets:
- https://issues.apache.org/jira/browse/SPARK-48788 : accumulate task metrics in stage, and display in Spark UI
- https://issues.apache.org/jira/browse/SPARK-48789 : deprecate peakExecutionMemory once we have a replacement for it.
The ultimate goal is to have these two metrics ready (as accumulated stage metrics in the Spark UI as well) and deprecate peakExecutionMemory.

Does this PR introduce any user-facing change?
Supposedly no. But the two follow-up sub-tickets will have user-facing changes: new metrics exposed to the Spark UI, and old metric deprecation.

How was this patch tested?
New test.

Was this patch authored or co-authored using generative AI tooling?
NO