Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Dec 5, 2022

What changes were proposed in this pull request?

This PR aims to prevent getReusablePVCs from choosing recently created PVCs in the very previous batch by excluding newly created PVCs whose creation time is within spark.kubernetes.allocation.batch.delay.

Why are the changes needed?

In case of slow K8s control plane situation where spark.kubernetes.allocation.batch.delay is too short relatively or spark.kubernetes.executor.enablePollingWithResourceVersion=true is used, onNewSnapshots may not return the full list of executor pods created by the previous batch. This sometimes makes Spark driver think the PVCs in the previous batch are reusable for the next batch.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs with the newly created test case.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-41388][K8S] getReusablePVCs should ignore recently created PVCs in the previous batch [SPARK-41388][K8S] getReusablePVCs should ignore recently created PVCs in the previous batch Dec 5, 2022
.thenReturn(Seq(persistentVolumeClaim("pvc-0", "gp2", "200Gi")).asJava)
val pvc = persistentVolumeClaim("pvc-0", "gp2", "200Gi")
pvc.getMetadata
.setCreationTimestamp(Instant.now().minus(podAllocationDelay + 1, MILLIS).toString)
Copy link
Member Author

@dongjoon-hyun dongjoon-hyun Dec 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a previous test case for SPARK-35416: Support PersistentVolumeClaim Reuse.

Since our test framework does't fill CreationTimestamp, this PR added it properly.

.withNewMetadata()
.withName(claimName)
.addToLabels(SPARK_APP_ID_LABEL, TEST_SPARK_APP_ID)
.endMetadata()

@dongjoon-hyun
Copy link
Member Author

Could you review this when you have some time, @viirya ?

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale make sense.

@dongjoon-hyun
Copy link
Member Author

Thank you so much!

@dongjoon-hyun
Copy link
Member Author

All tests passed. Merged to master/3.3/3.2.

dongjoon-hyun added a commit that referenced this pull request Dec 5, 2022
…VCs in the previous batch

This PR aims to prevent `getReusablePVCs` from choosing recently created PVCs in the very previous batch by excluding newly created PVCs whose creation time is within `spark.kubernetes.allocation.batch.delay`.

In case of slow K8s control plane situation where `spark.kubernetes.allocation.batch.delay` is too short relatively or `spark.kubernetes.executor.enablePollingWithResourceVersion=true` is used, `onNewSnapshots` may not return the full list of executor pods created by the previous batch. This sometimes makes Spark driver think the PVCs in the previous batch are reusable for the next batch.

No.

Pass the CIs with the newly created test case.

Closes #38912 from dongjoon-hyun/SPARK-41388.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e234cd8)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Dec 5, 2022
…VCs in the previous batch

This PR aims to prevent `getReusablePVCs` from choosing recently created PVCs in the very previous batch by excluding newly created PVCs whose creation time is within `spark.kubernetes.allocation.batch.delay`.

In case of slow K8s control plane situation where `spark.kubernetes.allocation.batch.delay` is too short relatively or `spark.kubernetes.executor.enablePollingWithResourceVersion=true` is used, `onNewSnapshots` may not return the full list of executor pods created by the previous batch. This sometimes makes Spark driver think the PVCs in the previous batch are reusable for the next batch.

No.

Pass the CIs with the newly created test case.

Closes #38912 from dongjoon-hyun/SPARK-41388.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e234cd8)
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 651f5da)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun deleted the SPARK-41388 branch December 5, 2022 17:02
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
…VCs in the previous batch

### What changes were proposed in this pull request?

This PR aims to prevent `getReusablePVCs` from choosing recently created PVCs in the very previous batch by excluding newly created PVCs whose creation time is within `spark.kubernetes.allocation.batch.delay`.

### Why are the changes needed?

In case of slow K8s control plane situation where `spark.kubernetes.allocation.batch.delay` is too short relatively or `spark.kubernetes.executor.enablePollingWithResourceVersion=true` is used, `onNewSnapshots` may not return the full list of executor pods created by the previous batch. This sometimes makes Spark driver think the PVCs in the previous batch are reusable for the next batch.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly created test case.

Closes apache#38912 from dongjoon-hyun/SPARK-41388.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…VCs in the previous batch

### What changes were proposed in this pull request?

This PR aims to prevent `getReusablePVCs` from choosing recently created PVCs in the very previous batch by excluding newly created PVCs whose creation time is within `spark.kubernetes.allocation.batch.delay`.

### Why are the changes needed?

In case of slow K8s control plane situation where `spark.kubernetes.allocation.batch.delay` is too short relatively or `spark.kubernetes.executor.enablePollingWithResourceVersion=true` is used, `onNewSnapshots` may not return the full list of executor pods created by the previous batch. This sometimes makes Spark driver think the PVCs in the previous batch are reusable for the next batch.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly created test case.

Closes apache#38912 from dongjoon-hyun/SPARK-41388.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e234cd8)
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants