[SPARK-27666][CORE] Do not release lock while TaskContext already completed #24699
Conversation
def shutdownOnTaskCompletion() {
  assert(context.isCompleted)
  this.interrupt()
  this.join()
Ur..I think this change just hangs the execution.
This just blocks the current thread and waits for the interrupt to take effect?
It causes a deadlock on the lock on task context.
Hi, @viirya. Can you explain more about the 'deadlock', please? I notice that it's possible to fall into a deadlock if WriteThread tries to acquire the lock on TaskContext after this.join() is invoked, but I haven't found where that actually happens.
For example, when the writer thread is reading from a cached relation, it needs to add a task completion listener. At that moment, it needs to acquire the lock on the task context.
hmm... this.join() is called within the task runner thread while holding the lock on the task context. Once join() is called, WriteThread would throw InterruptedException, and then we run into the case _: InterruptedException => code branch within WriteThread. In this branch, if we acquire the lock on the task context, we fall into a deadlock; otherwise, WriteThread exits normally. Right?
And if WriteThread wants to acquire the lock on the task context (as you mentioned, adding a task completion listener due to reading from a cached relation) outside of the above code branch (or say, within its normal execution), it just performs a normal lock competition with the task runner thread. WDYT?
Please correct me if I misunderstand some points.
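To make the deadlock scenario being discussed concrete, here is a minimal, generic sketch in plain Scala (not code from this PR; all names are illustrative): one thread holds a monitor and join()s another thread, while that other thread needs the same monitor before it can exit, so neither makes progress.

```scala
// Generic illustration of "join() while holding a lock the joined thread needs".
object JoinWhileHoldingLock {
  def main(args: Array[String]): Unit = {
    val lock = new Object

    val writer = new Thread(new Runnable {
      override def run(): Unit = {
        lock.synchronized {          // e.g. "add a task completion listener"
          println("writer acquired the lock")
        }
      }
    })

    lock.synchronized {              // the "task runner" holds the lock ...
      writer.start()
      writer.join()                  // ... and waits for the writer: deadlock,
    }                                // since the writer needs the same lock to finish
    println("never reached")
  }
}
```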
Test build #105762 has finished for PR 24699 at commit
cc @cloud-fan
On second thought, it seems overkill to block the main thread and wait for the Python writer thread to exit. If something bad happens we may block the main thread for a long time or even hang. Perhaps a better solution is to add
This reverts commit 74bbbad.
Since killing PythonRunner's WriteThread may be overkill and carries uncontrollable risk, after discussing with @cloud-fan offline we decided to use another solution, which has been updated in the PR description. And, for simplicity, I changed the original JIRA title from "Stop PythonRunner's WriteThread immediately when task finishes" to the current one.
// SPARK-27666. Child thread spawned from task thread could produce race condition
// on block lock releasing. We should prevent child thread from releasing un-locked
// block when task thread has already finished.
if (taskContext.isDefined && taskContext.map(_.isCompleted()).get) {
nit: taskContext.isDefined && taskContext.get.isCompleted
blockInfoManager.unlock(blockId, taskAttemptId)
def releaseLock(blockId: BlockId, taskContext: Option[TaskContext] = None): Unit = {
  val taskAttemptId = taskContext.map(_.taskAttemptId())
  // SPARK-27666. Child thread spawned from task thread could produce race condition
A simpler explanation: When a task completes, Spark automatically releases all the blocks locked by this task. We should not release any blocks for a task that is already completed.
  t.start()
  Iterator(0)
}.collect()
Thread.sleep(10 * 150)
Can we use a more reliable way to test it? We can set up a CountDownLatch, and count it down at the end of the thread. Then we wait for the count down at the end of the test.
CountDownLatch can't be serialized in Task, so it doesn't work.
Why not add a TaskCompletionListener and wait for the listener to be triggered? Something like:
eventually(timeout(10.seconds)) {
  assert(taskCompleted) // `taskCompleted` is a placeholder flag set by the listener
}
IIUC, TaskCompletionListener will be called after collect() is done, but thread t will still be running at that time.
Makes sense. Also curious: how did the magic number 10 * 150 come about?
Thread.sleep(10 * 150) > iter.size * the 100ms sleep per element in the child thread, so the child thread should have finished by then.
We shouldn't use sleep in tests, as the test will become flaky sooner or later. If CountDownLatch doesn't work, can we use a Spark Accumulator as a signal?
I tried an Accumulator previously: the child thread adds 1 to the accumulator every 100ms, and the test thread waits until it reaches 10. But that doesn't work either, because we want the signal to come from the child thread, while accumulator updates only arrive with the finished task. Unfortunately, in this case, the task finishes before the child thread does.
If this kind of end-to-end test doesn't work without sleep, how about we add a unit test for the releaseLock behavior when the task context is completed?
How about this way: eba8e96? I tested it with and without this PR, and it works. @cloud-fan @jiangxb1987 @viirya
Have you considered changing the logic in https://github.com/apache/spark/blob/e9f3f62b2c0f521f3cc23fef381fc6754853ad4f/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L745~L747 ? Skipping the lock release if the TaskContext has completed should also resolve the issue, or is there something I missed?
Test build #105875 has finished for PR 24699 at commit
Do you mean like this, @jiangxb1987? I was thinking about it, but for: spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala Lines 764 to 766 in e9f3f62
it seems we can't wrap an if condition around spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala Lines 1666 to 1672 in e9f3f62
We could also wrap an if condition around
We can go either way; both look fine to me. I would refactor
 * The param `taskAttemptId` should be passed in case we can't get the correct TID from
 * TaskContext, for example, the input iterator of a cached RDD iterates to the end in a child
 * The param `taskContext` should be passed in case we can't get the correct TaskContext
 * for example, the input iterator of a cached RDD iterates to the end in a child
nit: a `,` is missing before "for example".
// on block lock releasing. We should prevent child thread from releasing un-locked
// block when task thread has already finished.
if (taskContext.isDefined && taskContext.map(_.isCompleted()).get) {
  logWarning(s"Task $taskAttemptId already completed, not releasing lock for $blockId")
${taskAttemptId.get}
Thanks for catching this.
}

/**
 * Release a lock on the given block with explicit TID.
with explicit TaskContext
Test build #105938 has finished for PR 24699 at commit
retest this please
Test build #105945 has finished for PR 24699 at commit
ping @cloud-fan @jiangxb1987 @viirya any more comments?
LGTM
Test build #106468 has finished for PR 24699 at commit
Test build #106469 has finished for PR 24699 at commit
}.collect()
val tmx = ManagementFactory.getThreadMXBean
var t = tmx.getThreadInfo(tid.value)
// getThreadInfo() will return null after child thread `t` died
This reminds me of one thing: the tests are run in local mode, so driver and executor are in the same JVM. Seems CountDownLatch should work? What we really need is: the thread in the task sends a signal to the main thread.
Seems CountDownLatch should work?
CountDownLatch doesn't work because it is not serializable.
  while (t != null && t.getThreadState != Thread.State.TERMINATED) {
    t = tmx.getThreadInfo(tid.value)
  }
}
Can we use eventually with timeout instead of a while loop?
eventually(timeout(10.seconds)) {
  val t = tmx.getThreadInfo(tid.value)
  assert(t == null || t.getThreadState == Thread.State.TERMINATED)
}
good idea!
Test build #106584 has finished for PR 24699 at commit
thanks, merging to master!
Thanks @cloud-fan @jiangxb1987 @viirya
What changes were proposed in this pull request?
PythonRunner executes tasks in an asynchronous way: it produces elements in WriteThread but consumes them in another thread. When a child operator, like take()/first(), does not consume all elements produced by WriteThread, the task finishes before WriteThread and releases all of its locks on blocks. However, WriteThread continues to produce elements by pulling from the parent operator until it exhausts them all. At that point it tries to release the corresponding block lock but hits an AssertionError, since the task has already released that lock.
#24542 previously fixed this by catching the AssertionError, so that we don't fail the executor.
However, even without PySpark, the issue still exists when a user implements a custom RDD or task that spawns a separate child thread to consume the iterator from a cached parent RDD. Below is a demo which easily reproduces the issue.
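A minimal sketch of such a reproduction, assembled from the test fragments quoted in the review above (the 100ms sleep, the element count, and the object name are assumptions, not the author's original demo):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A task over a cached RDD spawns a child thread that keeps consuming the cached
// block's iterator after the task itself has returned. Without this PR, the child
// thread eventually tries to release a block lock that the completed task already
// released, and the executor hits an AssertionError.
object Spark27666Repro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[1]").setAppName("SPARK-27666 repro"))
    val cached = sc.parallelize(1 to 10, 1).cache()
    cached.count() // materialize the cached block so the next read takes a block lock

    cached.mapPartitions { iter =>
      val t = new Thread(new Runnable {
        override def run(): Unit = iter.foreach(_ => Thread.sleep(100))
      })
      t.setDaemon(true)
      t.start()
      Iterator(0) // the task finishes without consuming the cached iterator
    }.collect()

    Thread.sleep(2000) // keep the JVM alive long enough for the child thread to finish
    sc.stop()
  }
}
```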
So, if we prevent the separate thread from releasing the lock on the block when the TaskContext has already completed, we won't hit this issue again.
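A sketch of that guard, pieced together from the diff excerpts quoted in the review above (it lives inside BlockManager, so logWarning and blockInfoManager come from the surrounding class; this shows the shape of the fix, not necessarily the exact merged code):

```scala
// Skip the unlock when the owning task has already completed, because Spark
// released all of that task's block locks at task completion.
def releaseLock(blockId: BlockId, taskContext: Option[TaskContext] = None): Unit = {
  val taskAttemptId = taskContext.map(_.taskAttemptId())
  // SPARK-27666: a child thread spawned from the task thread could otherwise race
  // with task completion and try to release a lock that is no longer held.
  if (taskContext.isDefined && taskContext.get.isCompleted) {
    logWarning(s"Task ${taskAttemptId.get} already completed, not releasing lock for $blockId")
  } else {
    blockInfoManager.unlock(blockId, taskAttemptId)
  }
}
```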
How was this patch tested?
Added a new unit test in RDDSuite.