[SPARK-26560][SQL]:Repeat select on HiveUDF fails #23921

nivo091 · 2019-02-28T14:48:32Z

What changes were proposed in this pull request?

In spark-shell, the thread's classloader is IMainTranslatingClassLoader, which comes from IMain (scala.tools.nsc.interpreter). On the first select, Spark registers the function and loads the Hive UDF JAR into the SparkContext classloader, which is NonClosableMutableURLClassLoader. On the second select, the function is already registered, so Spark tries to fetch the class from IMainTranslatingClassLoader, which fails with an AnalysisException.

Changing the context classloader of the current thread to NonClosableMutableURLClassLoader solves this issue.
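
To illustrate the lookup failure described above, here is a minimal Scala sketch (not Spark code) of how resolving the same class name succeeds or fails depending on which classloader is consulted. `LoaderVisibility`, `FakeUdf`, and `replLikeLoader` are hypothetical names, and the parentless loader is only an assumed stand-in for a REPL loader that cannot see the UDF JAR.

```scala
object LoaderVisibility {
  // Hypothetical stand-in for a UDF class that lives in a user-supplied JAR.
  class FakeUdf

  // True if `loader` can resolve `className`, false on ClassNotFoundException.
  def visibleFrom(loader: ClassLoader, className: String): Boolean =
    try {
      Class.forName(className, false, loader)
      true
    } catch {
      case _: ClassNotFoundException => false
    }

  val udfName: String = classOf[FakeUdf].getName

  // The loader that actually defined the class (plays the role of the jar classloader).
  val definingLoader: ClassLoader = classOf[FakeUdf].getClassLoader

  // Assumption: a loader with no parent sees only bootstrap classes,
  // so it cannot resolve FakeUdf (plays the role of the REPL loader).
  val replLikeLoader: ClassLoader = new ClassLoader(null: ClassLoader) {}
}
```

Under these assumptions, `visibleFrom(definingLoader, udfName)` succeeds while `visibleFrom(replLikeLoader, udfName)` does not, which is the shape of the failure seen on the second select.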

HyukjinKwon · 2019-03-01T05:42:35Z

Can you fix the PR title? It sounds like we have a problem in SELECT of SparkSQL. The problem is about Hive UDF in a specific case.

HyukjinKwon · 2019-03-01T05:44:30Z

Also, please describe how the current PR fixes the issue in the PR description.

nivo091 · 2019-03-01T06:58:17Z

I have updated the PR description and title, please review. @HyukjinKwon

nivo091 · 2019-03-01T07:31:07Z

@HyukjinKwon This issue happens only in spark-shell, which is why I added that to the title. Is that not required?

sujith71955 · 2019-03-01T08:53:59Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala

Pull the curly brace up onto the same line as `finally`.

sujith71955 · 2019-03-01T08:59:24Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala

Is this try block still required? Please check once again.

We can remove this inner try and move the catch block to the outer try.

sujith71955 · 2019-03-02T17:54:48Z

@nivo091 Please correct the PR title; its format does not match the standard used by the community.

nivo091 · 2019-03-05T06:44:41Z

@sujith71955 : Handled, thanks

sujith71955 · 2019-03-07T04:38:02Z

cc @srowen @HyukjinKwon

nivo091 · 2019-03-14T11:18:32Z

cc @xiao Li

SparkQA · 2019-03-14T14:32:50Z

Test build #4621 has finished for PR 23921 at commit 4855060.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-03-18T05:54:54Z

Can you make the PR description concise? It's difficult to read and follow. Why does the classloader change the second time? Can you also point to the related code?

nivo091 · 2019-03-19T14:47:58Z

Spark-shell has the IMainTranslatingClassLoader, which is loaded from IMain (scala.tools.nsc.interpreter).
For the first select of the function, it loads the Hive UDF JAR in the lines below.

spark/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

Lines 168 to 169 in e402de5

    
    session.sharedState.jarClassLoader.addURL(jarURL)
    Thread.currentThread().setContextClassLoader(session.sharedState.jarClassLoader)

AmplabJenkins · 2019-09-16T18:15:41Z

Can one of the admins verify this patch?

HeartSaVioR · 2019-12-27T05:02:21Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala

-    Try(super.makeFunctionExpression(name, clazz, input)).getOrElse {
-      var udfExpr: Option[Expression] = None
-      try {
+    val originalClassLoader = Thread.currentThread().getContextClassLoader()


There is no need to modify the original code; it only needs to switch the context classloader. Use Utils.withContextClassLoader and pass the original code in as fn.
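
For readers unfamiliar with the pattern being suggested, here is a minimal sketch of a switch-and-restore helper. It is a local stand-in written for illustration, modelled on the helper name mentioned in this comment; Spark's own `Utils.withContextClassLoader` may differ in detail, and `LoaderSwitch` and `observedLoader` are hypothetical names.

```scala
object LoaderSwitch {
  // Illustrative stand-in, not Spark's implementation: run `fn` with
  // `ctxClassLoader` as the thread's context classloader, then restore
  // the previous loader even if `fn` throws.
  def withContextClassLoader[T](ctxClassLoader: ClassLoader)(fn: => T): T = {
    val thread = Thread.currentThread()
    val original = thread.getContextClassLoader
    try {
      thread.setContextClassLoader(ctxClassLoader)
      fn
    } finally {
      thread.setContextClassLoader(original)
    }
  }

  // Hypothetical "original code": reports the context loader it runs under.
  def observedLoader(): ClassLoader =
    Thread.currentThread().getContextClassLoader
}
```

Because the body is passed by name, the existing code can be wrapped without restructuring its try/catch blocks, which seems to be the point of the suggestion.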

HeartSaVioR · 2019-12-27T05:08:41Z

By the way, please follow the PR description template. Also, the PR title should describe what the patch fixes rather than what the bug was.

HeartSaVioR · 2019-12-27T13:52:29Z

I'm taking this over, as it's a bit old and a test should be added as well. Please refer to #27025.
I've retained the commit to give proper credit. Thanks!

…ardless of current thread context classloader

### What changes were proposed in this pull request?

This patch is based on #23921 but revised to be simpler, and it adds a UT to test the behavior. (This patch contains the commit from #23921 to retain credit.)

Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into the jar classloader in shared state, and changes the current thread's context classloader to the jar classloader, as many parts of the remaining code rely on the current thread's context classloader.

This works if further queries run in the same thread and the thread's context classloader does not change, but once the context classloader of the current thread is switched back for any reason, Spark fails to create an instance of the class for the function.

This bug mostly affects spark-shell, as spark-shell rolls back the current thread's context classloader at every prompt. But it may also affect the job-server case, where queries may run in multiple threads.

This patch fixes the issue by switching the context classloader to the classloader which loads the class. Hopefully FunctionBuilder created by `makeFunctionBuilder` has the information of Class as a part of its closure, hence the Class itself can be provided regardless of the current thread's context classloader.

### Why are the changes needed?

Without this patch, end users cannot execute a Hive UDF using JAR twice in spark-shell.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT.

Closes #27025 from HeartSaVioR/SPARK-26560-revised.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Co-authored-by: nivo091 <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

…r regardless of current thread context classloader

### What changes were proposed in this pull request?

This patch is based on #23921 but revised to be simpler, and it adds a UT to test the behavior. (This patch contains the commit from #23921 to retain credit.)

Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into the jar classloader in shared state, and changes the current thread's context classloader to the jar classloader, as many parts of the remaining code rely on the current thread's context classloader.

This works if further queries run in the same thread and the thread's context classloader does not change, but once the context classloader of the current thread is switched back for any reason, Spark fails to create an instance of the class for the function.

This bug mostly affects spark-shell, as spark-shell rolls back the current thread's context classloader at every prompt. But it may also affect the job-server case, where queries may run in multiple threads.

This patch fixes the issue by switching the context classloader to the classloader which loads the class. Hopefully FunctionBuilder created by `makeFunctionBuilder` has the information of Class as a part of its closure, hence the Class itself can be provided regardless of the current thread's context classloader.

### Why are the changes needed?

Without this patch, end users cannot execute a Hive UDF using JAR twice in spark-shell.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT.

Closes #27075 from HeartSaVioR/SPARK-26560-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
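
The idea in the commit message above, a builder that carries the Class in its closure and switches to that class's own loader, can be sketched roughly as follows. This is an assumption-laden illustration, not the code from #27025: `BuilderSketch`, `FakeUdf`, and `makeBuilder` are hypothetical names, and the by-name lookup merely stands in for downstream code that consults the context classloader.

```scala
object BuilderSketch {
  // Hypothetical stand-in for a Hive UDF class loaded from a user JAR.
  class FakeUdf

  // Hypothetical builder: the closure captures `clazz`, so the loader that
  // defined it stays reachable whatever the caller's context loader is.
  def makeBuilder(clazz: Class[_]): () => Class[_] = () => {
    val thread = Thread.currentThread()
    val original = thread.getContextClassLoader
    thread.setContextClassLoader(clazz.getClassLoader)
    try {
      // Stand-in for dependent code that resolves the class by name
      // through the context classloader.
      Class.forName(clazz.getName, false, thread.getContextClassLoader)
    } finally {
      // Hand the caller's loader back once the work is done.
      thread.setContextClassLoader(original)
    }
  }
}
```

In this sketch the builder keeps working even when the calling thread's context classloader has been reset to one that cannot see the class, and the caller's loader is left untouched afterwards.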

nivo091 changed the title ~~SPARK-26560:Repeat select fail fix~~ SPARK-26560:Repeat select on HiveUDF fails in Spark-shell Mar 1, 2019

HyukjinKwon changed the title ~~SPARK-26560:Repeat select on HiveUDF fails in Spark-shell~~ SPARK-26560:Repeat select on HiveUDF fails Mar 1, 2019

sujith71955 reviewed Mar 1, 2019

nivo091 force-pushed the udf_fix branch from cb14734 to 22df313 Compare March 1, 2019 11:08

nivo091 changed the title ~~SPARK-26560:Repeat select on HiveUDF fails~~ [SPARK-26560][SQL]:Repeat select on HiveUDF fails Mar 5, 2019

nivo091 force-pushed the udf_fix branch from 22df313 to 4855060 Compare March 5, 2019 06:43

[SPARK-26560][SQL]:Repeat select on HiveUDF fails

51405b0

nivo091 force-pushed the udf_fix branch from 4855060 to 51405b0 Compare March 19, 2019 15:00

dongjoon-hyun added the SQL label Jun 14, 2019

maropu mentioned this pull request Dec 15, 2019

[SPARK-30260][SQL] Spark-Shell throw ClassNotFoundException exception for more than one statement to use UDF jar #26888

Closed

HeartSaVioR reviewed Dec 27, 2019

HeartSaVioR mentioned this pull request Dec 27, 2019

[SPARK-26560][SQL] Spark should be able to run Hive UDF using jar regardless of current thread context classloader #27025

Closed

HeartSaVioR mentioned this pull request Jan 2, 2020

[SPARK-26560][SQL][2.4] Spark should be able to run Hive UDF using jar regardless of current thread context classloader #27075

Closed

HyukjinKwon closed this Jan 16, 2020
