Skip to content

Conversation

@nivo091
Copy link
Contributor

@nivo091 nivo091 commented Feb 28, 2019

What changes were proposed in this pull request?

The classloader of spark-shell is IMainsTranslatingClassLoader which is loaded from IMain (scala.tools.nsc.interpreter). But on the first select, it registers the function and loads the hiveUDF jar to sparkContext classloader which is NonClosableMutuableClassLoader. While selecting the function on the second time, function is already registered so it tries to fetch from IMainTranslatingClassLoader which is giving analysis exception.

Changing of the classLoader of currentThread to NonClosableMutuableclassLoader will solve this issue.

@HyukjinKwon
Copy link
Member

Can you fix the PR title? It sounds like we have a problem in SELECT of SparkSQL. The problem is about Hive UDF in a specific case.

@HyukjinKwon
Copy link
Member

Also, please describe how the current PR fixes the issue in the PR description.

@nivo091 nivo091 changed the title SPARK-26560:Repeat select fail fix SPARK-26560:Repeat select on HiveUDF fails in Spark-shell Mar 1, 2019
@nivo091
Copy link
Contributor Author

nivo091 commented Mar 1, 2019

I have updated the PR description and title, please review. @HyukjinKwon

@HyukjinKwon HyukjinKwon changed the title SPARK-26560:Repeat select on HiveUDF fails in Spark-shell SPARK-26560:Repeat select on HiveUDF fails Mar 1, 2019
@nivo091
Copy link
Contributor Author

nivo091 commented Mar 1, 2019

@HyukjinKwon This issue happens only in Spark-shell, that's why I added in the title. Is that not required ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pull up the flower bracket inline with finally

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this try block is required now? please check once again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove this inside try and put the catch block to the outer try.

@sujith71955
Copy link
Contributor

@nivo091 Please correct the PR title format seems to be not in same as standard mentioned by the community.

@nivo091 nivo091 changed the title SPARK-26560:Repeat select on HiveUDF fails [SPARK-26560][SQL]:Repeat select on HiveUDF fails Mar 5, 2019
@nivo091
Copy link
Contributor Author

nivo091 commented Mar 5, 2019

@sujith71955 : Handled, thanks

@sujith71955
Copy link
Contributor

cc @srowen @HyukjinKwon

@nivo091
Copy link
Contributor Author

nivo091 commented Mar 14, 2019

cc @xiao Li

@SparkQA
Copy link

SparkQA commented Mar 14, 2019

Test build #4621 has finished for PR 23921 at commit 4855060.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Can you make the PR description concise? It's difficult to read and follow. Why does the classloader change in the second time? Can you also point out the related codes?

@nivo091
Copy link
Contributor Author

nivo091 commented Mar 19, 2019

Spark-shell have the IMainTranslatingClassLoader which is loaded from IMain (scala.tools.nsc.interpreter).
For the first select of function, it loads the hiveUDF jar in the below line.

session.sharedState.jarClassLoader.addURL(jarURL)
Thread.currentThread().setContextClassLoader(session.sharedState.jarClassLoader)

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Try(super.makeFunctionExpression(name, clazz, input)).getOrElse {
var udfExpr: Option[Expression] = None
try {
val originalClassLoader = Thread.currentThread().getContextClassLoader()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't need to modify original code but just switch context classloader. Then, use Utils.withContextClassLoader and just pass original code into fn.

@HeartSaVioR
Copy link
Contributor

Btw, please follow the template of PR description. And we describe PR title for what the patch fixes instead of what was the bug.

@HeartSaVioR
Copy link
Contributor

I'm taking this over, as it's a bit old and the test should be added as well. Please refer #27025.
I've retained the commit to give proper credit. Thanks!

cloud-fan pushed a commit that referenced this pull request Jan 2, 2020
…ardless of current thread context classloader

### What changes were proposed in this pull request?

This patch is based on #23921 but revised to be simpler, as well as adds UT to test the behavior.
(This patch contains the commit from #23921 to retain credit.)

Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into jar classloader in shared state, and changes current thread's context classloader to jar classloader as many parts of remaining codes rely on current thread's context classloader.

This would work if the further queries will run in same thread and there's no change on context classloader for the thread, but once the context classloader of current thread is switched back by various reason, Spark fails to create instance of class for the function.

This bug mostly affects spark-shell, as spark-shell will roll back current thread's context classloader at every prompt. But it may also affects the case of job-server, where the queries may be running in multiple threads.

This patch fixes the issue via switching the context classloader to the classloader which loads the class. Hopefully FunctionBuilder created by `makeFunctionBuilder` has the information of Class as a part of closure, hence the Class itself can be provided regardless of current thread's context classloader.

### Why are the changes needed?

Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT.

Closes #27025 from HeartSaVioR/SPARK-26560-revised.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Co-authored-by: nivo091 <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
HeartSaVioR added a commit to HeartSaVioR/spark that referenced this pull request Jan 2, 2020
…ardless of current thread context classloader

This patch is based on apache#23921 but revised to be simpler, as well as adds UT to test the behavior.
(This patch contains the commit from apache#23921 to retain credit.)

Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into jar classloader in shared state, and changes current thread's context classloader to jar classloader as many parts of remaining codes rely on current thread's context classloader.

This would work if the further queries will run in same thread and there's no change on context classloader for the thread, but once the context classloader of current thread is switched back by various reason, Spark fails to create instance of class for the function.

This bug mostly affects spark-shell, as spark-shell will roll back current thread's context classloader at every prompt. But it may also affects the case of job-server, where the queries may be running in multiple threads.

This patch fixes the issue via switching the context classloader to the classloader which loads the class. Hopefully FunctionBuilder created by `makeFunctionBuilder` has the information of Class as a part of closure, hence the Class itself can be provided regardless of current thread's context classloader.

Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell.

No.

New UT.

Closes apache#27025 from HeartSaVioR/SPARK-26560-revised.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Co-authored-by: nivo091 <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jan 2, 2020
…r regardless of current thread context classloader

### What changes were proposed in this pull request?

This patch is based on #23921 but revised to be simpler, as well as adds UT to test the behavior.
(This patch contains the commit from #23921 to retain credit.)

Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into jar classloader in shared state, and changes current thread's context classloader to jar classloader as many parts of remaining codes rely on current thread's context classloader.

This would work if the further queries will run in same thread and there's no change on context classloader for the thread, but once the context classloader of current thread is switched back by various reason, Spark fails to create instance of class for the function.

This bug mostly affects spark-shell, as spark-shell will roll back current thread's context classloader at every prompt. But it may also affects the case of job-server, where the queries may be running in multiple threads.

This patch fixes the issue via switching the context classloader to the classloader which loads the class. Hopefully FunctionBuilder created by `makeFunctionBuilder` has the information of Class as a part of closure, hence the Class itself can be provided regardless of current thread's context classloader.

### Why are the changes needed?

Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT.

Closes #27075 from HeartSaVioR/SPARK-26560-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants