[SPARK-26560][SQL] Spark should be able to run Hive UDF using jar regardless of current thread context classloader #27025
private var threadContextClassLoader: ClassLoader = _
override protected def beforeEach(): Unit = {
beforeEach and afterEach are needed to make the new UT pass, as some tests change the current thread's context classloader to the jar classloader.
The ideal fix would be to not change the current thread's context classloader in addJar and still make things work. I guess there are many places relying on the context classloader, so that would require broader changes.
I mentioned Hive UDF as I haven't heard of creating a Spark UDF function using a jar. Please let me know if that's not the case.
Test build #115861 has finished for PR 27025 at commit
…lassloader for tests in SQLQuerySuite didn't work
Test build #115880 has finished for PR 27025 at commit
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
|CREATE FUNCTION udtf_count3
|AS 'org.apache.hadoop.hive.contrib.udtf.example.GenericUDTFCount3'
|USING JAR '$jarURL'
""".stripMargin)
nit: indent
Hmm... the jar shouldn't be in classpath. What about
sorry, but I don't have a smart idea, either... cc: @HyukjinKwon @cloud-fan
Test build #115896 has finished for PR 27025 at commit
It doesn't seem to allow same name - I'll just rename the jar to
Test build #115913 has finished for PR 27025 at commit
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
Test build #115953 has finished for PR 27025 at commit
udfExpr.get.asInstanceOf[HiveGenericUDTF].elementSchema // Force it to check data types.
// Current thread context classloader may not be the one loaded the class. Need to switch
// context classloader to initialize instance properly.
Utils.withContextClassLoader(clazz.getClassLoader) {
is it guaranteed that clazz.getClassLoader is the sharedState.jarClassLoader?
If the class is from the classpath (not loaded via addJar), its classloader would be the Spark classloader instead of jarClassLoader, though jarClassLoader may be able to load it as it contains the Spark classloader. So just changing to jarClassLoader may work in most cases, but this approach also works for a classloader which dynamically loads classes, because we use the classloader which actually "loaded" the class we want to instantiate.
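The point about using the classloader that actually loaded the class can be illustrated with a minimal, JDK-only sketch. This is not Spark code: `withContextClassLoader` below is a hypothetical stand-in written for illustration (Spark's own helper is `Utils.withContextClassLoader`), and an empty `URLClassLoader` plays the role of the jar classloader.

```scala
import java.net.{URL, URLClassLoader}

object ContextClassLoaderSketch {
  // Hypothetical stand-in for Spark's Utils.withContextClassLoader:
  // swap the thread's context classloader, run the body, always restore.
  def withContextClassLoader[T](loader: ClassLoader)(body: => T): T = {
    val thread = Thread.currentThread()
    val original = thread.getContextClassLoader
    thread.setContextClassLoader(loader)
    try body finally thread.setContextClassLoader(original)
  }

  def main(args: Array[String]): Unit = {
    val appLoader = getClass.getClassLoader
    // Assumption: an empty URLClassLoader whose parent is the application
    // loader stands in for the jar classloader discussed in the review.
    val jarLikeLoader = new URLClassLoader(Array.empty[URL], appLoader)
    try {
      // A class that is already on the classpath is found through parent
      // delegation, so its defining loader is the parent, not jarLikeLoader.
      val clazz = jarLikeLoader.loadClass(getClass.getName)
      assert(clazz.getClassLoader eq appLoader)

      // Switching to the loader that defined the class works no matter
      // which loader the thread currently carries as its context loader.
      val seenInside = withContextClassLoader(clazz.getClassLoader) {
        Thread.currentThread().getContextClassLoader
      }
      assert(seenInside eq appLoader)
    } finally jarLikeLoader.close()
  }
}
```

The sketch only shows the classloader bookkeeping; instantiating a real Hive UDF additionally depends on whatever the UDF itself looks up through the context classloader.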
cloud-fan
left a comment
LGTM except one question
thanks, merging to master! @HeartSaVioR can you open another PR for 2.4?
Thanks all for reviewing and merging!
Sure, I'll submit a PR for 2.4 as well. Thanks!
…ardless of current thread context classloader
This patch is based on apache#23921 but revised to be simpler, as well as adds UT to test the behavior.
(This patch contains the commit from apache#23921 to retain credit.)
Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into jar classloader in shared state, and changes current thread's context classloader to jar classloader as many parts of remaining codes rely on current thread's context classloader.
This would work if the further queries will run in same thread and there's no change on context classloader for the thread, but once the context classloader of current thread is switched back by various reason, Spark fails to create instance of class for the function.
This bug mostly affects spark-shell, as spark-shell will roll back current thread's context classloader at every prompt. But it may also affect the case of job-server, where the queries may be running in multiple threads.
This patch fixes the issue via switching the context classloader to the classloader which loads the class. Hopefully FunctionBuilder created by `makeFunctionBuilder` has the information of Class as a part of closure, hence the Class itself can be provided regardless of current thread's context classloader.
Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell.
No.
New UT.
Closes apache#27025 from HeartSaVioR/SPARK-26560-revised.
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Co-authored-by: nivo091 <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
// mismatch, etc. Here we catch the exception and throw AnalysisException instead.
if (classOf[UDF].isAssignableFrom(clazz)) {
udfExpr = Some(HiveSimpleUDF(name, new HiveFunctionWrapper(clazz.getName), input))
udfExpr.get.dataType // Force it to check input data types.
Found a potential problem: here we call HiveSimpleUDF.dataType (which is a lazy val) to force loading the class with the correct classloader.
However, if the expression gets transformed later, which copies HiveSimpleUDF, then calling HiveSimpleUDF.dataType on the copy will re-trigger the class loading, and at that time there is no guarantee that the correct classloader is used.
I think we should materialize the loaded class in HiveSimpleUDF.
@HeartSaVioR can you take a look?
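The general Scala behavior behind this concern can be sketched in isolation: a `lazy val` belongs to one instance, so a copy of a case class evaluates it again. `FakeUdf` and `loadCount` below are hypothetical names invented for the illustration; the sketch does not model `HiveSimpleUDF` or `HiveFunctionWrapper` themselves.

```scala
import java.util.concurrent.atomic.AtomicInteger

object LazyValCopySketch {
  // Counts how many times the simulated "class loading" runs.
  val loadCount = new AtomicInteger(0)

  // Hypothetical stand-in for an expression whose dataType triggers class
  // loading on first access.
  final case class FakeUdf(className: String) {
    lazy val dataType: String = {
      // Real code would resolve className through a classloader here.
      loadCount.incrementAndGet()
      s"type-of-$className"
    }
  }

  def main(args: Array[String]): Unit = {
    val udf = FakeUdf("com.example.MyUdf")
    udf.dataType
    udf.dataType
    // The lazy val is evaluated once per instance.
    assert(loadCount.get == 1)

    // Copying produces a fresh instance whose lazy val is not yet evaluated,
    // so the "loading" runs again, under whatever context classloader the
    // thread has at that later point.
    val copied = udf.copy()
    copied.dataType
    assert(loadCount.get == 2)
  }
}
```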
Thanks for pinging me.
Could you please confirm my understanding? My knowledge for resolving this issue came from debugging (like, reverse-engineering), so I'm not sure I get it 100%.
If my understanding is correct, this seems to be a simple reproducer:
// uses classloader which loads clazz
val udf = HiveSimpleUDF(name, new HiveFunctionWrapper(clazz.getName), input)
udf.dataType
val newUdf = udf.makeCopy(Array.empty)
// change classloader which doesn't load clazz
newUdf.dataType
yup, HiveSimpleUDF needs to load the class when dataType is first called. So even if we load the class here in HiveSessionCatalog, once HiveSimpleUDF is copied during transformation, it needs to load the class again.
I figured out that the above code doesn't give an error - HiveFunctionWrapper stores the instance, which is copied in makeCopy() - so once the instance is created it doesn't seem to require changing the classloader.
That said, the code below gives an error:
// uses classloader which loads clazz
val udf = HiveGenericUDTF(name, new HiveFunctionWrapper(clazz.getName), input)
// make sure HiveFunctionWrapper.createFunction is not called here
// change classloader which doesn't load clazz
val newUdf = udf.makeCopy(udf.productIterator.map(_.asInstanceOf[AnyRef]).toArray)
newUdf.dataType
Interestingly, as shown below, if we do makeCopy with the classloader which loads clazz, it also doesn't give any error:
// uses classloader which loads clazz
val udf = HiveGenericUDTF(name, new HiveFunctionWrapper(clazz.getName), input)
// make sure HiveFunctionWrapper.createFunction is not called here
val newUdf = udf.makeCopy(udf.productIterator.map(_.asInstanceOf[AnyRef]).toArray)
// change classloader which doesn't load clazz
newUdf.dataType
We force a call to .dataType after creating HiveXXXUDF, so if my understanding is correct it won't matter.
Could you please check whether my observation is correct, or let me know if I'm missing something?
The experimental UT code I used is below (added to SQLQuerySuite.scala):
test("SPARK-26560 ...experimenting Wenchen's comment...") {
  // force to use Spark classloader as other test (even in other test suites) may change the
  // current thread's context classloader to jar classloader
  Utils.withContextClassLoader(Utils.getSparkClassLoader) {
    withUserDefinedFunction("udtf_count3" -> false) {
      val sparkClassLoader = Thread.currentThread().getContextClassLoader
      // This jar file should not be placed to the classpath; GenericUDTFCount3 is slightly
      // modified version of GenericUDTFCount2 in hive/contrib, which emits the count for
      // three times.
      val jarPath = "src/test/noclasspath/TestUDTF-spark-26560.jar"
      val jarURL = s"file://${System.getProperty("user.dir")}/$jarPath"
      val className = "org.apache.hadoop.hive.contrib.udtf.example.GenericUDTFCount3"
      sql(
        s"""
          |CREATE FUNCTION udtf_count3
          |AS '$className'
          |USING JAR '$jarURL'
        """.stripMargin)
      assert(Thread.currentThread().getContextClassLoader eq sparkClassLoader)
      // JAR will be loaded at first usage, and it will change the current thread's
      // context classloader to jar classloader in sharedState.
      // See SessionState.addJar for details.
      sql("SELECT udtf_count3(a) FROM (SELECT 1 AS a FROM src LIMIT 3) t")
      assert(Thread.currentThread().getContextClassLoader ne sparkClassLoader)
      assert(Thread.currentThread().getContextClassLoader eq
        spark.sqlContext.sharedState.jarClassLoader)
      // uses classloader which loads clazz
      val name = "default.udtf_count3"
      val input = Array(AttributeReference("a", IntegerType, nullable = false)())
      val udf = HiveGenericUDTF(name, new HiveFunctionWrapper(className), input)
      // FIXME: uncommenting below line will lead test passing
      // udf.dataType
      // Roll back to the original classloader and run query again. Without this line, the test
      // would pass, as thread's context classloader is changed to jar classloader. But thread
      // context classloader can be changed from others as well which would fail the query; one
      // example is spark-shell, which thread context classloader rolls back automatically. This
      // mimics the behavior of spark-shell.
      Thread.currentThread().setContextClassLoader(sparkClassLoader)
      // FIXME: doing this "within" the context classloader which loads the UDF class will
      // lead test passing even we comment out udf.dataType
      val newUdf = udf.makeCopy(udf.productIterator.map(_.asInstanceOf[AnyRef]).toArray)
      newUdf.dataType
    }
  }
}
OK let me put my findings: If you look at HiveFunctionWrapper.createFunction, it says we don't cache the instance for Simple UDF
def createFunction[UDFType <: AnyRef](): UDFType = {
  if (instance != null) {
    instance.asInstanceOf[UDFType]
  } else {
    val func = Utils.getContextOrSparkClassLoader
      .loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
    if (!func.isInstanceOf[UDF]) {
      // We cache the function if it's no the Simple UDF,
      // as we always have to create new instance for Simple UDF
      instance = func
    }
    func
  }
}
I don't know the history but I assume "we always have to create new instance for Simple UDF" is correct. I think what we can do is to cache the loaded Class as well as the instance.
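A rough sketch of what caching the loaded Class alongside the instance might look like is given below. Everything in it is an assumption made for illustration: `FunctionWrapperSketch`, `SimpleUdfMarker`, `StatelessUdf` and `blindLoader` are hypothetical names, and this is not the change that was eventually made to `HiveFunctionWrapper`.

```scala
object ClassCachingWrapperSketch {
  // Hypothetical marker standing in for Hive's simple UDF base type.
  trait SimpleUdfMarker

  // Hypothetical simple UDF: a fresh instance is required on every call.
  class StatelessUdf extends SimpleUdfMarker

  final class FunctionWrapperSketch(val functionClassName: String) {
    private var clazz: Class[_] = _   // cached once the class is resolved
    private var instance: AnyRef = _  // cached only for non-simple UDFs

    def createFunction[T <: AnyRef](): T = {
      if (instance != null) {
        instance.asInstanceOf[T]
      } else {
        if (clazz == null) {
          // Only the first call depends on the context classloader.
          val loader = Option(Thread.currentThread().getContextClassLoader)
            .getOrElse(getClass.getClassLoader)
          clazz = loader.loadClass(functionClassName)
        }
        val func = clazz.getDeclaredConstructor().newInstance().asInstanceOf[T]
        // Keep the "always a new instance for simple UDFs" assumption intact.
        if (!func.isInstanceOf[SimpleUdfMarker]) instance = func
        func
      }
    }
  }

  // A loader that cannot find anything, mimicking a context classloader
  // that was rolled back to one which does not know the UDF class.
  val blindLoader: ClassLoader = new ClassLoader(null) {
    override def loadClass(name: String): Class[_] =
      throw new ClassNotFoundException(name)
  }

  def main(args: Array[String]): Unit = {
    val thread = Thread.currentThread()
    val original = thread.getContextClassLoader
    val wrapper = new FunctionWrapperSketch(classOf[StatelessUdf].getName)
    try {
      thread.setContextClassLoader(classOf[StatelessUdf].getClassLoader)
      val first = wrapper.createFunction[StatelessUdf]()
      thread.setContextClassLoader(blindLoader)
      // Still succeeds: the Class was cached by the first call.
      val second = wrapper.createFunction[StatelessUdf]()
      assert(first ne second)
    } finally thread.setContextClassLoader(original)
  }
}
```

With this shape, a wrapper that has resolved its class once no longer cares which context classloader the thread carries later, while simple UDFs still get a new instance per call.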
Oh OK. I missed the case where we don't cache the function. Thanks for the pointer!
I'll try to reproduce the finding, and fix it without touching that assumption.
What changes were proposed in this pull request?
This patch is based on #23921 but revised to be simpler, as well as adds UT to test the behavior.
(This patch contains the commit from #23921 to retain credit.)
Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into the jar classloader in shared state, and changes the current thread's context classloader to the jar classloader, as many parts of the remaining code rely on the current thread's context classloader.
This works if further queries run in the same thread and the thread's context classloader does not change, but once the context classloader of the current thread is switched back for any reason, Spark fails to create an instance of the class for the function.
This bug mostly affects spark-shell, as spark-shell rolls back the current thread's context classloader at every prompt. But it may also affect the job-server case, where queries may run in multiple threads.
This patch fixes the issue by switching the context classloader to the classloader which loaded the class. Fortunately, the FunctionBuilder created by `makeFunctionBuilder` holds the Class as part of its closure, hence the Class itself can be provided regardless of the current thread's context classloader.
Why are the changes needed?
Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New UT.