
Conversation

@FelixEngl
Contributor

Changes to old pull request

Correct branching.

Short description

Add a Kotlin-styled way to create and call UDFs in a safer manner.

Possible improvements

  • Add more unit tests?
  • Publish the generator code for the functions (but I don't know where I should put it).

Future plans

  • Add advanced type checking before calling a function, by comparing the schemas of the function parameters with the column schema?
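A rough sketch of what that pre-call check could look like. All helper names here are illustrative assumptions, not existing API; only Spark's `DataType` and `StructField` are real:

```kotlin
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.types.StructField

// Hypothetical sketch of the advanced type check: before invoking a registered
// function, compare each declared parameter type against the schema of the
// column it will be applied to.
fun checkUdfCall(parameterTypes: List<DataType>, columns: List<StructField>) {
    require(parameterTypes.size == columns.size) {
        "Expected ${parameterTypes.size} arguments but got ${columns.size} columns"
    }
    parameterTypes.zip(columns).forEach { (expected, column) ->
        require(expected == column.dataType()) {
            "Parameter of type $expected cannot take column '${column.name()}' of type ${column.dataType()}"
        }
    }
}
```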

/**
 * A shortcut for [KSparkSession.spark].udf()
 */
inline fun KSparkSession.udf(): UDFRegistration = spark.udf()
Contributor

We definitely don't need `inline`.
And why don't we put it inside `KSparkSession`?
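For illustration, moving it inside `KSparkSession` might look roughly like this. This is only a sketch of the suggestion, not the final API; the exact shape of `KSparkSession` in the project may differ:

```kotlin
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.UDFRegistration

// Sketch: expose the registration as a member instead of a top-level
// inline extension function. KSparkSession is assumed to wrap a
// SparkSession as `spark`, as elsewhere in this project.
class KSparkSession(val spark: SparkSession) {
    /** A shortcut for [SparkSession.udf]. */
    val udf: UDFRegistration get() = spark.udf()
}
```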

@@ -0,0 +1,803 @@
@file:Suppress("NOTHING_TO_INLINE", "DuplicatedCode", "MemberVisibilityCanBePrivate")
Contributor

If this file is generated (I really hope it is), please provide us with the generating code in the comments. Thanks!

Contributor Author

> Sorry for not answering for a long time, I was busy with another project. It's a very good PR, but it needs several improvements.

No problem, and I thought so (especially regarding the improvements). This was only some kind of strange brainwave of mine, and I hacked something together.

I will provide the generator code after some clean-up; I only hacked it together for my own project and thought you might like/need it. :D
If you have further suggestions, please let me know; I'll try to add them to the implementation.

(Sorry that I can't do it today, but it's 00:30 in Germany and I have a meeting in the morning.)

Contributor

There is no hurry at all, thank you for your effort.
What do you think: is there any way to support the other classes natively supported by Spark?

Contributor Author

In my fork there is a branch for calling the Scala-Java conversion wrappers. (I didn't open a merge request because I'm not really satisfied with the tests for it. I have problems testing the mutable Scala types for changes; if you could help me with that [or give me some tips on how to write the tests], I would be thankful.)

I think we can achieve some kind of "auto-conversion" by using `reified` type parameters and calling these wrapper functions (at least for the function wrappers).
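A sketch of that reified "auto-conversion" idea: wrap the user's lambda so a Kotlin `List` result goes through a conversion wrapper before Spark sees it. Here `toScalaSeq` stands in for one of the conversion wrappers from the fork's branch and `schema(...)` for the project's encoder helper; both are assumptions, and `UDFRegistration.register(name, UDF1, DataType)` is Spark's own Java API:

```kotlin
import org.apache.spark.sql.UDFRegistration
import org.apache.spark.sql.api.java.UDF1
import kotlin.reflect.typeOf

// Sketch: register a (T) -> List<R> lambda so its result is auto-converted
// to a Scala Seq. `toScalaSeq` and `schema` are hypothetical placeholders
// for the conversion wrappers and the encoder helper mentioned above.
inline fun <reified T, reified R> UDFRegistration.registerConverting(
    name: String,
    noinline func: (T) -> List<R>,
) {
    register(
        name,
        UDF1<T, scala.collection.Seq<R>> { t -> func(t).toScalaSeq() },
        schema(typeOf<List<R>>()),
    )
}
```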

I wrote this text on my smartphone, please excuse the typing errors.

Contributor

I think if you would file a PR, we could at least think about how to test something :)

Contributor Author

I'll take care of the code around Christmas. Right now I don't have enough time to work on my PRs besides my current tasks.
I also committed the PR for the wrappers (see #72).

Thank you for your feedback in advance! 😄

@asm0dey
Contributor

asm0dey commented Nov 29, 2020

Sorry for not answering for a long time, I was busy with another project. It's a very good PR, but it needs several improvements.

@asm0dey
Contributor

asm0dey commented Nov 30, 2020

Please forward-port this to version 3.0.

@asm0dey
Contributor

asm0dey commented Nov 30, 2020

I've run into the following trouble. I tried to write a more complex test:

should("also work with datasets") {
    listOf("a" to 1, "b" to 2).toDS().toDF().createOrReplaceTempView("test1")
    udf.register<String, Int, Int>("stringIntDiff") { a, b ->
        a[0].toInt() - b
    }
    spark.sql("select stringIntDiff(first, second) from test1").show()
}

and it fails with:

IntegerType (of class org.apache.spark.sql.KSimpleTypeWrapper)
scala.MatchError: IntegerType (of class org.apache.spark.sql.KSimpleTypeWrapper)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:225)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:222)
	at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1728)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:185)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
	at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:181)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:61)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
	at org.jetbrains.kotlinx.spark.api.UDFRegisterTest$1$1$3$1$2.invokeSuspend(UDFRegisterTest.kt:104)
……

Looks like encoders won't work for our primitive type wrappers.
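A possible direction, judging from the stack trace: `RowEncoder` pattern-matches on Spark's concrete `DataType` instances, so `KSimpleTypeWrapper` falls through to a `MatchError`. A hedged sketch of an unwrapping step, which mirrors the `DataType.fromJson((... as DataTypeWithClass).dt().json())` replacement used in the follow-up commits below; `DataTypeWithClass` and `dt()` are this project's wrapper types, `fromJson` is Spark's own parser:

```kotlin
import org.apache.spark.sql.types.DataType

// Sketch: round-trip the wrapper through its JSON representation so Spark's
// encoders only ever see a plain DataType instead of KSimpleTypeWrapper.
fun DataType.unwrapped(): DataType =
    if (this is DataTypeWithClass) DataType.fromJson(dt().json()) else this
```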

asm0dey and others added 5 commits December 1, 2020 01:36
# Conflicts:
#	kotlin-spark-api/2.4/src/main/kotlin/org/jetbrains/kotlinx/spark/api/UDFRegister.kt
#	kotlin-spark-api/2.4/src/test/kotlin/org/jetbrains/kotlinx/spark/api/UDFRegisterTest.kt
@asm0dey asm0dey force-pushed the master branch 2 times, most recently from db91ee0 to aa11744 Compare May 7, 2021 16:01
@asm0dey asm0dey force-pushed the main branch 4 times, most recently from a9da30c to 8e7523a Compare July 15, 2021 20:53
nonpool added a commit to nonpool/kotlin-spark-api that referenced this pull request Aug 23, 2021
@nonpool nonpool mentioned this pull request Aug 24, 2021
asm0dey added a commit that referenced this pull request Sep 13, 2021
* copy code from #67

* remove hacked RowEncoder.scala

* replace all schema(typeOf<R>()) to DataType.fromJson((schema(typeOf<R>()) as DataTypeWithClass).dt().json())

* add return udf data class test

* change test

* add in dataset test for calling the UDF-Wrapper

* add the same exception link

* refactor unWrapper

* add test for udf return a List

* make the test simpler

* add License

* add UDFRegister for 3.0

* remove useless import

* resolved deprecated method

* [experimental] add CatalystTypeConverters.scala for hacked it to implement UDF return data class

* [experimental] implement UDF return data class

* fix code inspection issue

* Adds suppre unused

Co-authored-by: can wang <[email protected]>
Co-authored-by: Pasha Finkelshteyn <[email protected]>
@Jolanrensen
Collaborator

Closing this. We don't support Spark 2 anymore, and based on the UDFs for Spark 3 we implemented the registration in #152 (released in v1.2.0).
Thanks for the help and inspiration, @FelixEngl!
