Skip to content

Conversation

@learningchess2003
Copy link
Contributor

What changes were proposed in this pull request?

Add analyzer support for named function arguments.

Why are the changes needed?

Part of the project needed for general named function argument support.

Does this PR introduce any user-facing change?

We added support for named arguments for the CountMinSketchAgg and Mask SQL functions.

How was this patch tested?

A new suite was added for this test called NamedArgumentFunctionSuite.

@github-actions github-actions bot added the SQL label Jul 5, 2023
@learningchess2003
Copy link
Contributor Author

@dtenedor @anchovYu Can you leave some review comments on this new PR! I rebased it and it should be in a more compact form ready for review!

Copy link
Contributor

@anchovYu anchovYu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort 👍!

"message" : [
"Call to function <functionName> is invalid because it includes multiple argument assignments to the same name <parameterName>."
],
"sqlState" : "42704K"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that some of my previous comments are naturally collapsed so you can't see them without expanding. I'll copy those to this new PR.

The error code should be 5 characters according to https://github.com/apache/spark/tree/master/common/utils/src/main/resources/error.

* @param default The default value of the argument. If the default is none, then that means the
* argument is required. If no argument is provided, an exception is thrown.
*/
case class NamedArgument(name: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case class NamedArgument(name: String,
case class NamedArgument(
name: String,

SELECT mask(c1, replaceArg) from values('abcd-EFGH-8765-4321', 123) as t(c1, replaceArg);
SELECT mask('abcd-EFGH-8765-4321', 'A', 'w', '');
SELECT mask('abcd-EFGH-8765-4321', 'A', 'abc');
SELECT mask('abcd-EFGH-8765-4321', 'A', 'abc');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you revert the changes of this file?

@learningchess2003
Copy link
Contributor Author

@anchovYu Comments addressed!

},
"UNRECOGNIZED_PARAMETER_NAME" : {
"message" : [
"Cannot invoke function <functionName> because the function call included a named argument reference for the argument named <argumentName>, but this function does not include any signature containing an argument with this name."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help users and propose function parameters sorted by levenshtein distance? See for instance:

private[spark] def orderSuggestedIdentifiersBySimilarity(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! Added the feature.

@learningchess2003
Copy link
Contributor Author

@MaxGekk Comments addressed!

Copy link
Contributor

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another review pass, thanks again for working on this!

@learningchess2003
Copy link
Contributor Author

@dtenedor Addressed comments, let me know what you think!

Copy link
Contributor

@anchovYu anchovYu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the PR description with the doc and the previous parser PR linked? It can provide more context for other reviewers who are interested. Thanks!

},
"NAMED_ARGUMENTS_NOT_SUPPORTED" : {
"message" : [
"Named arguments are not supported for function <functionName>; please retry the query with positional arguments to the function call instead or enable the feature using the SQL config \"spark.sql.allowNamedFunctionArguments\"."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this error is triggered when the config is false right? It is only because the function itself doesn't implement the named argument trait.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you are right. There is a separate exception for this.

@github-actions github-actions bot added the DOCS label Jul 6, 2023
expressions: Seq[Expression]) : Seq[Expression] = {
val rearrangedExpressions = if (!builder.functionSignatures.isEmpty) {
val functionSignatures = builder.functionSignatures.get
if (functionSignatures.length != 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have a plan to support function overload soon, shall we restrict it at compile time? def functionSignatures: Option[Seq[FunctionSignature]] = None can be def functionSignatures: Option[FunctionSignature] = None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I imagine we probably would not get around to prioritizing overloads soon with this framework. But if we did, would it become necessary to update all the places where these function signatures are defined, from an Option[FunctionSignature] to an Option[Seq[FunctionSignature]]? In contrast, if we go with the latter syntax of a Seq of just one function signature, we wouldn't have to update all those sites later.

However, I imagine such an update would be a fairly mechanical change, and in the interim, we get better compiler enforcement. It seems like this suggestion is better for now to switch the API to return just an Option[FunctionSignature] for now.

assert(sketch == reference)
}

test("count-min sketch with named-arguments") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can remove this end-to-end test if named-function-arguments.sql includes count_min_sketch

{
"errorClass" : "WRONG_NUM_ARGS.WITHOUT_SUGGESTION",
"sqlState" : "42605",
"errorClass" : "DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the previous error is better. Shall we tune the named-parameter framework so that it always checks the num args first?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. We shouldn't change the error messages for function invocations that do not use named arguments.

val parameters: Seq[NamedArgument] = functionSignature.parameters
val firstOptionalParamIndex: Int = parameters.indexWhere(_.default.isDefined)
if (firstOptionalParamIndex != -1 &&
parameters.drop(firstOptionalParamIndex).exists(_.default.isEmpty)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: parameters.dropWhile(_.default.isEmpty).exists(_.default.isEmpty) is simpler

functionName, functionSignature)
}

val firstNamedArgIdx: Int = args.indexWhere(_.isInstanceOf[NamedArgumentExpression])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: val (posArgs, namedArgs) = args.span(!_.isInstanceOf[NamedArgumentExpression])

// The following loop checks for the following:
// 1. Unrecognized parameter names
// 2. Duplicate routine parameter assignments
val allParameterNames: Seq[String] = parameters.map(_.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to confirm: the parameter name is case-sensitive. Is it also true in other systems or the SQL standard? cc @srielau

import org.apache.spark.sql.catalyst.expressions.{Expression, NamedArgumentExpression}
import org.apache.spark.sql.errors.QueryCompilationErrors

object SupportsNamedArguments {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have this object for ease of test? It looks to me that we can move the new trait Builder[T] to a new file and define the defaultRearrange function there.

functionName: String, functionSignature: FunctionSignature) : Throwable = {
val errorMessage = s"Function $functionName has an unexpected required argument for" +
s" the provided function signature $functionSignature. All required arguments should" +
s" come before optional arguments."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s is not needed.

def multipleFunctionSignatures(functionName: String,
functionSignatures: Seq[FunctionSignature]): Throwable = {
var errorMessage = s"Function $functionName cannot have multiple method signatures." +
s" The function signatures found were: \n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s is not needed.

def positionalAndNamedArgumentDoubleReference(
functionName: String, parameterName: String) : Throwable = {
val errorClass =
"DUPLICATE_ROUTINE_PARAMETER_ASSIGNMENT.POSITIONAL_AND_NAMED_ARGUMENT_DOUBLE_REFERENCE"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not long. Is it possible to make it shorter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe DUPLICATE_ROUTINE_PARAMETER_ASSIGNMENT.BOTH_POSITIONAL_AND_NAMED?

Comment on lines 119 to 120
var candidatesString = ""
recommendations.foreach(candidatesString += _ + " ")
Copy link
Member

@MaxGekk MaxGekk Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you just use mkString, please.

"inputType" : "\"INT\"",
"paramIndex" : "1",
"requiredType" : "(\"ARRAY\" or \"MAP\")",
"sqlExpr" : "\"explode(1)\""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be explode (1,2)?

"sqlExpr" : "\"explode(1)\""
},
"queryContext" : [ {
"objectType" : "",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add more named argument tests in this file (table-valued-functions.sql) when the function invocation throws exceptions? For example use named arguments for these tests:

select * from explode(explode(array(1)))
-- !query schema
struct<>
-- !query output
org.apache.spark.sql.AnalysisException
{
"errorClass" : "UNSUPPORTED_GENERATOR.NESTED_IN_EXPRESSIONS",

I'd like to make sure the exceptions stay the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's fine. I fixed the tests now so the errors remain consistent.

Copy link
Contributor

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good, the only remaining changes left look like following Wenchen's suggestion to change the API to return an Option[FunctionSignature] instead of an Option[Seq[FunctionSignature]] for now (since we don't plan to support overloads in the short term), and some code commenting improvements.

expressions: Seq[Expression]) : Seq[Expression] = {
val rearrangedExpressions = if (!builder.functionSignatures.isEmpty) {
val functionSignatures = builder.functionSignatures.get
if (functionSignatures.length != 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I imagine we probably would not get around to prioritizing overloads soon with this framework. But if we did, would it become necessary to update all the places where these function signatures are defined, from an Option[FunctionSignature] to an Option[Seq[FunctionSignature]]? In contrast, if we go with the latter syntax of a Seq of just one function signature, we wouldn't have to update all those sites later.

However, I imagine such an update would be a fairly mechanical change, and in the interim, we get better compiler enforcement. It seems like this suggestion is better for now to switch the API to return just an Option[FunctionSignature] for now.

(name, (info, (expressions: Seq[Expression]) => builder(expressions)))
}

def generatorBuilder[T <: GeneratorBuilder : ClassTag](
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a comment for this method

val funcBuilder = (expressions: Seq[Expression]) => {
assert(expressions.forall(_.resolved), "function arguments must be resolved.")
val rearrangedExpressions = FunctionRegistry.rearrangeExpressions(name, builder, expressions)
val plan = builder.build(name, rearrangedExpressions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the val plan here since you can just return builder.build(...)


trait ExpressionBuilder {
def build(funcName: String, expressions: Seq[Expression]): Expression
trait Builder[T] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please put some kind of short general comment for this Builder class mentioning that it is an interface for describing the expected parameters for SQL functions and takes responsibility for validating provided arguments for tham.

}

/**
* A trait used for scalar valued functions that defines how their expression representations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please change this to a complete sentence ending with a period (same for L1120 below)

def positionalAndNamedArgumentDoubleReference(
functionName: String, parameterName: String) : Throwable = {
val errorClass =
"DUPLICATE_ROUTINE_PARAMETER_ASSIGNMENT.POSITIONAL_AND_NAMED_ARGUMENT_DOUBLE_REFERENCE"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe DUPLICATE_ROUTINE_PARAMETER_ASSIGNMENT.BOTH_POSITIONAL_AND_NAMED?

import org.apache.spark.sql.catalyst.expressions.{Expression, NamedArgumentExpression}
import org.apache.spark.sql.errors.QueryCompilationErrors

trait FunctionBuilderBase[T] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a general comment describing what this object represents (using full sentences). Same for NamedArgumentsSupport below.

* compare a function call with provided arguments to determine if that function
* call is a match for the function signature.
*
* @return a list of function signatures
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for this line since the above comment conveys that information.

}
}

// Check argument list size against provided parameter list length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Check argument list size against provided parameter list length
// Check the argument list size against the provided parameter list length.

}
)
}
positionalArgs ++ rearrangedNamedArgs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add an assertion that the size of returning results is equal to the parameter count in function signature? Meanwhile, build can also have similar assertion as well.

* Each function signature includes a list of parameters to which the analyzer can
* compare a function call with provided arguments to determine if that function
* call is a match for the function signature.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add comment that we only support 1) one function signature 2) required should come before optional?

}

// Check argument list size against provided parameter list length
if (parameters.size < args.length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests for this case? Is this exception only possible to hit when passing in all positional args?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already have tests for these like in the context of explode for example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Could you add one for mask in the bunch of error cases in the e2e tests?

* This method is the default routine which rearranges the arguments in positional order according
* to the function signature provided. This will also fill in any default values that exists for
* optional arguments. This method will also be invoked even if there are no named arguments in
* the argument list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add more details:

  1. keeps all positional args in the original order
  2. rearranges named arguments according to the order defined in function signature, fill in missing optional arguments with default values

@@ -1,5 +1,50 @@
-- Test for named arguments for Mask
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add more tests on passing in all positional args and filling in the defaults?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current mask-expression.sql test already handles this.

private def expressionGeneratorOuter[T <: Generator : ClassTag](name: String)
: (String, (ExpressionInfo, FunctionBuilder)) = {
val (_, (info, generatorBuilder)) = expression[T](name)
val (_, (info, builder)) = expression[T](name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It this change necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt it was because the name I created had the exact same name as the variable above and I didn't want to cause confusion. Also, in other example functions, builder is the variable name used for this kind of object, so I thought it be good to clear up the ambigiuity.

val upperCharArg = NamedArgument("upperChar", Some(Literal(Mask.MASKED_UPPERCASE)))
val lowerCharArg = NamedArgument("lowerChar", Some(Literal(Mask.MASKED_LOWERCASE)))
val digitCharArg = NamedArgument("digitChar", Some(Literal(Mask.MASKED_DIGIT)))
val otherCharArg = NamedArgument("otherChar", Some(Literal(Mask.MASKED_IGNORE)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing constructors on mask adds a type information: Literal(Mask.MASKED_IGNORE, StringType)). Could you change it?

}

// Check argument list size against provided parameter list length
if (parameters.size < args.length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Could you add one for mask in the bunch of error cases in the e2e tests?

Comment on lines 1914 to 1916
"NAMED_ARGUMENTS_NOT_SUPPORTED" : {
"message" : [
"Named arguments are not supported for function <functionName>; please retry the query with positional arguments to the function call instead."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"NAMED_ARGUMENTS_NOT_SUPPORTED" : {
"message" : [
"Named arguments are not supported for function <functionName>; please retry the query with positional arguments to the function call instead."
"NAMED_PARAMETERS_NOT_SUPPORTED" : {
"message" : [
"Named parameters are not supported for function <functionName>; please retry the query with positional arguments to the function call instead."

],
"sqlState" : "4274K"
},
"NAMED_ARGUMENTS_SUPPORT_DISABLED" : {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"NAMED_ARGUMENTS_SUPPORT_DISABLED" : {
"NAMED_PARAMETER_SUPPORT_DISABLED" : {


/**
* This is a base trait that is used for implementing builder classes that can be used to construct
* expressions or logical plans depending on if it is a table-valued or scala-valued function.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* expressions or logical plans depending on if it is a table-valued or scala-valued function.
* expressions or logical plans depending on if it is a table-valued or scalar function.

@cloud-fan
Copy link
Contributor

seems ExpressionInfoSuite has test failures.

],
"sqlState" : "428C4"
},
"UNRECOGNIZED_PARAMETER_NAME" : {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"UNRECOGNIZED_PARAMETER_NAME" : {
"UNRECOGNIZED_ARGUMENT_NAME" : {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked Serge, and the original name is right. There's probably some confusion over the technical difference between argument and parameter. Typically, parameter is what each name is referring to.

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if all tests pass

@learningchess2003
Copy link
Contributor Author

Migrating to #42020 due to some problems with branches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants