@ferenc-csaky
Collaborator

Added the functions to the function-common module, under a new com.datasqrl.flinkrunner.functions.common package.

Also unified the package naming in every function module:

  • replaced stdlib with functions in the math and openai modules
  • added flinkrunner in the text module

@ferenc-csaky ferenc-csaky force-pushed the add-common-functions-from-sqrl branch from f2d81af to abd4a39 Compare June 12, 2025 12:54
@mbroecheler
Contributor

mbroecheler commented Jun 13, 2025

The stdlib is an important keyword that is used by DataSQRL to give you an easy way to import these functions. It automatically discovers all functions under com.datasqrl.stdlib which can then be imported via:

IMPORT stdlib.commons.noop;

We need to document that convention somewhere in this repository so that we don't break it in the DataSQRL repo ;-)
For the system functions the package name is not important since those are auto-discovered by their interface and added automatically. For the standard library functions, we use the convention to use their package path as the import path (truncating the com.datasqrl part and starting from stdlib).

Unrelated, should we use commons like Apache does?

@ferenc-csaky
Collaborator Author

I like the commons suggestion!

As far as I can tell stdlib itself does not matter, the ClasspathFunctionLoader discovers everything under com.datasqrl., and com.datasqrl.flinkrunner., so with the current impl IMPORT functions.common.noop; should work.
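To make the prefix argument concrete, here is a minimal sketch of how a fully qualified class name could map to an import path if the loader simply strips one of the two root prefixes. `ImportPathSketch` and `toImportPath` are hypothetical names used only for illustration; this is not the actual ClasspathFunctionLoader code, and the longest-prefix-first ordering is an assumption.

```java
import java.util.List;
import java.util.Optional;

public class ImportPathSketch {

  // Assumed root prefixes, taken from the discussion above. Longest first, so that
  // com.datasqrl.flinkrunner. wins over com.datasqrl. when both match.
  private static final List<String> ROOT_PREFIXES =
      List.of("com.datasqrl.flinkrunner.", "com.datasqrl.");

  /** Returns the import path for a class name, or empty if it is outside the known roots. */
  public static Optional<String> toImportPath(String className) {
    for (String prefix : ROOT_PREFIXES) {
      if (className.startsWith(prefix)) {
        return Optional.of(className.substring(prefix.length()));
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(toImportPath("com.datasqrl.flinkrunner.functions.common.noop"));
    System.out.println(toImportPath("com.datasqrl.stdlib.commons.noop"));
  }
}
```

Under these assumptions the first name maps to functions.common.noop and the second to stdlib.commons.noop, so neither depends on the stdlib keyword itself.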

I am not necessarily against stdlib, but it may be a bit confusing, because in general a standard library means a lot more than functions. IMO pretty much everything we have in this repository could be considered part of the SQL runner stdlib.

But we may double down on it and create an stdlib module that contains functions and types for starters. It is reasonable to keep connectors separated; I have mixed feelings regarding formats, but anyway. I was already thinking that types and functions are currently a bit mixed. Both json and vector have type definitions, but they also have a bunch of functions, which are placed next to the types rather than in the functions module. So maybe we can create an stdlib Maven module that merges functions and types, with a com.datasqrl.flinkrunner.stdlib. root package and a specific package per group:

  • com.datasqrl.flinkrunner.stdlib.commons.*
  • com.datasqrl.flinkrunner.stdlib.math.*
  • com.datasqrl.flinkrunner.stdlib.json.* (both type and function)
  • ...

Maybe I am missing something here, but since we create an uber JAR, I do not really see the benefit of grouping everything under specific Maven sub-modules, other than separating well-defined groups, and IMO that is currently not the case.

@mbroecheler
Contributor

You are correct @ferenc-csaky. We could use "functions". The reason we don't is that I've seen multiple users create a "functions" folder to store their UDFs in a DataSQRL project, and the import mechanism then prioritizes the "functions" folder over classpath resolution, which means the classpath functions are no longer accessible. That issue also exists for stdlib, but that name is less likely to be chosen as a folder name, based on evidence to date.
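The shadowing problem can be sketched roughly as follows, assuming a resolver that checks the project's local folders before falling back to the classpath. Every name here (`ImportResolverSketch`, `resolve`, the `local:` and `classpath:` result strings) is hypothetical and only illustrates the priority order; it is not DataSQRL's import mechanism.

```java
import java.util.Map;
import java.util.Set;

public class ImportResolverSketch {

  private final Set<String> localFolders;         // top-level folders in the user's project
  private final Map<String, String> classpathFns; // import path -> implementing class name

  public ImportResolverSketch(Set<String> localFolders, Map<String, String> classpathFns) {
    this.localFolders = localFolders;
    this.classpathFns = classpathFns;
  }

  /** Local folders win: if the first path segment names a folder, the classpath is never consulted. */
  public String resolve(String importPath) {
    String root = importPath.split("\\.", 2)[0];
    if (localFolders.contains(root)) {
      return "local:" + importPath;
    }
    String cls = classpathFns.get(importPath);
    return cls != null ? "classpath:" + cls : "unresolved";
  }
}
```

With a local "functions" folder present, an import of functions.common.noop never reaches the classpath entry, while stdlib.commons.noop still resolves because no local folder carries that name.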

Yes, it makes sense to me to keep the type and functions together. The vector functions don't make sense without the type. And vice versa. Same for jsonb.

The only reason for the sub-modules is that they allow users to pick and choose if they want to build their own Flink SQL runner, e.g. they could import only the vector type and functions.

@ferenc-csaky
Collaborator Author

ferenc-csaky commented Jun 16, 2025

That's a valid point regarding sub-modules; I was not sure whether we want to shoot for that or not. Considering everything, WDYT about the following:

  • Merge the current functions and types modules into one, called stdlib
  • Under stdlib, keep the current sub-modules, removing their suffix, so we would have the following module structure:
    • stdlib/stdlib-commons
    • stdlib/stdlib-json
    • stdlib/stdlib-math
    • stdlib/stdlib-openai
    • stdlib/stdlib-text
    • stdlib/stdlib-vector
  • The same would be applied for the packages
    • com.datasqrl.flinkrunner.stdlib.commons
    • com.datasqrl.flinkrunner.stdlib.json
    • ...

From a strict module naming perspective, not duplicating stdlib would be better IMO, but since users may want to interact with the sub-modules directly, the artifact name should definitely contain stdlib as well. We could rename the JAR and keep the module name, but I don't think it's worth the extra plugin configuration.

@velo velo requested a review from Copilot June 16, 2025 13:24
Contributor

Copilot AI left a comment


Pull Request Overview

This PR centralizes common UDFs into a shared module and standardizes package names across function libraries.

  • Introduced three new system functions (serialize_to_bytes, noop, hash_columns) in functions-common.
  • Renamed packages from stdlib to functions in OpenAI and math modules.
  • Added an auto-service dependency (missing version) and updated the README import path.

Reviewed Changes

Copilot reviewed 36 out of 36 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| functions/openai-functions/src/main/java/com/datasqrl/flinkrunner/functions/openai/OpenAICompletions.java | Renamed package from stdlib.openai to functions.openai. |
| functions/math-functions/src/main/java/com/datasqrl/flinkrunner/functions/math/* | Renamed packages from stdlib.math to functions.math. |
| functions/functions-common/src/main/java/com/datasqrl/flinkrunner/functions/common/serialize_to_bytes.java | New UDF to serialize objects to bytes. |
| functions/functions-common/src/main/java/com/datasqrl/flinkrunner/functions/common/noop.java | New no-op UDF that always returns true. |
| functions/functions-common/src/main/java/com/datasqrl/flinkrunner/functions/common/hash_columns.java | New UDF to MD5-hash multiple column values. |
| functions/functions-common/pom.xml | Added com.google.auto-service:auto-service dependency without a version. |
| README.md | Updated SQL import syntax from stdlib.[library-name].* to functions.[library-name].*. |
Comments suppressed due to low confidence (2)

functions/functions-common/src/main/java/com/datasqrl/flinkrunner/functions/common/serialize_to_bytes.java:31

  • There are no unit tests for this serialization function. Consider adding tests that verify correct byte output across a variety of input types and error scenarios.
public byte[] eval(@DataTypeHint(inputGroup = InputGroup.ANY) Object object) {

functions/functions-common/src/main/java/com/datasqrl/flinkrunner/functions/common/hash_columns.java:40

  • [nitpick] Instantiating a new MessageDigest per call can be costly. Consider reusing a ThreadLocal<MessageDigest> or pooling instances to reduce GC pressure.
var digest = MessageDigest.getInstance("MD5");
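As a rough illustration of the ThreadLocal suggestion, the sketch below keeps one MD5 MessageDigest per thread and reuses it across calls. `HashColumnsSketch` and `hashColumns` are hypothetical names, and the way columns are fed into the digest (string form, UTF-8, nulls skipped, hex output) is an assumption rather than the behavior of the merged hash_columns UDF.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashColumnsSketch {

  // One MessageDigest per thread; MessageDigest is not thread-safe, so it must not be shared.
  private static final ThreadLocal<MessageDigest> MD5 =
      ThreadLocal.withInitial(() -> {
        try {
          return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
          throw new IllegalStateException("MD5 not available", e);
        }
      });

  /** Assumption: columns are hashed via their string form and nulls are skipped. */
  public static String hashColumns(Object... columns) {
    MessageDigest digest = MD5.get();
    digest.reset(); // clear any state left behind by an earlier, aborted call
    for (Object column : columns) {
      if (column != null) {
        digest.update(column.toString().getBytes(StandardCharsets.UTF_8));
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) { // digest() also resets the instance for reuse
      hex.append(String.format("%02x", b & 0xff));
    }
    return hex.toString();
  }
}
```

Whether the reuse pays off depends on the workload, so it is worth measuring before adopting it.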

Collaborator

@velo velo left a comment


LGTM.

I expect a follow-up PR on sqrl, right?

I would change the snapshot version to make sure sqrl won't be using an old library.
The same might apply here: it might be downloading a snapshot from GitHub Packages.

@ferenc-csaky
Collaborator Author

> LGTM.
>
> I expect a follow-up PR on sqrl, right?
>
> I would change the snapshot version to make sure sqrl won't be using an old library. The same might apply here: it might be downloading a snapshot from GitHub Packages.

Correct, these changes have to be reflected on the sqrl side as well. I'm not sure whether that should hold this PR so the two are merged "together", or whether it's okay to merge this one first, then adapt sqrl, open a PR, and merge that.

@ferenc-csaky ferenc-csaky force-pushed the add-common-functions-from-sqrl branch 2 times, most recently from 6342137 to e05bac8 Compare June 17, 2025 12:33
@ferenc-csaky ferenc-csaky force-pushed the add-common-functions-from-sqrl branch from e05bac8 to 02f9f43 Compare June 18, 2025 12:15
@ferenc-csaky ferenc-csaky merged commit 26c6409 into main Jun 18, 2025
2 checks passed
@ferenc-csaky ferenc-csaky deleted the add-common-functions-from-sqrl branch June 18, 2025 12:24