-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[ES|QL] TEXT_EMBEDDING function definition #135059
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ES|QL] TEXT_EMBEDDING function definition #135059
Conversation
|
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
🔍 Preview links for changed docs |
ℹ️ Important: Docs version tagging👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version. We use applies_to tags to mark version-specific features and changes. Expand for a quick overviewWhen to use applies_to tags:✅ At the page level to indicate which products/deployments the content applies to (mandatory) What NOT to do:❌ Don't remove or replace information that applies to an older version 🤔 Need help?
|
carlosdelest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Some minor nits about a common interface and some missing capability checks.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Analyzer.java
Show resolved
Hide resolved
.../src/main/java/org/elasticsearch/xpack/esql/expression/function/inference/TextEmbedding.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/elasticsearch/xpack/esql/expression/function/inference/TextEmbedding.java
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/AnalyzerTestUtils.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/VerifierTests.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/VerifierTests.java
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/VerifierTests.java
Show resolved
Hide resolved
Co-authored-by: Carlos Delgado <[email protected]>
…esql_text_embedding_function_definition
| @Param( | ||
| name = InferenceFunction.INFERENCE_ID_PARAMETER_NAME, | ||
| type = { "keyword" }, | ||
| description = "Identifier of the inference endpoint" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to add a function example and then regen the docs
otherwise Kibana CI will fail when we try to bring the json specification of this function to Kibana even if it's under snapshot.
this happened recently for decay too #134705 - opened a separate PR to address this #135094 since I promised to look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be fixed in Kibana CI so it does not break in such a case.
I mean, it is expected for a function that have appliesTo set to FunctionAppliesToLifecycle.DEVELOPMENT to be incomplete. Especially examples come at the very last when writing CSV tests.
Anyway, I added an placeholder example so I will not break anything, It will be replaced when adding more realistic CSV tests.
...plugin/esql/src/test/java/org/elasticsearch/xpack/esql/inference/InferenceResolverTests.java
Outdated
Show resolved
Hide resolved
| plan.forEachExpressionUp(UnresolvedFunction.class, f -> { | ||
| String functionName = snapshotRegistry.resolveAlias(f.name()); | ||
| if (snapshotRegistry.functionExists(functionName)) { | ||
| FunctionDefinition def = snapshotRegistry.resolveFunction(functionName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usually this is done at the analyzer level - it's a bit odd to do the function resolution here, but I don't see another simpler option for now 🤷♀️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, there is other case where this kind of analysis is done outside of the analyzer (enrich policy or indices resolution).
Because the analyzer is purely synchronous and inference resolution has to be done async, it is necessary to do it this way.
| return BytesRefs.toString(e.fold(FoldContext.small())); | ||
| } | ||
|
|
||
| public String inferenceId(UnresolvedFunction f, FunctionDefinition def) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this method is actually needed, should this be private? maybe also add a javadoc to explain what it does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments and improved the code readability.
| } | ||
|
|
||
| @Override | ||
| public void writeTo(StreamOutput out) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to serialize this function - since we will always resolve it on the coordinator and replace it with its result.
We have other instances in ES|QL where we don't serialize - FORK, ROW, RENAME come to mind:
But this would probably be the first function that does not require serialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to keep the serialization even if the expression is not supposed to be moved between node.
The first reason is because the Function are supposed to implement it. If for some reason the infrastructure of execution need to move them between node in the future, I would not like to be the black sheep that make things more difficult.
Also TextEmbeddingTests extends AbstractFunctionTestCase which expect the function to be serializable.
I consider that the implementation or serialization / deserialization is simple to implement and it is probably not beneficial to let it unimplemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first reason is because the Function are supposed to implement it.
That does not mean it should. If the text_embedding function is actually serialized and sent between nodes, that's an execution path that should never happen.
Anything that extends from LogicalPlan is also supposed to implement serialization, but as I said before there are many examples of plans where we don't serialize.
We could have made the same argument for logical plans that the serialization is simple to implement, or that it's not beneficial to let it unimplemented.
However raising an exception in the case where we attempt to serialize these plans is intentional and it captures very clearly the fact that this should never happen.
I get that making this unserializable is more work, especially for AbstractFunctionTestCase, but at least then the behaviour would be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels to me that we are removing some code that we will ultimately reintroduce when we will support non constant input text embeddings. Anyway, I pushed a version without serialization so we can move forward and merge this PR.
| if (snapshotRegistry.functionExists(functionName)) { | ||
| FunctionDefinition def = snapshotRegistry.resolveFunction(functionName); | ||
| if (InferenceFunction.class.isAssignableFrom(def.clazz())) { | ||
| String inferenceId = inferenceId(f, def); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InferenceFunction has an inferenceId method already - I wonder if we can just use that, instead of trying to see which function param is called INFERENCE_ID_PARAMETER_NAME.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My problem is that f is not an InferenceFunction but is an UnresoivedFunction.
The only way to access the inference id is to read the argument that have the right name.
By convention I supposed that it will always be called inference_id for all the InferenceFunction we will implement in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, the fact that we are looking for the argument that is called inference_id leaks an internal implementation detail of the function.
We could also try and resolve the function - the same way the analyzer does:
elasticsearch/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Analyzer.java
Lines 1405 to 1406 in 8de918b
| FunctionDefinition def = functionRegistry.resolveFunction(functionName); | |
| f = uf.buildResolved(configuration, def); |
this would still not be ideal though - I am okay to leave this as it is - but we should reopen the discussion of making the analyzer support async rules.
| public LogicalPlan apply(LogicalPlan plan, AnalyzerContext context) { | ||
| return plan.transformDown(InferencePlan.class, p -> resolveInferencePlan(p, context)); | ||
| return plan.transformDown(InferencePlan.class, p -> resolveInferencePlan(p, context)) | ||
| .transformExpressionsOnly(InferenceFunction.class, f -> resolveInferenceFunction(f, context)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking question for my ES|QL education: Can you help me understand why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plan.transformDown(InferencePlan.class, p -> resolveInferencePlan(p, context))
will transform all the children in the plan that have the class InferencePlan with the result of the resolveInferencePlan(p, context) call. Inference plans are typically command using inference: RERANK and COMPLETION
.transformExpressionsOnly(InferenceFunction.class, f -> resolveInferenceFunction(f, context));
will transform do the same but for InferenceFunction instead of plan. Because text embedding is our first inference function this was not yet done, so I added it.
ioanatia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks really close - I would like us to look a bit more into the serialization aspect before merging
| if (snapshotRegistry.functionExists(functionName)) { | ||
| FunctionDefinition def = snapshotRegistry.resolveFunction(functionName); | ||
| if (InferenceFunction.class.isAssignableFrom(def.clazz())) { | ||
| String inferenceId = inferenceId(f, def); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, the fact that we are looking for the argument that is called inference_id leaks an internal implementation detail of the function.
We could also try and resolve the function - the same way the analyzer does:
elasticsearch/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Analyzer.java
Lines 1405 to 1406 in 8de918b
| FunctionDefinition def = functionRegistry.resolveFunction(functionName); | |
| f = uf.buildResolved(configuration, def); |
this would still not be ideal though - I am okay to leave this as it is - but we should reopen the discussion of making the analyzer support async rules.
| } | ||
|
|
||
| @Override | ||
| public void writeTo(StreamOutput out) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first reason is because the Function are supposed to implement it.
That does not mean it should. If the text_embedding function is actually serialized and sent between nodes, that's an execution path that should never happen.
Anything that extends from LogicalPlan is also supposed to implement serialization, but as I said before there are many examples of plans where we don't serialize.
We could have made the same argument for logical plans that the serialization is simple to implement, or that it's not beneficial to let it unimplemented.
However raising an exception in the case where we attempt to serialize these plans is intentional and it captures very clearly the fact that this should never happen.
I get that making this unserializable is more work, especially for AbstractFunctionTestCase, but at least then the behaviour would be correct.
…esql_text_embedding_function_definition
This PR introduces a new TEXT_EMBEDDING function to ES|QL, enabling users to generate dense vector embeddings for text directly within their queries.
Examples
Example 1: Generating embeddings
Example 2: Semantic search with KNN
This work is part of #131079