-
Notifications
You must be signed in to change notification settings - Fork 160
Updating data filters #278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Here is the parsed workflow data section: https://github.com/tsurdilo/specification/blob/updatefilters2/specification.md#Workflow-Data |
actionDataFilter: | ||
dataResultsPath: "${ .processed }" | ||
actionFilter: | ||
results: "${ .processed }" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit; if I'm understanding correctly, in actionFilter it's 'results', in stateFilter it's 'output' and in eventFilter it's 'data'? Should we try to use similar names for similar concepts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think naming them the same thing is more confusing honestly. As mentioned in pr description this uses natural language concept which imo is better:
"A state data filter can filter the state input and then filter the state output"
"An event data filter can filter the event data and merge it toStateData element X"
"An action data filter can filter elements fromStateData, it can also filter the action results and merge them toStateData element Y"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a nit; I'm fine if we disagree on this so I'll only add that for actions could be considered to have 'output'.
Also, fwiw, that description appears to be referring to the wrong filter in the 3rd line. It appears to be actionDataFilter but says 'An event filter'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops ok will fix. sorry.
so actions should have "results" / "result" or "output. I am fine with any of those. Just let me know what makes most sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If nobody else weighs in, let's just stick with results I guess.
schema/workflow.json
Outdated
}, | ||
"toStateData": { | ||
"type": "string", | ||
"description": " Workflow expression that selects a state data element to which the event payload should be added/merged into. If not specified denotes the top-level state data element." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit is confusing. Should this be a path and not an expression? Treating this as an expression implies that a user could do something like the following here: ${ .artObjects[] | select(.productionPlaces | length >= 1) | .id }
. Since this is supposed to tell the runtime where to merge the output, what should a runtime do with this expression?
By path, I mean something like .data
vs. an expression like ${ .data }
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could change to "selection expression" . would that be clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That could work, I'd be interested to hear others thoughts if they have any.
I think the really confusing part is that the examples all have '${ .data }' which to me implies something to be extracted from the stateData, rather than indicating a location in the state-data where the results should be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the "data" element? This comes from the CE format where the event payload is in the "data" context attribute. and most of the time "data" is the top-level property
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, my question isn't very clear.
"toStateData": ".foo"
This looks like exactly what it is, something indicating a location in a json structure where results
should be merged into the state data.
"toStateData": "${ .foo }"
This is completely unclear, to the uninitiated, what this is supposed to do. Is it extracting some value from .foo, and that value will tell the runtime where to merge the results
? Or, is the runtime supposed to use '.foo' as the location where the merge should occur?
The ${}
indicates to the reader that some interpolation or evaluation of that expression is to occur, and the results of that expression are what should actually be used so expressing toStateData in this way is extremely confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i do think i see your point. each expression has a result so toStateData might not fit maybe?
the problem i have is then if its not an expression, but the way you have it ".foo" is still an expression ....
the intent is to say "merge into the "foo" element of the state data and create one if it does not exist"
I think we should create an issue and bring it up in a meeting to see opinions but I think for this pr its fine
When merging, the state data element and the data (payload)/action result should have the same type, meaning | ||
that you should not merge arrays with objects or objects with arrays etc. | ||
|
||
When merging elements of type object should be done by inserting all the key-value pairs from both objects into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking more about the merge strategy this morning and I have a concern that this definition of 'merge' is effectively accumulation only. Meaning, there doesn't seem to be a way for a user to remove data they no longer need from the workflow data.
This is a problem for runtimes because for performance reasons most will want to enforce a max size for the workflow data.
I would propose that this only supports a 'shallow' merge, in that we take the output of an action (for example) and place it in the workflow data at the path specified by the action-data-filter, but we don't merge any existing data from that same path into the results we just over-write it.
For example:
workflow data
{
d1: {
foo: []
}
d2: {
bar: {
baz: 1
}
}
}
Action data filter:
{
"fromStateData": "${ .d1 }",
"results": "${ .results }",
"toStateData": ".d2"
}
The action returns:
{ results: {} }
The resulting state data after the action-data-filter is applied would be:
{
d1: {
foo: []
}
d2: {}
}
This has the benefit for workflows where the use-case is simply to pass data between some actions and not accumulate all results from all actions in the workflow, then the action-data-filter could specify:
{
"fromStateData": "${ . }",
"results": "${ .results }",
"toStateData": "."
}
This would have the impact of passing the current state-data as input to the action, and over-writing the whole state-data with the output of the action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the main driver to keep the workflow data as small as possible (for performance) is the state filters
i think a good practice for big event data/ large function results would be to first trim them down using event/action filters before merging them to state data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This deep-merging behavior is complex, and leads to failure modes that users won't expect like being surprised by a runtime error when they merge 2 data-structures that overlap, but some of the data-types are different (array vs. object vs string etc).
I suspect most users would naturally expect a shallow merge and that very few users would actually want/need deep-merging. Further, it adds new failure modes that the users have to watch for (like the disparate types problem) and adds a lot of complexity for runtimes to implement.
Finally, if a user really wants to deep-merge some data they've collected from multiple actions, if we were to only support shallow-merge then they can still do the deep-merge via fromStateData before passing the merged data to an action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that sounds reasonable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still seems to be describing a somewhat-deep merge. I'm confused on whether we've resolved this or not.
I'd like to propose the following (happy to move this to a follow-up issue if you prefer):
- Support simple insert, no merging.
- Add the ability for customers to reference the state data in their actionDataFilter.results or eventDataFilter.data expressions, which gives them the power to specify custom 'merge' behavior.
WHY?
Because this is much simpler to understand, there are no surprises about deep or shallow merging its just a simple insert by default. It also lets users leverage the power of JQ to do custom 'merge' strategies.
Consider the example:
results: ${.}
toStateData: .customer
This would simply insert the action result at the location .customer in state data. If something already exists there in state data, this would completely over-write that data.
What if users really want to do merge?
We can give users much more flexibility, though it requires that the runtime make the state data available as a pre-defined jq variable.
With that we can give the user all kinds of flexibility.
(These examples assume the action result, referenced by .
is a customer object)
Here's an example where we do a simple merge via actionDataFilter:
results: ${$stateData.customer + .}
toStateData: .customer
The above example takes customer from stateData and shallow merges it with the action result, reassigning the results back to .customer in the state data.
Here's an example where we do a recursive merge via actionDataFilter:
results: ${$stateData.customer * .}
toStateData: .customer
The above example takes customer from stateData and recursively merges it with the action result, reassigning the results back to .customer in the state data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and made a separate issue for this: #286
@ricardozanini @jorgenj |
@@ -480,7 +480,7 @@ into. With this, after our action executes the state data would be: | |||
|
|||
| Parameter | Description | Type | Required | | |||
| --- | --- | --- | --- | | |||
| data | Workflow expression that filters of the event data (payload) | string | no | | |||
| data | Workflow expression that filters the event data (payload) | string | no | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this imply that they can only access the 'data' portion of the event and not other attributes of the event?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the context attributes of the CE are to be used when you define the event (event definition) and when you define your correlations. When state consumes that defined event we are dealing with its data (payload) only and we require it to be json format. given that the CE event itself is not always json format I think that is reasonable. wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a brief skim of CE, the first thing I notice is an ID attribute that is a top-level attribute of the CE. It's supposed to uniquely identify the event. With this behavior, it's impossible for a workflow to reference that ID, right?
The subject and time attributes are other top-level attributes of a CE that seem like they might be useful for a workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the event that you consume is defined in the event definition and the correlation.
a CE over http can look like this:
POST /someresource HTTP/1.1
Host: webhook.example.com
ce-specversion: 1.0
ce-type: com.example.someevent
ce-time: 2018-04-05T03:56:24Z
ce-id: 1234-1234-1234
ce-source: /mycontext/subcontext
.... further attributes ...
Content-Type: application/json; charset=utf-8
Content-Length: nnnn
{
... application data ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you want those data to be available put them in the payload of the event that is json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is just an example given in the CE spec doc that is written in json. CE has a number of defined protocol bindings,
https://github.com/cloudevents/spec
see the *-protocol-bindings.md documents
we can enforce CE for defining what events workflow needs with the required context attributes, but we cannot enforce that the CE message is in json format.
the runtimes typically get the entire event (whatever protocol/format it comes in) and can add all the relevant information, but from our point of view how can we assure that every single one does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are limitations ..."data" being json is pretty much a standard, but we cannot for example use the
https://github.com/cloudevents/spec/blob/v1.0.1/spec.json#L75
element which is a shame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we allowed access to the full cloud-event, then actually we could use that, couldn't we? User could pass the following as an argument to a function, for example: ${ .data_base64 }
I actually think this implies that we get better integration w/ cloud-events if we give the user access to the full set of all attributes of the cloud-event.
but we cannot enforce that the CE message is in json format
I agree that we can't enforce what wire-format the runtime receives the cloud-event in. But I don't see why we can't say that the runtime should present any received event in the json form to the workflow.
I understand that cloud-event has lots of different protocol bindings, but afaik, none of them stop us from presenting the cloud-event as the json form of a cloud-event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorgenj ok. i will try to get some CE guy in one of our meetings so we can ask them all these questions. would it be ok if we leave it as it is for now (just data) as that part did not change, and raise an issue to change this across the board if we want soon?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created an issue: #284
rebased |
@jorgenj fixed the issue regarding state data input/output. Is there anything else that needs to be addressed per your comments? |
Results of the `input` expression should become the state data. | ||
Results of the `output` expression should become the state data output. | ||
|
||
For more information on this you can reference the [data merging](#Data-Merging) section. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit; I think it can be pretty confusing that this mentions merging when no merging is done for state data filters...
Signed-off-by: Tihomir Surdilovic <[email protected]>
Signed-off-by: Tihomir Surdilovic <[email protected]>
Signed-off-by: Tihomir Surdilovic <[email protected]>
Signed-off-by: Tihomir Surdilovic <[email protected]>
Signed-off-by: Tihomir Surdilovic <[email protected]>
Signed-off-by: Tihomir Surdilovic <[email protected]>
Signed-off-by: Tihomir Surdilovic <[email protected]>
Signed-off-by: Tihomir Surdilovic <[email protected]>
Signed-off-by: Tihomir Surdilovic <[email protected]>
Signed-off-by: Tihomir Surdilovic <[email protected]>
rebased |
Signed-off-by: Tihomir Surdilovic [email protected]
Many thanks for submitting your Pull Request ❤️!
Please specify parts this PR updates:
What this PR does / why we need it:
Updates data filter. Main changes;
"A state data filter can filter the state input and then filter the state output"
"An event data filter can filter the event data and merge it toStateData element X"
"An action data filter can filter elements fromStateData, it can also filter the action results and merge them toStateData element Y"
(the bolded words are the actual property names of the filters)
Other changes
Special notes for reviewers:
All the changes can be pretty much summarized with the following example state: