Updating data filters #278


Merged — 10 commits merged into serverlessworkflow:master on Mar 9, 2021

Conversation

tsurdilo
Contributor

@tsurdilo tsurdilo commented Mar 3, 2021

Signed-off-by: Tihomir Surdilovic [email protected]

Many thanks for submitting your Pull Request ❤️!

Please specify parts this PR updates:

  • [x] Specification
  • [x] Schema
  • [x] Examples
  • [ ] Extensions
  • [x] Roadmap
  • [ ] Use Cases
  • [ ] Community
  • [ ] TCK
  • [ ] Other

What this PR does / why we need it:
Updates data filters. Main changes:

  1. Adds the ability to specify the state data element into which event data / action results should be merged. Before this, the merge target was assumed to be the entire state data.
  2. Updates data filter properties to follow a natural-language naming convention. For example:
    "A state data filter can filter the state input and then filter the state output"
    "An event data filter can filter the event data and merge it toStateData element X"
    "An action data filter can filter elements fromStateData, it can also filter the action results and merge them toStateData element Y"
    (the highlighted words — input, output, data, fromStateData, results, toStateData — are the actual property names of the filters)
  3. Added rules for merging data elements of type object, array, string, and number.

  Other changes:
  4. Restructured the workflow data section so its parts fit together.
  5. Added a new image showing the relationship between control-flow logic and data flow during workflow execution.
  6. Fixed some descriptions in the JSON schemas.
  7. Fixed errors and updated text in the workflow data sections.

Special notes for reviewers:
All the changes can be pretty much summarized with the following example state:

        {
            "name": "WaitForCustomerToArrive",
            "type": "event",
            "onEvents": [{
                "eventRefs": ["CustomerArrivesEvent"],
                "eventFilter": {
                    "data": "${ .customer }",
                    "toStateData": "${ .customerInfo }"
                },
                "actions":[
                    {
                        "functionRef": {
                            "refName": "greetingFunction",
                            "arguments": {
                                "greeting": "${ .spanish } ",
                                "customerName": "${ .customerInfo.name } "
                            }
                        },
                        "actionFilter": {
                            "fromStateData": "${ greetings.hello }",
                            "results": "${ .greetingMessageResult }",
                            "toStateData": "${ .finalCustomerGreeting }"
                        }
                    }
                ]
            }],
            "stateFilter": {
                "input": "${ .greetings } ",
                "output": "${ .finalCustomerGreeting }"
            },
            "end": true
        }
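
For illustration (not part of the PR), a rough jq sketch of how a runtime could apply the eventFilter above, assuming jq as the expression language, hypothetical data, and a simple shallow object merge (the exact merge semantics are discussed further down in this thread); run with `jq -n -f event-filter.jq`:

    # Assumed state data before the event arrives:
    { "customerInfo": {}, "greetings": { "hello": "Hello", "spanish": "Hola" } }
    # Assumed payload of the consumed CustomerArrivesEvent:
    | { "customer": { "name": "John Michaels" } } as $eventData
    # "data": "${ .customer }" filters the event payload down to the part to keep:
    | ($eventData | .customer) as $filtered
    # "toStateData": "${ .customerInfo }" names the state data element to merge it into:
    | .customerInfo = (.customerInfo + $filtered)
    # => { "customerInfo": { "name": "John Michaels" }, "greetings": { "hello": "Hello", "spanish": "Hola" } }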
  

**Additional information (if needed):**

@tsurdilo tsurdilo requested a review from ricardozanini as a code owner March 3, 2021 04:53
@tsurdilo tsurdilo added the "examples" and "area: spec (Changes in the Specification)" labels Mar 3, 2021
@tsurdilo tsurdilo added this to the v0.6 milestone Mar 3, 2021
@tsurdilo tsurdilo linked an issue Mar 3, 2021 that may be closed by this pull request
@tsurdilo
Contributor Author

tsurdilo commented Mar 3, 2021

    actionDataFilter:
      dataResultsPath: "${ .processed }"

becomes

    actionFilter:
      results: "${ .processed }"
Contributor

Nit: if I'm understanding correctly, in actionFilter it's 'results', in stateFilter it's 'output', and in eventFilter it's 'data'? Should we try to use similar names for similar concepts?

Contributor Author
@tsurdilo tsurdilo Mar 3, 2021

Honestly, I think naming them the same thing is more confusing. As mentioned in the PR description, this uses a natural-language concept, which imo is better:
"A state data filter can filter the state input and then filter the state output"
"An event data filter can filter the event data and merge it toStateData element X"
"An action data filter can filter elements fromStateData, it can also filter the action results and merge them toStateData element Y"

Contributor

It's a nit; I'm fine if we disagree on this, so I'll only add that actions could also be considered to have 'output'.

Also, fwiw, that description appears to be referring to the wrong filter in the 3rd line. It appears to be actionDataFilter but says 'An event filter'.

Contributor Author

Oops, ok, will fix. Sorry.
So actions should have "results" / "result" or "output". I am fine with any of those; just let me know what makes the most sense.

Contributor

If nobody else weighs in, let's just stick with results, I guess.

},
"toStateData": {
"type": "string",
"description": " Workflow expression that selects a state data element to which the event payload should be added/merged into. If not specified denotes the top-level state data element."
Contributor

This bit is confusing. Should this be a path and not an expression? Treating this as an expression implies that a user could do something like the following here: ${ .artObjects[] | select(.productionPlaces | length >= 1) | .id }. Since this is supposed to tell the runtime where to merge the output, what should a runtime do with this expression?

By path, I mean something like .data vs. an expression like ${ .data }.

Contributor Author

Could change to "selection expression". Would that be clearer?

Contributor

That could work; I'd be interested to hear others' thoughts if they have any.

I think the really confusing part is that the examples all have '${ .data }', which to me implies something to be extracted from the stateData, rather than indicating a location in the state data where the results should be merged.

Contributor Author

Do you mean the "data" element? This comes from the CE format, where the event payload is in the "data" context attribute, and most of the time "data" is the top-level property.

Contributor
@jorgenj jorgenj Mar 4, 2021

I'm sorry, my question isn't very clear.

   "toStateData": ".foo"

This looks like exactly what it is, something indicating a location in a json structure where results should be merged into the state data.

   "toStateData": "${ .foo }"

To the uninitiated, it is completely unclear what this is supposed to do. Is it extracting some value from .foo, and that value will tell the runtime where to merge the results? Or is the runtime supposed to use '.foo' as the location where the merge should occur?

The ${} indicates to the reader that some interpolation or evaluation of that expression is to occur, and the results of that expression are what should actually be used, so expressing toStateData in this way is extremely confusing.

Contributor Author

I do think I see your point; each expression has a result, so toStateData might not fit, maybe?
The problem I have is that even if we say it's not an expression, the way you have it, ".foo" is still an expression...
The intent is to say "merge into the 'foo' element of the state data, and create one if it does not exist".

I think we should create an issue and bring it up in a meeting to gather opinions, but I think for this PR it's fine.

When merging, the state data element and the data (payload)/action result should have the same type, meaning
that you should not merge arrays with objects or objects with arrays etc.

When merging elements of type object should be done by inserting all the key-value pairs from both objects into
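
As an aside (not from the PR), a quick illustration in jq syntax of the object-merge rule quoted above, with hypothetical data; how conflicting keys are handled is not covered by the excerpt:

    # Merging two elements of type object: key-value pairs from both end up in the result.
    { "name": "John" } + { "zip": "94110" }
    # => { "name": "John", "zip": "94110" }
    # Per the rule above, mismatched types (e.g. an array merged into an object) should not be merged.
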
Contributor

I was thinking more about the merge strategy this morning and I have a concern that this definition of 'merge' is effectively accumulation only. Meaning, there doesn't seem to be a way for a user to remove data they no longer need from the workflow data.

This is a problem for runtimes because for performance reasons most will want to enforce a max size for the workflow data.

I would propose that this only supports a 'shallow' merge, in that we take the output of an action (for example) and place it in the workflow data at the path specified by the action-data-filter, but we don't merge any existing data from that same path into the results; we just over-write it.

For example:
workflow data

{
  "d1": {
    "foo": []
  },
  "d2": {
    "bar": {
      "baz": 1
    }
  }
}

Action data filter:

{
   "fromStateData": "${ .d1 }",
   "results": "${ .results }",
   "toStateData": ".d2"
}

The action returns:

{ "results": {} }

The resulting state data after the action-data-filter is applied would be:

{
  "d1": {
    "foo": []
  },
  "d2": {}
}

This has the benefit that, for workflows where the use case is simply to pass data between some actions rather than accumulate all results from all actions in the workflow, the action-data-filter could specify:

{
   "fromStateData": "${ . }",
   "results": "${ .results }",
   "toStateData": "."
}

This would have the impact of passing the current state-data as input to the action, and over-writing the whole state-data with the output of the action.
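
For illustration (not part of the review thread), a minimal jq sketch of the over-write behavior proposed above, using the same hypothetical data; run with `jq -n -f overwrite.jq`:

    # State data before the action runs:
    { "d1": { "foo": [] }, "d2": { "bar": { "baz": 1 } } }
    # Raw return value of the action:
    | { "results": {} } as $actionOutput
    # "results": "${ .results }" filters the action output:
    | ($actionOutput | .results) as $filtered
    # "toStateData": ".d2" -- plain insert, no merge; whatever was at .d2 is over-written:
    | .d2 = $filtered
    # => { "d1": { "foo": [] }, "d2": {} }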

Contributor Author
@tsurdilo tsurdilo Mar 3, 2021

The main driver for keeping the workflow data as small as possible (for performance) is the state filters.
I think a good practice for big event data / large function results would be to first trim them down using event/action filters before merging them into state data.

Contributor

This deep-merging behavior is complex and leads to failure modes that users won't expect, like being surprised by a runtime error when they merge two data structures that overlap but some of the data types are different (array vs. object vs. string, etc.).

I suspect most users would naturally expect a shallow merge and that very few users would actually want/need deep merging. Further, it adds new failure modes that users have to watch for (like the disparate-types problem) and adds a lot of complexity for runtimes to implement.

Finally, if we were to only support shallow merge and a user really wants to deep-merge some data they've collected from multiple actions, they can still do the deep merge via fromStateData before passing the merged data to an action.

Contributor Author

Ok, that sounds reasonable.

Contributor

This still seems to be describing a somewhat-deep merge. I'm confused on whether we've resolved this or not.

I'd like to propose the following (happy to move this to a follow-up issue if you prefer):

  • Support simple insert, no merging.
  • Add the ability for customers to reference the state data in their actionDataFilter.results or eventDataFilter.data expressions, which gives them the power to specify custom 'merge' behavior.

WHY?

Because this is much simpler to understand: there are no surprises about deep or shallow merging; it's just a simple insert by default. It also lets users leverage the power of JQ to do custom 'merge' strategies.

Consider the example:

results: ${.}
toStateData: .customer

This would simply insert the action result at the location .customer in state data. If something already exists there in state data, this would completely over-write that data.

What if users really want to do merge?

We can give users much more flexibility, though it requires that the runtime make the state data available as a pre-defined jq variable.

With that we can give the user all kinds of flexibility.

(These examples assume the action result, referenced by ., is a customer object.)

Here's an example where we do a simple merge via actionDataFilter:

results: ${$stateData.customer + .}
toStateData: .customer

The above example takes customer from stateData and shallow merges it with the action result, reassigning the results back to .customer in the state data.

Here's an example where we do a recursive merge via actionDataFilter:

results: ${$stateData.customer * .}
toStateData: .customer

The above example takes customer from stateData and recursively merges it with the action result, reassigning the results back to .customer in the state data.
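
For readers less familiar with jq (not part of the review thread), a small illustration with hypothetical data of the two operators used above: `+` merges objects shallowly, while `*` merges them recursively. Run with `jq -n -f merge.jq`:

    # Shallow merge: the right-hand "customer" object replaces the left-hand one wholesale.
    ({ "customer": { "name": "John" } } + { "customer": { "zip": "94110" } }),
    # => { "customer": { "zip": "94110" } }

    # Recursive merge: nested objects are merged key by key.
    ({ "customer": { "name": "John" } } * { "customer": { "zip": "94110" } })
    # => { "customer": { "name": "John", "zip": "94110" } }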

Contributor

I went ahead and made a separate issue for this: #286

@tsurdilo
Contributor Author

tsurdilo commented Mar 4, 2021

@ricardozanini @jorgenj
I applied almost all of your suggestions; let me know if there is anything left outstanding, and ty again for your reviews on this.

@@ -480,7 +480,7 @@ into. With this, after our action executes the state data would be:

| Parameter | Description | Type | Required |
| --- | --- | --- | --- |
- | data | Workflow expression that filters of the event data (payload) | string | no |
+ | data | Workflow expression that filters the event data (payload) | string | no |
Contributor

Does this imply that they can only access the 'data' portion of the event and not other attributes of the event?

Contributor Author
@tsurdilo tsurdilo Mar 4, 2021

The context attributes of the CE are to be used when you define the event (event definition) and when you define your correlations. When a state consumes that defined event, we are dealing with its data (payload) only, and we require it to be in JSON format. Given that the CE event itself is not always in JSON format, I think that is reasonable. wdyt?

Contributor

From just a brief skim of CE, the first thing I notice is an id attribute that is a top-level attribute of the CE; it's supposed to uniquely identify the event. With this behavior, it's impossible for a workflow to reference that id, right?

The subject and time attributes are other top-level attributes of a CE that seem like they might be useful for a workflow.

Contributor Author

The event that you consume is defined in the event definition and the correlation.
A CE over HTTP can look like this:

POST /someresource HTTP/1.1
Host: webhook.example.com
ce-specversion: 1.0
ce-type: com.example.someevent
ce-time: 2018-04-05T03:56:24Z
ce-id: 1234-1234-1234
ce-source: /mycontext/subcontext
.... further attributes ...
Content-Type: application/json; charset=utf-8
Content-Length: nnnn

{
... application data ...
}

Contributor Author

If you want that data to be available, put it in the payload of the event, which is JSON.

Contributor Author

That is just an example given in the CE spec doc that happens to be written in JSON. CE has a number of defined protocol bindings,
https://github.com/cloudevents/spec
see the *-protocol-bindings.md documents.

We can enforce CE for defining what events the workflow needs, with the required context attributes, but we cannot enforce that the CE message is in JSON format.
The runtimes typically get the entire event (whatever protocol/format it comes in) and can add all the relevant information, but from our point of view, how can we assure that every single one does it?

Contributor Author

There are limitations... "data" being JSON is pretty much a standard, but we cannot, for example, use the
https://github.com/cloudevents/spec/blob/v1.0.1/spec.json#L75
element, which is a shame.

Contributor

If we allowed access to the full cloud-event, then we actually could use that, couldn't we? A user could pass the following as an argument to a function, for example: ${ .data_base64 }

I actually think this implies that we get better integration w/ cloud-events if we give the user access to the full set of all attributes of the cloud-event.

but we cannot enforce that the CE message is in json format

I agree that we can't enforce what wire-format the runtime receives the cloud-event in. But I don't see why we can't say that the runtime should present any received event in the json form to the workflow.

I understand that cloud-event has lots of different protocol bindings, but afaik, none of them stop us from presenting the cloud-event as the json form of a cloud-event.
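
For illustration (not from the PR), here is roughly what the binary-mode HTTP example earlier in this conversation could look like when presented to the workflow in the CloudEvents JSON form; attribute values are taken from that example, and the payload is invented:

    {
      "specversion": "1.0",
      "type": "com.example.someevent",
      "time": "2018-04-05T03:56:24Z",
      "id": "1234-1234-1234",
      "source": "/mycontext/subcontext",
      "datacontenttype": "application/json",
      "data": { "customer": { "name": "John Michaels" } }
    }

With access to the full event in this form, expressions such as ${ .id }, ${ .time }, or ${ .data.customer } (and ${ .data_base64 } for binary payloads) would become possible.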

Contributor Author

@jorgenj ok. I will try to get a CE person into one of our meetings so we can ask them all these questions. Would it be ok if we leave it as it is for now (just data), since that part did not change, and raise an issue to change this across the board soon if we want?

Contributor

Created an issue: #284

@tsurdilo
Contributor Author

tsurdilo commented Mar 5, 2021

rebased

@tsurdilo
Contributor Author

tsurdilo commented Mar 5, 2021

@jorgenj fixed the issue regarding state data input/output. Is there anything else that needs to be addressed per your comments?

Results of the `input` expression should become the state data.
Results of the `output` expression should become the state data output.

For more information on this you can reference the [data merging](#Data-Merging) section.
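
As a rough jq sketch (not from the PR, hypothetical data) of what the quoted lines describe, using the stateFilter from the example state in the PR description; run with `jq -n -f input.jq`:

    # "input": "${ .greetings }" -- the result of the input expression becomes the state data:
    { "greetings": { "hello": "Hello", "spanish": "Hola" }, "orders": [] }   # assumed state data input
    | .greetings
    # => { "hello": "Hello", "spanish": "Hola" }  -- what the state works with;
    # "output" is then applied the same way when the state completes, and its
    # result becomes the state data output.
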
Contributor

Nit: I think it can be pretty confusing that this mentions merging when no merging is done for state data filters...

Tihomir Surdilovic added 10 commits March 8, 2021 22:08
Signed-off-by: Tihomir Surdilovic <[email protected]> (same sign-off on all 10 commits)
@tsurdilo
Contributor Author

tsurdilo commented Mar 9, 2021

rebased

@tsurdilo tsurdilo merged commit 996941d into serverlessworkflow:master Mar 9, 2021
Labels
area: spec (Changes in the Specification)

Successfully merging this pull request may close these issues:
Update action and event data filters

3 participants