Updating data filters #278


Merged — 10 commits merged into serverlessworkflow:master on Mar 9, 2021

Conversation

tsurdilo
Contributor

@tsurdilo tsurdilo commented Mar 3, 2021

Signed-off-by: Tihomir Surdilovic [email protected]

Many thanks for submitting your Pull Request ❤️!

Please specify parts this PR updates:

  • [x] Specification
  • [x] Schema
  • [x] Examples
  • [ ] Extensions
  • [x] Roadmap
  • [ ] Use Cases
  • [ ] Community
  • [ ] TCK
  • [ ] Other

What this PR does / why we need it:
Updates data filters. Main changes:

  1. Adds the ability to specify the state data element into which event data / action results should be merged. Before this, the merge target was assumed to be the entire state data.
  2. Updates data filter properties to follow a natural-language naming convention. For example:
    "A state data filter can filter the state input and then filter the state output"
    "An event data filter can filter the event data and merge it toStateData element X"
    "An action data filter can filter elements fromStateData, it can also filter the action results and merge them toStateData element Y"
    (the highlighted words — input, output, data, fromStateData, results, toStateData — are the actual property names of the filters)
  3. Added rules for merging data elements of type object, array, string, and number.

  Other changes:
  4. Restructured the workflow data section so its parts fit together.
  5. Added a new image showing the relationship between control-flow logic and data flow during workflow execution.
  6. Fixed some descriptions in the JSON schemas.
  7. Fixed errors and updated text in the workflow data sections.

Special notes for reviewers:
All the changes can be pretty much summarized with the following example state:

        {
            "name": "WaitForCustomerToArrive",
            "type": "event",
            "onEvents": [{
                "eventRefs": ["CustomerArrivesEvent"],
                "eventFilter": {
                    "data": "${ .customer }",
                    "toStateData": "${ .customerInfo }"
                },
                "actions":[
                    {
                        "functionRef": {
                            "refName": "greetingFunction",
                            "arguments": {
                                "greeting": "${ .spanish } ",
                                "customerName": "${ .customerInfo.name } "
                            }
                        },
                        "actionFilter": {
                            "fromStateData": "${ greetings.hello }",
                            "results": "${ .greetingMessageResult }",
                            "toStateData": "${ .finalCustomerGreeting }"
                        }
                    }
                ]
            }],
            "stateFilter": {
                "input": "${ .greetings } ",
                "output": "${ .finalCustomerGreeting }"
            },
            "end": true
        }
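
For illustration (not part of the PR), a rough jq sketch of how a runtime could apply the eventFilter above, assuming jq as the expression language, hypothetical data, and a simple shallow object merge (the exact merge semantics are discussed further down in this thread); run with `jq -n -f event-filter.jq`:

    # Assumed state data before the event arrives:
    { "customerInfo": {}, "greetings": { "hello": "Hello", "spanish": "Hola" } }
    # Assumed payload of the consumed CustomerArrivesEvent:
    | { "customer": { "name": "John Michaels" } } as $eventData
    # "data": "${ .customer }" filters the event payload down to the part to keep:
    | ($eventData | .customer) as $filtered
    # "toStateData": "${ .customerInfo }" names the state data element to merge it into:
    | .customerInfo = (.customerInfo + $filtered)
    # => { "customerInfo": { "name": "John Michaels" }, "greetings": { "hello": "Hello", "spanish": "Hola" } }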
  

**Additional information (if needed):**

@tsurdilo tsurdilo requested a review from ricardozanini as a code owner March 3, 2021 04:53
@tsurdilo tsurdilo added the "examples" and "area: spec (Changes in the Specification)" labels Mar 3, 2021
@tsurdilo tsurdilo added this to the v0.6 milestone Mar 3, 2021
@tsurdilo tsurdilo linked an issue Mar 3, 2021 that may be closed by this pull request
@tsurdilo
Contributor Author

tsurdilo commented Mar 3, 2021

    actionDataFilter:
      dataResultsPath: "${ .processed }"

becomes

    actionFilter:
      results: "${ .processed }"
Contributor

Nit: if I'm understanding correctly, in actionFilter it's 'results', in stateFilter it's 'output', and in eventFilter it's 'data'? Should we try to use similar names for similar concepts?

Contributor Author
@tsurdilo tsurdilo Mar 3, 2021

Honestly, I think naming them the same thing is more confusing. As mentioned in the PR description, this uses a natural-language concept, which imo is better:
"A state data filter can filter the state input and then filter the state output"
"An event data filter can filter the event data and merge it toStateData element X"
"An action data filter can filter elements fromStateData, it can also filter the action results and merge them toStateData element Y"

Contributor

It's a nit; I'm fine if we disagree on this, so I'll only add that actions could also be considered to have 'output'.

Also, fwiw, that description appears to be referring to the wrong filter in the 3rd line. It appears to be actionDataFilter but says 'An event filter'.

Contributor Author

Oops, ok, will fix. Sorry.
So actions should have "results" / "result" or "output". I am fine with any of those; just let me know what makes the most sense.

Contributor

If nobody else weighs in, let's just stick with results, I guess.

},
"toStateData": {
"type": "string",
"description": " Workflow expression that selects a state data element to which the event payload should be added/merged into. If not specified denotes the top-level state data element."
Contributor

This bit is confusing. Should this be a path and not an expression? Treating this as an expression implies that a user could do something like the following here: ${ .artObjects[] | select(.productionPlaces | length >= 1) | .id }. Since this is supposed to tell the runtime where to merge the output, what should a runtime do with this expression?

By path, I mean something like .data vs. an expression like ${ .data }.

Contributor Author

Could change to "selection expression". Would that be clearer?

Contributor

That could work; I'd be interested to hear others' thoughts if they have any.

I think the really confusing part is that the examples all have '${ .data }', which to me implies something to be extracted from the stateData, rather than indicating a location in the state data where the results should be merged.

Contributor Author

Do you mean the "data" element? This comes from the CE format, where the event payload is in the "data" context attribute, and most of the time "data" is the top-level property.

Contributor
@jorgenj jorgenj Mar 4, 2021

I'm sorry, my question isn't very clear.

   "toStateData": ".foo"

This looks like exactly what it is, something indicating a location in a json structure where results should be merged into the state data.

   "toStateData": "${ .foo }"

To the uninitiated, it is completely unclear what this is supposed to do. Is it extracting some value from .foo, and that value will tell the runtime where to merge the results? Or is the runtime supposed to use '.foo' as the location where the merge should occur?

The ${} indicates to the reader that some interpolation or evaluation of that expression is to occur, and the results of that expression are what should actually be used, so expressing toStateData in this way is extremely confusing.

Contributor Author

I do think I see your point; each expression has a result, so toStateData might not fit, maybe?
The problem I have is that even if we say it's not an expression, the way you have it, ".foo" is still an expression...
The intent is to say "merge into the 'foo' element of the state data, and create one if it does not exist".

I think we should create an issue and bring it up in a meeting to gather opinions, but I think for this PR it's fine.

When merging, the state data element and the data (payload)/action result should have the same type, meaning
that you should not merge arrays with objects or objects with arrays etc.

When merging elements of type object should be done by inserting all the key-value pairs from both objects into
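
As an aside (not from the PR), a quick illustration in jq syntax of the object-merge rule quoted above, with hypothetical data; how conflicting keys are handled is not covered by the excerpt:

    # Merging two elements of type object: key-value pairs from both end up in the result.
    { "name": "John" } + { "zip": "94110" }
    # => { "name": "John", "zip": "94110" }
    # Per the rule above, mismatched types (e.g. an array merged into an object) should not be merged.
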
Contributor

I was thinking more about the merge strategy this morning and I have a concern that this definition of 'merge' is effectively accumulation only. Meaning, there doesn't seem to be a way for a user to remove data they no longer need from the workflow data.

This is a problem for runtimes because for performance reasons most will want to enforce a max size for the workflow data.

I would propose that this only supports a 'shallow' merge, in that we take the output of an action (for example) and place it in the workflow data at the path specified by the action-data-filter, but we don't merge any existing data from that same path into the results; we just over-write it.

For example:
workflow data

{
  "d1": {
    "foo": []
  },
  "d2": {
    "bar": {
      "baz": 1
    }
  }
}

Action data filter:

{
   "fromStateData": "${ .d1 }",
   "results": "${ .results }",
   "toStateData": ".d2"
}

The action returns:

{ "results": {} }

The resulting state data after the action-data-filter is applied would be:

{
  "d1": {
    "foo": []
  },
  "d2": {}
}

This has the benefit that, for workflows where the use case is simply to pass data between some actions rather than accumulate all results from all actions in the workflow, the action-data-filter could specify:

{
   "fromStateData": "${ . }",
   "results": "${ .results }",
   "toStateData": "."
}

This would have the impact of passing the current state-data as input to the action, and over-writing the whole state-data with the output of the action.
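
For illustration (not part of the review thread), a minimal jq sketch of the over-write behavior proposed above, using the same hypothetical data; run with `jq -n -f overwrite.jq`:

    # State data before the action runs:
    { "d1": { "foo": [] }, "d2": { "bar": { "baz": 1 } } }
    # Raw return value of the action:
    | { "results": {} } as $actionOutput
    # "results": "${ .results }" filters the action output:
    | ($actionOutput | .results) as $filtered
    # "toStateData": ".d2" -- plain insert, no merge; whatever was at .d2 is over-written:
    | .d2 = $filtered
    # => { "d1": { "foo": [] }, "d2": {} }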

Contributor Author
@tsurdilo tsurdilo Mar 3, 2021

The main driver for keeping the workflow data as small as possible (for performance) is the state filters.
I think a good practice for big event data / large function results would be to first trim them down using event/action filters before merging them into state data.

Contributor

This deep-merging behavior is complex and leads to failure modes that users won't expect, like being surprised by a runtime error when they merge two data structures that overlap but some of the data types are different (array vs. object vs. string, etc.).

I suspect most users would naturally expect a shallow merge and that very few users would actually want/need deep merging. Further, it adds new failure modes that users have to watch for (like the disparate-types problem) and adds a lot of complexity for runtimes to implement.

Finally, if we were to only support shallow merge and a user really wants to deep-merge some data they've collected from multiple actions, they can still do the deep merge via fromStateData before passing the merged data to an action.

Contributor Author

Ok, that sounds reasonable.

Contributor

This still seems to be describing a somewhat-deep merge. I'm confused on whether we've resolved this or not.

I'd like to propose the following (happy to move this to a follow-up issue if you prefer):

  • Support simple insert, no merging.
  • Add the ability for customers to reference the state data in their actionDataFilter.results or eventDataFilter.data expressions, which gives them the power to specify custom 'merge' behavior.

WHY?

Because this is much simpler to understand: there are no surprises about deep or shallow merging; it's just a simple insert by default. It also lets users leverage the power of JQ to do custom 'merge' strategies.

Consider the example:

results: ${.}
toStateData: .customer

This would simply insert the action result at the location .customer in state data. If something already exists there in state data, this would completely over-write that data.

What if users really want to do merge?

We can give users much more flexibility, though it requires that the runtime make the state data available as a pre-defined jq variable.

With that we can give the user all kinds of flexibility.

(These examples assume the action result, referenced by ., is a customer object.)

Here's an example where we do a simple merge via actionDataFilter:

results: ${$stateData.customer + .}
toStateData: .customer

The above example takes customer from stateData and shallow merges it with the action result, reassigning the results back to .customer in the state data.

Here's an example where we do a recursive merge via actionDataFilter:

results: ${$stateData.customer * .}
toStateData: .customer

The above example takes customer from stateData and recursively merges it with the action result, reassigning the results back to .customer in the state data.
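
For readers less familiar with jq (not part of the review thread), a small illustration with hypothetical data of the two operators used above: `+` merges objects shallowly, while `*` merges them recursively. Run with `jq -n -f merge.jq`:

    # Shallow merge: the right-hand "customer" object replaces the left-hand one wholesale.
    ({ "customer": { "name": "John" } } + { "customer": { "zip": "94110" } }),
    # => { "customer": { "zip": "94110" } }

    # Recursive merge: nested objects are merged key by key.
    ({ "customer": { "name": "John" } } * { "customer": { "zip": "94110" } })
    # => { "customer": { "name": "John", "zip": "94110" } }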

Contributor

I went ahead and made a separate issue for this: #286

@tsurdilo
Contributor Author

tsurdilo commented Mar 4, 2021

@ricardozanini @jorgenj
I applied almost all of your suggestions; let me know if there is anything left outstanding, and ty again for your reviews on this.

@@ -480,7 +480,7 @@ into. With this, after our action executes the state data would be:

| Parameter | Description | Type | Required |
| --- | --- | --- | --- |
- | data | Workflow expression that filters of the event data (payload) | string | no |
+ | data | Workflow expression that filters the event data (payload) | string | no |
Contributor

Does this imply that they can only access the 'data' portion of the event and not other attributes of the event?

Contributor Author
@tsurdilo tsurdilo Mar 4, 2021

The context attributes of the CE are to be used when you define the event (event definition) and when you define your correlations. When a state consumes that defined event, we are dealing with its data (payload) only, and we require it to be in JSON format. Given that the CE event itself is not always in JSON format, I think that is reasonable. wdyt?

Contributor

From just a brief skim of CE, the first thing I notice is an id attribute that is a top-level attribute of the CE; it's supposed to uniquely identify the event. With this behavior, it's impossible for a workflow to reference that id, right?

The subject and time attributes are other top-level attributes of a CE that seem like they might be useful for a workflow.

Contributor Author

The event that you consume is defined in the event definition and the correlation.
A CE over HTTP can look like this:

POST /someresource HTTP/1.1
Host: webhook.example.com
ce-specversion: 1.0
ce-type: com.example.someevent
ce-time: 2018-04-05T03:56:24Z
ce-id: 1234-1234-1234
ce-source: /mycontext/subcontext
.... further attributes ...
Content-Type: application/json; charset=utf-8
Content-Length: nnnn

{
... application data ...
}

Contributor Author

If you want that data to be available, put it in the payload of the event, which is JSON.

Contributor Author

That is just an example given in the CE spec doc that happens to be written in JSON. CE has a number of defined protocol bindings,
https://github.com/cloudevents/spec
see the *-protocol-bindings.md documents.

We can enforce CE for defining what events the workflow needs, with the required context attributes, but we cannot enforce that the CE message is in JSON format.
The runtimes typically get the entire event (whatever protocol/format it comes in) and can add all the relevant information, but from our point of view, how can we assure that every single one does it?

Contributor Author

There are limitations... "data" being JSON is pretty much a standard, but we cannot, for example, use the
https://github.com/cloudevents/spec/blob/v1.0.1/spec.json#L75
element, which is a shame.

Contributor

If we allowed access to the full cloud-event, then we actually could use that, couldn't we? A user could pass the following as an argument to a function, for example: ${ .data_base64 }

I actually think this implies that we get better integration w/ cloud-events if we give the user access to the full set of all attributes of the cloud-event.

but we cannot enforce that the CE message is in json format

I agree that we can't enforce what wire-format the runtime receives the cloud-event in. But I don't see why we can't say that the runtime should present any received event in the json form to the workflow.

I understand that cloud-event has lots of different protocol bindings, but afaik, none of them stop us from presenting the cloud-event as the json form of a cloud-event.
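
For illustration (not from the PR), here is roughly what the binary-mode HTTP example earlier in this conversation could look like when presented to the workflow in the CloudEvents JSON form; attribute values are taken from that example, and the payload is invented:

    {
      "specversion": "1.0",
      "type": "com.example.someevent",
      "time": "2018-04-05T03:56:24Z",
      "id": "1234-1234-1234",
      "source": "/mycontext/subcontext",
      "datacontenttype": "application/json",
      "data": { "customer": { "name": "John Michaels" } }
    }

With access to the full event in this form, expressions such as ${ .id }, ${ .time }, or ${ .data.customer } (and ${ .data_base64 } for binary payloads) would become possible.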

Contributor Author

@jorgenj ok. I will try to get a CE person into one of our meetings so we can ask them all these questions. Would it be ok if we leave it as it is for now (just data), since that part did not change, and raise an issue to change this across the board soon if we want?

Contributor

Created an issue: #284

@tsurdilo
Contributor Author

tsurdilo commented Mar 5, 2021

rebased

@tsurdilo
Contributor Author

tsurdilo commented Mar 5, 2021

@jorgenj fixed the issue regarding state data input/output. Is there anything else that needs to be addressed per your comments?

Results of the `input` expression should become the state data.
Results of the `output` expression should become the state data output.

For more information on this you can reference the [data merging](#Data-Merging) section.
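
As a rough jq sketch (not from the PR, hypothetical data) of what the quoted lines describe, using the stateFilter from the example state in the PR description; run with `jq -n -f input.jq`:

    # "input": "${ .greetings }" -- the result of the input expression becomes the state data:
    { "greetings": { "hello": "Hello", "spanish": "Hola" }, "orders": [] }   # assumed state data input
    | .greetings
    # => { "hello": "Hello", "spanish": "Hola" }  -- what the state works with;
    # "output" is then applied the same way when the state completes, and its
    # result becomes the state data output.
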
Contributor

Nit: I think it can be pretty confusing that this mentions merging when no merging is done for state data filters...

Tihomir Surdilovic added 10 commits March 8, 2021 22:08
Signed-off-by: Tihomir Surdilovic <[email protected]> (same sign-off on all 10 commits)
@tsurdilo
Contributor Author

tsurdilo commented Mar 9, 2021

rebased

@tsurdilo tsurdilo merged commit 996941d into serverlessworkflow:master Mar 9, 2021
Labels
area: spec (Changes in the Specification)

Successfully merging this pull request may close these issues:
Update action and event data filters

3 participants