Skip to content

Document Data Flow and expression context more clearly #958

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Aug 8, 2024

Conversation

matthias-pichler
Copy link
Collaborator

@matthias-pichler matthias-pichler commented Aug 4, 2024

Signed-off-by: Matthias Pichler [email protected]

Please specify parts of this PR update:

  • Specification
  • Schema
  • Examples
  • Extensions
  • Use Cases
  • Community
  • CTK
  • Other

Discussion or Issue link:

Closes #921

What this PR does:

I extended the "Data Flow" section of the DSL with some clarifying statements and also added a diagram showing the data flow.

Additional information:

@cdavernas
Copy link
Member

Not sure that it's needed, as it's always the filtered input that is used, everywhere but in the input.from, which is the only exception.

@cdavernas cdavernas added change: fix Something isn't working. Impacts in a minor version change. change: documentation Improvements or additions to documentation. It won't impact a version change. area: spec Changes in the Specification labels Aug 4, 2024
@cdavernas cdavernas added this to the v1.0.0-alpha3 milestone Aug 4, 2024
@matthias-pichler
Copy link
Collaborator Author

Not sure that it's needed, as it's always the filtered input that is used, everywhere but in the input.from, which is the only exception.

@cdavernas should I then close #921 without any additional clarifications?

I have to admit though that I always have to think a bit what the context of a given expression is so I imagine clarifications could help. Furthermore it is not inherently clear (from just the names) that if runs after input.from ... maybe this should be added in the data flow section even though it's not data flow related 🤷

@cdavernas
Copy link
Member

No no, that's just my opinion, let's wait for others to kick in 😉

Anyways, more info cannot hurt, especially if you feel that's not especially intuitive or lacks clarification 😉

@cdavernas
Copy link
Member

maybe this should be added in the data flow section even though it's not data flow related 🤷

Maybe, yes! Along with some explicit graphical/schema representation of what comes/happens when, as @zolero pointed out to me in PMs!

@matthias-pichler matthias-pichler changed the title Specify task.if context Document Data Flow and expression context more clearly Aug 5, 2024
@matthias-pichler
Copy link
Collaborator Author

@cdavernas I removed the extra statement from the task.if description and added a general overview for runtime expressions and data flow 👍

@JBBianchi
Copy link
Member

More explanation cannot hurt, especially on a topic that seems confusing (see #890 for instance).

If I can add my two cents, I would get rid of the term "filter*" in favor of "transform*". Filter conveys the idea of removing part of the dataset where transform is broader and just implies mutations.

@matthias-pichler
Copy link
Collaborator Author

matthias-pichler commented Aug 5, 2024

If I can add my two cents, I would get rid of the term "filter*" in favor of "transform*". Filter conveys the idea of removing part of the dataset where transform is broader and just implies mutations.

I like it 👍 I also felt "Filtered" wasn't the right term. I will wait for feedback from the others and then change it accordingly.

@cdavernas
Copy link
Member

@matthias-pichler-warrify I liked the term filter, but @JBBianchi made a valid point. I'm totally OK with replacing it withtransform or whatever you guys feel fits better!

Copy link
Member

@cdavernas cdavernas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me, many thanks!

@matthias-pichler
Copy link
Collaborator Author

matthias-pichler commented Aug 5, 2024

export.as sometimes uses the undocumented $output variable (e.g. https://github.com/serverlessworkflow/specification/blob/main/dsl-reference.md#examples-31)

This argument exists because the export.as expressions is evaluated on the existing context and thus there is no other way to access the output. Should I:

a) add the $output argument to the docs and evaluate export.as on the existing context
b) change export.as to be evaluated on the transformed output and remove $output from the examples

@fjtirado
Copy link
Collaborator

fjtirado commented Aug 6, 2024

export.as sometimes uses the undocumented $output variable (e.g. https://github.com/serverlessworkflow/specification/blob/main/dsl-reference.md#examples-31)

This argument exists because the export.as expressions is evaluated on the existing context and thus there is no other way to access the output. Should I:

a) add the $output argument to the docs and evaluate export.as on the existing context b) change export.as to be evaluated on the transformed output and remove $output from the examples

I prefer option b). @cdavernas prefers option a). We can decide by throwing a coin, ping pong match or voting ;)

@matthias-pichler
Copy link
Collaborator Author

matthias-pichler commented Aug 6, 2024

I prefer option b). @cdavernas prefers option a). We can decide by throwing a coin, ping pong match or voting ;)

haha 😂 Luckily we have some more people that might break the tie @JBBianchi @ricardozanini
I am unsure to be honest but I think I also slightly prefer removing $output

a) add the $output argument to the docs and evaluate export.as on the existing context

has the advantage that the default expressions are always the identity (i.e. .)

b) change export.as to be evaluated on the transformed output and remove $output from the examples

has the advantage that one less argument is needed and realistically one will always use the output to update the context anyways

@cdavernas
Copy link
Member

As @fjtirado said, I have a priori a slight preference for a because it is a tad more agile imho. I however do not have a strong or irreversible opinion. As long as the user can use both raw and modified output, it's fine by me.

@matthias-pichler
Copy link
Collaborator Author

As long as the user can use both raw and modified output, it's fine by me.

That is a good point, because as I understand it, neither option allows this. Should I update #953 to include a $task.output property that holds the raw output?

@fjtirado
Copy link
Collaborator

fjtirado commented Aug 6, 2024

The output is, by default, the input of the next task, isnt it?
Then yes, we should be able to dump the original task output (the result of the rest invocation, for example) to the context.
Whatever option we chose we need careful documentation and example, since this is as subttle as a wargame manual ;)

@JBBianchi
Copy link
Member

a) add the $output argument to the docs and evaluate export.as on the existing context
b) change export.as to be evaluated on the transformed output and remove $output from the examples

My initial preference was option a because it makes sense for . to be the context, as it is the affected object. However, the argument for removing the extra $output argument, which only exists in export.as, is a valid point as well.

As long as the user can use both raw and modified output, it's fine by me.

From my understanding of the specification, $output refers to the raw output, and there is no way to access the "transformed" output because output and export are processed simultaneously. However, it is reasonable to consider that export might be processed after output.

If we agree with @matthias-pichler-warrify's proposal to make the raw output accessible to $task, we could have:

  • .: the transformed output
  • $context: the task context, which will be updated by the current operation
  • $task.output: the raw output of the task

This approach is acceptable. My only concern is that, in most cases, we would prefer to use the raw output to update the context rather than the transformed output, which is intended specifically for use by the next task.

@cdavernas
Copy link
Member

cdavernas commented Aug 6, 2024

From my understanding of the specification, $output refers to the raw output, and there is no way to access the "transformed" output because output and export are processed simultaneously. However, it is reasonable to consider that export might be processed after output.

That's partially incorrect imho. The raw ouput was should be supplied by $output, which is the reason why I created it. The transformed output is accessed with ..

On another note, I don't understand how you could ever evaluate two expression simultaneously. My understanding is to first evaluate output, then export, as explained in the schema @matthias-pichler-warrify has created

@matthias-pichler
Copy link
Collaborator Author

The transformed output is accessed with .

Currently it is not because . refers to the context object. Making it refer to the transformed output is what option b) proposes.

As it stands now only one of the raw or transformed output is available ... whatever $output refers to currently 🤷‍♂️

@matthias-pichler
Copy link
Collaborator Author

I think both options of simultaneous and in order evaluation are possible. Whichever we define as the intended semantics (independent of implementation).

Making them simultaneous forces export.as to use the raw output either as . or $output. I definitely think that the raw output should be available to export in some form though

@fjtirado
Copy link
Collaborator

fjtirado commented Aug 6, 2024

order evaluation are possible. Whichever we define as the intended semantics (independent of implementation).

Making them simultaneous forces export.as to use the raw output either as . or $output. I definitely think that the raw output should be available to export in some form though

I think we should impose an order to avoid ambiguity, so . is the transformed output and $task.output the original output always, regardless the implementation

@cdavernas
Copy link
Member

cdavernas commented Aug 6, 2024

Currently it is not because . refers to the context object.

That is not as it was initially intended, because unlike the raw input, the context was accessible thanks to $context. I might have documented it incorrectly though.

@matthias-pichler-warrify Based on which info did you conclude . was the context, therefore being an exception to all other evaluations?

@cdavernas
Copy link
Member

I think both options of simultaneous and in order evaluation are possible

Simultaneously makes no sense to me, but that's just my opinion. As a matter if fact, it's counter intuitive, error prone and confusing. Plus, I doubt that implementers will rely on parallel processing to do such trivial evaluations, therefore they will arbitrarily evaluate one or the other.

@JBBianchi
Copy link
Member

JBBianchi commented Aug 6, 2024

I think both options of simultaneous and in order evaluation are possible

Simultaneously makes no sense to me, but that's just my opinion. As a matter if fact, it's counter intuitive, error prone and confusing. Plus, I doubt that implementers will rely on parallel processing to do such trivial evaluations, therefore they will arbitrarily evaluate one or the other.

I didn't choose the proper wording I think. Just to be clear, the term "simultaneously" is used figuratively here, not to imply that the operations occur in parallel threads, but rather that they are independent and can be processed in any order. Historically, output.as and export.as were two operations related to output, to and from. These operations were processed "at the same moment, after the task returned its output" (not as "parallel tasks"). There was nothing preventing an implementer from processing one before the other, as there was no order of preference.

@cdavernas
Copy link
Member

cdavernas commented Aug 6, 2024

To conclude, my opinion is that we should indicate that the . of the export.as should be the transformed output, and to keep $output parameter (which is more consistent, knowing we already declare an $input paramater, too).

This would ensure we make no exception to . representing the "current" data. Proceeding otherwise might be confusing, at least it is to me.

Whatever we choose to proceed with, we should clarify that output should be transformed BEFORE exporting. This is IMHO more consistent, more intuitive and more logical.

On a side note, it's IMO perfectly fine that the info might be retrieved using $task.output, too: it just is an additional convenience.

What do you guys think?

@matthias-pichler
Copy link
Collaborator Author

matthias-pichler commented Aug 6, 2024

I think the split of output.from and output.to introduced some drift so the spec is no longer coherent. We should reason from the ground up and I will adapt the spec accordingly. I will try to consolidate the discussion:

  1. I think the transformed input should be available to the output.as and export.as expressions. The logical place for me to put it is $input while the raw input might live in $task.input (or for clarity $task.rawInput)

  2. If we keep $output it should most likely refer to the transformed output since $input also refers to the transformed input (as documented here). Otherwise it would be inconsistent for authors

  3. If we introduce $task.output it should probably refer to the raw output since Document content of $workflow and $task arguments #951 specifies the $task.input argument as the raw input

  4. export.as should definitely have access to the raw output since I might want to store something in the context that might not be relevant to the next task/output

  5. export.as should probably have access to the transformed output 🤷 ? If yes then a order is automatically imposed since output.as has to produce the transformed output first. Otherwise they still wouldn't be "independent" of each other. After closer inspection order does still matter since output.as has access to $context which is modified by export.as and thus order still matters and output.as before export.as seems to make more sense/be more intuitve.

  6. Other runtime expressions such as input.from and output.as have their evaluation context be the same as the data they are transforming. This means input.from and output.as both "default" to the identity expression. In jq this would be ${ . }. Since the export.as expression modifies the value of $context keeping this property would mean that the evaluation context of export.as should be the context itself.
    I however believe the authors will not think about export.as this way and will rather reason that output.as and export.as will either "be downstream of each other" or "siblings". So I think this point is not very strong.

I think this leaves us with the following options, that we can vote on:

Option 1

  • export.as is evaluated on the transformed output. i.e. export.as is downstream of output.as
  • Order is automatically imposed (output.as -> export.as)
  • $input is the transformed task input
  • $output does NOT need to exist
  • $task.output will refer to the raw output and will be set in both output.as (although redundant) and export.as

👍 Pro

  • One less runtime argument ($output)

👎 Con

  • referring to the raw output is more tedious with $task.output

Option 2

  • export.as is evaluated on the raw output.
  • $input is the transformed task input
  • $output does NOT exist and export.as does NOT have access to the transformed output. Thus some processing might need to be duplicated
  • Order is still imposed since output.as relies on the old $context. (output.as -> export.as)
  • If we want we can still introduce $task.output to refer to the raw output but it is no longer needed

👍 Pro

  • One less runtime argument ($output)
  • referring to the raw output is concise (with .)
  • export.as and output.as have the same evaluation context

👎 Con

  • authors might have to duplicate common transformations

Option 3

  • export.as is evaluated on the current context
  • $input is the transformed task input
  • $output needs to exist and will refer to the transformed output to be consistent with $input
  • Order is automatically imposed (output.as -> export.as)
  • $task.output will refer to the raw output and will be set in both output.as (although redundant) and export.as

👍 Pro

  • default runtime expressions all are the identity (i.e. ${ . })

👎 Con

  • referring to the raw output is more tedious with $task.output

Option 4

  • export.as is evaluated on the raw output.
  • $input is the transformed task input
  • $output does exist and contains the transformed output.
  • Order is automatically imposed (output.as -> export.as)
  • If we want we can still introduce $task.output to refer to the raw output but it is no longer needed

👍 Pro

  • No duplication for common transformations
  • referring to the raw output is concise (with .)
  • export.as and output.as have the same evaluation context

👎 Con

  • One more runtime argument ($output)

----- UPDATE ----

Option 5

  • export.as is evaluated on the transformed output.
  • $output does exist and contains the transformed output.
  • Order is automatically imposed (output.as -> export.as)
  • $task.output will refer to the raw output and will be set in both output.as (although redundant) and export.as

@matthias-pichler
Copy link
Collaborator Author

matthias-pichler commented Aug 6, 2024

I prefer option 4. 👍

default runtime expressions all are the identity (i.e. ${ . })

I don't think this is not a strong pro argument for option 3 and "purely theoretical" and according to #948 thus least priority

@cdavernas
Copy link
Member

cdavernas commented Aug 6, 2024

I prefer option #4 too, which was more or less the initial intent when writing the refactor (as has already been implemented in Synapse's alpha branch more than a month ago :p).

I don't see the problem with having "one more argument", on the contrary, as they are but a convenience.

@cdavernas
Copy link
Member

cdavernas commented Aug 6, 2024

If we want we can still introduce $task.output to refer to the raw output but it is no longer needed

Even though not required, I'd personally still add it for the sake of consistency. Plus, when writing an expression using $task, authors might find it clearer.

I however don't see it as an absolute must, so if you guys feel we should remove it, let's do it.

@JBBianchi
Copy link
Member

  1. Other runtime expressions such as input.from and output.as have their evaluation context be the same as the data they are transforming. This means input.from and output.as both "default" to the identity expression. In jq this would be ${ . }. Since the export.as expression modifies the value of $context keeping this property would mean that the evaluation context of export.as should be the context itself.
    I however believe the authors will not think about export.as this way and will rather reason that output.as and export.as will either "be downstream of each other" or "siblings". So I think this point is not very strong.

I agree with all the points, but the last point needs some nuance IMO. An important aspect to consider is that the evaluation contexts of input.from and output.as are respectively the raw input and output because those are the results of the previous step in the execution pipeline, not because of the target of the transformation. Therefore, it's a bit of a stretch to suggest that the evaluation context for export.as should be the $context. If we choose to position export.as as the step following output.as, it would be logical for its evaluation context to be the transformed output.

So to keep the consistency of the pipeline concept, I'd rather opt for Options 1 .
Maybe with a twist of using $output instead of $task.output. The later is a bit useless (output.as already have access to the raw output as its evaluation context) and I don't think it brings more clarity than $output, it's just makes it longer to type.

@matthias-pichler
Copy link
Collaborator Author

matthias-pichler commented Aug 7, 2024

@JBBianchi

Therefore, it's a bit of a stretch to suggest that the evaluation context for export.as should be the $context.

I agree. It's a weak argument I just included for completeness

Maybe with a twist of using $output instead of $task.output.

For me that is honestly a non-starter because it would be SUPER confusing to have $input be the transformed input and $output be the raw output.

I think that since export.as has access to the raw output via $task.output and the transformed output using $output I am flexible on the evaluation context of export.as being either the raw or transformed output.

Aligning with the "pipeline concept" makes sense so I updated the docs

@matthias-pichler
Copy link
Collaborator Author

So I think @cdavernas , @JBBianchi and I are aligned now 🎉

@fjtirado @ricardozanini what are your thoughts?

@fjtirado
Copy link
Collaborator

fjtirado commented Aug 7, 2024

Im a man of principles, so I still think Option1 is the way to go

@fjtirado
Copy link
Collaborator

fjtirado commented Aug 7, 2024

The use case I see for export (context) having access to the transformed output is the same one than when we want to just store the original output of the task. Maybe we have a future task (not the consecutive one) that want to do something with the transformed output. In other words, I think we cannot predict if users would like to store the original output or the transformed one in the context, it will heavily depends on the scenario. I also think forcing an order between output.as and export.as is a good thing.

@matthias-pichler
Copy link
Collaborator Author

matthias-pichler commented Aug 7, 2024

@fjtirado

Im a man of principles, so I still think Option1 is the way to go

which principle drives your opinion here? The only difference between option 1 and what is currently documented is that we define $output additionally

I also think forcing an order between output.as and export.as is a good thing.

Absolutely 👍 but as I stated in my summary. Order is always output.as before export.as (in all options). This is because if output.as uses $context we need to define that it should refer to the "old" context and not the one export.as produces 👍 This is currently already documented this way in this PR

I think we cannot predict if users would like to store the original output or the transformed one in the context

I agree. Authors have to have access to both raw & transformed output in the export.as expressions. Both Option 1 and the current agreement allow this 👍

To summarize:
The only difference between Option 1 and what we currently have (option 5) is that $output is defined IN ADDITION (even though redundant). And thus my question is: Are you against defining $output additionally?

@matthias-pichler
Copy link
Collaborator Author

matthias-pichler commented Aug 7, 2024

I will summarize what we have right now because it does differ from option 4. It is basically option 1 with $output defined

this PR

  • export.as is evaluated on the transformed output.
  • $output does exist and contains the transformed output.
  • Order is automatically imposed (output.as -> export.as)
  • $task.output will refer to the raw output and will be set in both output.as (although redundant) and export.as

I added it above as option 5

@ricardozanini
Copy link
Member

I didn't go through our examples, but if we need something that documents the transformation, we should add it to this PR.

@matthias-pichler
Copy link
Collaborator Author

I didn't go through our examples, but if we need something that documents the transformation, we should add it to this PR.

I went through all examples and ctk cases and didn't find anything that needed updating. The one example that used export.as was added by me and already used the current spec 😅

But thanks to your reminder I found one example in the dsl-reference that needed updating 🙏

@cdavernas cdavernas merged commit 1e50891 into serverlessworkflow:main Aug 8, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: spec Changes in the Specification change: documentation Improvements or additions to documentation. It won't impact a version change. change: fix Something isn't working. Impacts in a minor version change.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Specify context of task.if expression
5 participants