Add multi retryRef to action #681

Closed
lsytj0413 opened this issue Sep 15, 2022 · 19 comments · Fixed by #820
Labels
change: feature New feature or request. Impacts in a minor version change

Comments

@lsytj0413
Contributor

lsytj0413 commented Sep 15, 2022

What would you like to be added:

In the current version, only one retry strategy can be assigned to an action:

"actions": [
  {
    "functionRef": "StorePatient",
    "retryRef": "ServicesNotAvailableRetryStrategy",
    "retryableErrors": ["ServiceNotAvailable"]
  }
]

"retries": [
  {
    "name": "ServicesNotAvailableRetryStrategy",
    "delay": "PT3S",
    "maxAttempts": 10
  }
]

It would be useful if an action could reference multiple retry strategies.

Why is this needed:

  • Amazon States Language supports multiple retry strategies: https://states-language.net/spec.html#errors
  • If the action fails, the user may want to use a different retry strategy depending on the error, for example:
    • delay 1s if the error is short-term (e.g. FlowLimit)
    • delay 5s if the error is long-term (e.g. the database is not available)
@lsytj0413 lsytj0413 changed the title A调度 Add multi retryRef to action Sep 15, 2022
@ricardozanini
Member

This is interesting. I think it can be useful to add the retriable error in the retry definition as Amazon States Language does, or it will be difficult to understand the mapping of two arrays. Technically, it will be the same, but it will improve understanding.

Can be something like this:

{
   "actions":[
      {
         "functionRef":"StorePatient",
         "retryRefs":[
            "ServicesNotAvailableRetryStrategy",
            "NetworkIssue"
         ]
      }
   ],
   "retries":[
      {
         "name":"ServicesNotAvailableRetryStrategy",
         "delay":"PT3S",
         "maxAttempts":10,
         "errors":[
            "ServiceNotAvailable"
         ]
      },
      {
         "name":"NetworkIssue",
         "delay":"PT10S",
         "maxAttempts":3,
         "errors":[
            "ConnectionFailed"
         ]
      }
   ]
}
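As a rough illustration of how a runtime might resolve this shape, here is a minimal Python sketch. It assumes the proposed `retryRefs` and `errors` fields from the JSON above (neither is part of the current spec), and `find_retry` is a hypothetical helper, not an existing SDK function.

```python
from typing import Optional

# Workflow fragment mirroring the proposal above ("retryRefs" and "errors"
# are the proposed fields, not part of the current specification).
workflow = {
    "actions": [
        {
            "functionRef": "StorePatient",
            "retryRefs": ["ServicesNotAvailableRetryStrategy", "NetworkIssue"],
        }
    ],
    "retries": [
        {
            "name": "ServicesNotAvailableRetryStrategy",
            "delay": "PT3S",
            "maxAttempts": 10,
            "errors": ["ServiceNotAvailable"],
        },
        {
            "name": "NetworkIssue",
            "delay": "PT10S",
            "maxAttempts": 3,
            "errors": ["ConnectionFailed"],
        },
    ],
}


def find_retry(workflow: dict, action: dict, error: str) -> Optional[dict]:
    """Return the first referenced retry definition that lists `error`, else None."""
    by_name = {retry["name"]: retry for retry in workflow["retries"]}
    for ref in action.get("retryRefs", []):
        retry = by_name.get(ref)
        if retry is not None and error in retry.get("errors", []):
            return retry
    return None


action = workflow["actions"][0]
print(find_retry(workflow, action, "ConnectionFailed"))
```

Scanning `retryRefs` in order means the first matching definition wins, which is one possible tie-breaking rule if two definitions list the same error.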

@lsytj0413
Contributor Author

Currently, retryableErrors is in the action's definition.

@tsurdilo tsurdilo added the change: feature New feature or request. Impacts in a minor version change label Sep 16, 2022
@tsurdilo tsurdilo added this to the v0.9 milestone Sep 16, 2022
@tsurdilo
Contributor

tsurdilo commented Sep 16, 2022

I think this would be a nice feature to add (even tho possibly very hard to implement). Also, if I understand it right, it targets idempotent actions where the first error returned determines the retry policy for the remainder of the retries, and is not a dynamic retry policy recalculated on each retry based on the errors that have happened so far (more on that below).

Looking at the mentioned AWS samples:

"Retry" : [
    {
      "ErrorEquals": [ "States.Timeout" ],
      "MaxAttempts": 0
    },
    {
      "ErrorEquals": [ "States.ALL" ]
    }
]

I think this can be implemented already by setting States.Timeout as a non-retryable error.

"Retry": [
    {
      "ErrorEquals": [ "ErrorA", "ErrorB" ],
      "IntervalSeconds": 1,
      "BackoffRate": 2,
      "MaxAttempts": 2
    },
    {
      "ErrorEquals": [ "ErrorC" ],
      "IntervalSeconds": 5
    }
  ]

This is what we are talking about, I think. I'd like to know more about how they handle this when different errors have different maxAttempts values. I think this can run into a number of edge cases that would need to be clearly described.
I don't think they describe it in the linked doc, but I think this would require runtimes to deal with X number of retry policies depending on error type, and in the case of non-idempotent actions idk what a runtime would do if, let's say, the action throws a different error type on each retry... things can get pretty complicated imo.
Maybe what they say is that the retry definition is set on the first action error and is then used for the duration of all action retries. Then I think this would be much easier to deal with, but if you have to recalculate all retry policies on each retry based on the error, this is a daunting task imo. This however then dictates idempotent actions, which we would need to make a "must" in the spec; need to think about it.

@lsytj0413
Contributor Author

Can you give more examples of why this would not be useful for idempotent actions?

> I think this would be nice feature to add (even tho possibly very hard to implement). Also I think would not be very useful for idempotent actions.

imo the specification should define how to handle maxAttempts when there are different errors, e.g. use the value based on the error. For example:

  1. We have an action named CreateDisk, and the action may fail with FlowLimit or InsufficientResource
  2. When the error is InsufficientResource, we retry only 3 times
  3. When the error is FlowLimit, we retry up to 30 times

@tsurdilo
Contributor

tsurdilo commented Sep 16, 2022

Yes I updated some text in my answer above (please look again).

The retry with this is tied to an error type. In this case we can either assume that if the first invocation raises FlowLimit, it's retried up to 30 times regardless of errors raised by subsequent retries.
Or you would re-calculate the retry policy if the second retry throws InsufficientResource.

I think this is doable iff we go with the first assumption (which i think is what aws does too but not sure, let me know if you know). wdyt.
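To make the two assumptions concrete, here is a small Python sketch that counts how many retries each one would allow for a given sequence of errors. The function names and the `MAX_RETRIES` table are hypothetical; the limits are taken from the CreateDisk example earlier in the thread, and the per-error variant is only one possible reading of "re-calculate the retry policy".

```python
from typing import Dict, List

# Per-error retry limits from the CreateDisk example (illustrative only).
MAX_RETRIES: Dict[str, int] = {"FlowLimit": 30, "InsufficientResource": 3}


def retries_pinned(errors: List[str], limits: Dict[str, int]) -> int:
    """First assumption: the first error picks the policy for all retries."""
    if not errors:
        return 0
    limit = limits.get(errors[0], 0)
    # Every failure triggers a retry until the pinned limit is reached.
    return min(len(errors), limit)


def retries_per_error(errors: List[str], limits: Dict[str, int]) -> int:
    """Second assumption: each error type has its own counter, checked on every failure."""
    used: Dict[str, int] = {}
    retries = 0
    for err in errors:
        if used.get(err, 0) >= limits.get(err, 0):
            break  # this error type has exhausted its own budget
        used[err] = used.get(err, 0) + 1
        retries += 1
    return retries


# errors[i] is the error raised by the i-th failing invocation.
errors = ["FlowLimit"] + ["InsufficientResource"] * 5
print(retries_pinned(errors, MAX_RETRIES))     # 6: all failures retried under the FlowLimit policy
print(retries_per_error(errors, MAX_RETRIES))  # 4: stops at the fourth InsufficientResource
```

Under the per-error reading, a mixed sequence can be retried up to 33 times in total (30 + 3), while the pinned reading never exceeds the limit of whichever error came first.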

@lsytj0413
Contributor Author

lsytj0413 commented Sep 16, 2022

I don't know how AWS does it, but in my company's workflow engine (based on ASL), it uses the value based on the error and records the retry history. For example: CreateDisk may retry more than 30 times when the errors are a mix of FlowLimit and InsufficientResource.

> I think this is doable iff we go with the first assumption (which i think is what aws does too but not sure, let me know if you know). wdyt.

@tsurdilo
Contributor

tsurdilo commented Sep 16, 2022

Just to add as a general info thing, imo it is an anti-pattern to limit retries via MaxAttempts (outside test scenarios).
For most use cases you should limit your overall action retries using state timeout (that's what it's designed for at least imo :) )

@lsytj0413
Contributor Author

> Just to add as a general info thing, imo it is an anti-pattern to limit retries via MaxAttempts (outside test scenarios). For most use cases you should limit your overall action retries using state timeout (that's what it's designed for at least imo :) )

Yep, but MaxAttempts is useful when we want to fail fast if we support multiple retry strategies.

@tsurdilo
Contributor

tsurdilo commented Sep 16, 2022

> but MaxAttempts is useful when we want to fail fast if we support multiple retry strategies.

I think this really depends on the action timeout (single action execution limit). If you set the action timeout to a large value, and your action is "stuck" for minutes or hours you cannot fail fast even with just 2 retries.

The best way to deal with your use cases is to have your actions "heartbeat", meaning report progress, and have a heartbeat timeout set to a very small value within which the action has to report its progress/heartbeat. That way, if you detect this heartbeat timeout, you can fail the action and do your retry fast.

However, idk if we can implement this in our DSL, as I'm not sure we can enforce heartbeating on all runtimes.
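A heartbeat timeout of this kind could be tracked roughly as follows. This is a minimal Python sketch with a hypothetical `HeartbeatMonitor` class; timestamps are passed in explicitly so the logic stays deterministic, and nothing here corresponds to an existing DSL or runtime feature.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat of a running action (hypothetical helper)."""

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_beat = now  # the action start counts as the first heartbeat

    def beat(self, now: float) -> None:
        """Record that the action reported progress at time `now` (seconds)."""
        self.last_beat = now

    def timed_out(self, now: float) -> bool:
        """True once more than `timeout_s` has passed since the last heartbeat."""
        return now - self.last_beat > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=2.0)
monitor.beat(now=1.0)
print(monitor.timed_out(now=2.5))  # False: 1.5s since the last heartbeat
print(monitor.timed_out(now=3.5))  # True: 2.5s of silence, fail the action and retry
```

A runtime would poll `timed_out` with its own clock and fail the action as soon as it returns true, independently of the much larger action timeout.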

@tsurdilo
Contributor

Overall +1 on adding this, I would just like to get one or two examples of what this would look like if it's done via the mentioned dynamic retry policy calculation on each retry. Thanks!
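One way such a dynamic calculation could look is sketched below in Python. Everything here is an assumption for illustration: `ActionError`, `POLICIES` and `run_with_dynamic_retry` are hypothetical names, the limits come from the CreateDisk example, and the delays are made-up values. The policy is looked up again on every failure, and each error type keeps its own retry counter.

```python
from typing import Callable, Dict, Tuple


class ActionError(Exception):
    """Hypothetical error type carrying the workflow error name."""

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.name = name


# error name -> (delay in seconds, max retries); illustrative values only.
POLICIES: Dict[str, Tuple[float, int]] = {
    "InsufficientResource": (5.0, 3),
    "FlowLimit": (1.0, 30),
}


def run_with_dynamic_retry(
    action: Callable[[], str],
    policies: Dict[str, Tuple[float, int]],
    sleep: Callable[[float], None],
) -> str:
    """Run `action`, choosing the retry policy from the error of each failure."""
    used: Dict[str, int] = {}  # retries already spent, per error name
    while True:
        try:
            return action()
        except ActionError as err:
            policy = policies.get(err.name)
            if policy is None or used.get(err.name, 0) >= policy[1]:
                raise  # not retryable, or this error's budget is exhausted
            used[err.name] = used.get(err.name, 0) + 1
            sleep(policy[0])  # wait using the delay of the matching policy
```

Passing `time.sleep` as `sleep` gives real delays; passing a list's `append` instead records the waits, which is handy for checking the chosen policy in tests.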

@lsytj0413
Contributor Author

> I think this really depends on the action timeout (single action execution limit). If you set the action timeout to a large value, and your action is "stuck" for minutes or hours you cannot fail fast even with just 2 retries.

Based on the example, we want 3 MaxAttempts with a 5s interval for InsufficientResource, and 30 MaxAttempts with a 1s interval for FlowLimit. With a timeout alone we can only set the ActionTimeout to 30s = max(15s, 30s), but if the only error is InsufficientResource we would be stuck for only 15s (less than 30s).
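The arithmetic behind these numbers can be checked with a few lines of Python. This sketch assumes a fixed delay between retries and ignores the time each attempt itself takes; `worst_case_wait` is a hypothetical helper.

```python
def worst_case_wait(interval_s: float, max_attempts: int) -> float:
    """Total time spent waiting between retries, assuming a fixed interval."""
    return interval_s * max_attempts


insufficient_resource = worst_case_wait(interval_s=5, max_attempts=3)  # 15
flow_limit = worst_case_wait(interval_s=1, max_attempts=30)            # 30

# A single timeout has to cover the slower policy.
single_timeout = max(insufficient_resource, flow_limit)                # 30
print(insufficient_resource, flow_limit, single_timeout)
```

With per-error policies the InsufficientResource case gives up after 15s, whereas a single 30s timeout would keep retrying for twice as long.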

@lsytj0413
Contributor Author

@tsurdilo What's the next step for this? Is a PR OK?

@github-actions

github-actions bot commented Nov 8, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@lsytj0413
Contributor Author

/remove-stale

@lsytj0413
Contributor Author

/remove stale

@cdavernas
Member

cdavernas commented May 17, 2024

Closed as resolved by 1.0.0-alpha1, and therefore as part of #843

@github-project-automation github-project-automation bot moved this from In Progress to Done in Progress Tracker May 17, 2024
This was referenced May 20, 2024
Projects
Status: Done
4 participants