-
Notifications
You must be signed in to change notification settings - Fork 39
update error code if ES dial connection error #797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
// If no ES error type is found, it implies an error on the TCP connection level. | ||
// In this case, we want to return an unavailable code so we have to option of | ||
// handling this as a user-defined error later on. | ||
return ctx, status.Errorf(codes.Unavailable, "error checking privileges for API Key %q: %v", id, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we can reach this return if the error is not of type ElasticsearchError
or if the elasticsearchErr.Status
is not unauthorized or forbidden. Are we supposed to return codes.Unavailable
for both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding here, is that an ES error cannot be returned if ES is unavailable, since no ES instance is ever "hit" that will return an ES-type-error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if codes.Unavailable
is the "correct" type to return, but it's the only one that makes sense to me, and is the closest representation of ES being down or unavailable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding here, is that an ES error cannot be returned if ES is unavailable, since no ES instance is ever "hit" that will return an ES-type-error.
Makes sense. My question was around the if statement I mentioned in the other comment. When ES returns a 400 or 429, we will end up returning codes.Unavailable
because it is not an expected status not because it returned no es error type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, I was thinking about this also before putting up the PR.
My thinking here was that:
- We can either make the error check exhaustive with a
switch
instead and explicitly handle each error type. - But since this happen on the auth layer, one cannot really get a
429: Too Many Requests
by definition, maybe you can get a400: Invalid Argument
, but in general my thinking was if a user cannot even auth, then they cant send any requests to ES, since they will all be rejected.
This is definitely a nuance that we need team consensus on. I might be completely missing something. Thanks @isaacaflores2 , I forgot to outline this in the PR desc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see now. If we want to change the default response status code then I think we also need to update this comment (or just remove it)
// If no ES error type is found, it implies an error on the TCP connection level.
// In this case, we want to return an unavailable code so we have to option of
// handling this as a user-defined error later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the correct way to handle it, Two things
- If the error is of types.ElasticsearchError, then we need to convert that to gRPC specific error and then return that. This could happen when deployment is not found (code.Internal might be better here)
- For all other unknown types, It would be better to return it as Internal Error or Unavailable as we cant reach the ES endpoint for some reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vigneshshanmugam updates in a4954a4, I have also updated the PR Desc with the step for manual testing and results.
Let me know if you want to handle any specific ES errors to make the switch
more exhaustive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get a unit test for the new branch that you're adding?
return ctx, status.Errorf(codes.Internal, "error checking privileges for API Key %q: %v", id, err) | ||
} | ||
default: | ||
// If no ES error type is found, it implies an error on the TCP connection level. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly accurate. Reading the implementation of func (r HasPrivileges) Do(providedCtx context.Context) (*Response, error) {
, TCP is only a subset of the errors falling into this branch. I believe it is an ElasticsearchError when http response can be parsed as an Elasticsearch error.
For example, a broken http proxy can return a body that cannot be parsed by hasPrivileges. It is not necessarily a TCP error.
} | ||
default: | ||
// If no ES error type is found, it implies an error on the TCP connection level. | ||
return ctx, errorWithDetails(codes.Unavailable, fmt.Sprintf("retryable server error %q: %v", id, err), nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why are we using errorWithDetails here instead of status.Errorf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comes from a recommendation from @vigneshshanmugam in https://github.com/elastic/hosted-otel-collector/pull/1529#discussion_r2412175374 for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I understand. Then why are we not using errorWithDetails on line 228 and 290? And what's the point of adding the domain with nil metadata as error info to the error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are talking about serverless ES that is unavailable, we might get 502 from proxy instead. Have you considered handling 502 and surfacing that as a retryable error to otel client?
retryable server error
.elasticsearchErr
is not ok, it implies the ES instance could not be reached, and therefore does not return a specific ES error, but rather acode.Internal
error. The error happens on the TCP connection level.Manual Test