@rubvs rubvs commented Oct 5, 2025

  • If ES is down, we want to return an Unavailable error code, along with a user-facing error message: retryable server error.
  • If elasticsearchErr is not ok, it implies the ES instance could not be reached, so no specific ES error is returned and the failure surfaces as a codes.Internal error instead. The error happens at the TCP connection level.

Manual Test

# Console 1: Spin up a cluster
> tilt up

# Console 2: Make ES unavailable
> kubectl scale statefulset elasticsearch-es-default --replicas=0

# Console 3: Ensure a retryable error is returned to the user when trying to auth to an unavailable ES.
> TEST_COUNT=10 make run-otelbench mode=local

2025-10-05T15:56:21.572Z  error  internal/queue_sender.go:51  Exporting failed. Dropping data.
{
  "resource": {
    "service.instance.id": "645a4319-43c4-4f8b-b97b-27f5adf6dc0c",
    "service.name": "/ko-app/loadgen",
    "service.version": "0.0.1"
  },
  "otelcol.component.id": "otlp",
  "otelcol.component.kind": "exporter",
  "otelcol.signal": "logs",
  "error": "interrupted due to shutdown: rpc error: code = Unavailable desc = rpc error: code = Unavailable desc = retryable server error \"o1voipUB6v8S06vJAX5J\": an error happened during the HasPrivileges query execution: dial tcp 10.96.217.236:9200: connect: connection refused",
  "dropped_items": 325
}

@rubvs rubvs requested review from a team as code owners October 5, 2025 15:32
@rubvs rubvs requested a review from vigneshshanmugam October 5, 2025 15:45
// If no ES error type is found, it implies an error on the TCP connection level.
// In this case, we want to return an unavailable code so we have the option of
// handling this as a user-defined error later on.
return ctx, status.Errorf(codes.Unavailable, "error checking privileges for API Key %q: %v", id, err)

Contributor

It looks like we can reach this return if the error is not of type ElasticsearchError or if the elasticsearchErr.Status is not unauthorized or forbidden. Are we supposed to return codes.Unavailable for both?

Author

@rubvs rubvs Oct 7, 2025

My understanding here is that an ES error cannot be returned if ES is unavailable, since no ES instance is ever "hit" that could return an ES-type error.

Author

I'm not sure if codes.Unavailable is the "correct" type to return, but it's the only one that makes sense to me, and is the closest representation of ES being down or unavailable.

Contributor

My understanding here is that an ES error cannot be returned if ES is unavailable, since no ES instance is ever "hit" that could return an ES-type error.

Makes sense. My question was about the if statement I mentioned in the other comment. When ES returns a 400 or 429, we will end up returning codes.Unavailable because the status is not one of the expected ones, not because no ES error type was returned.

Author

Oh yeah, I was thinking about this also before putting up the PR.

My thinking here was that:

  • We could make the error check exhaustive with a switch and explicitly handle each error type.
  • But since this happens on the auth layer, one cannot really get a 429: Too Many Requests by definition. Maybe you can get a 400: Invalid Argument, but in general my thinking was that if a user cannot even auth, then they can't send any requests to ES, since they will all be rejected.

This is definitely a nuance that we need team consensus on. I might be completely missing something. Thanks @isaacaflores2, I forgot to outline this in the PR description.

Contributor

Ah, I see now. If we want to change the default response status code, then I think we also need to update this comment (or just remove it):

// If no ES error type is found, it implies an error on the TCP connection level.
// In this case, we want to return an unavailable code so we have the option of
// handling this as a user-defined error later on.

Member

@vigneshshanmugam vigneshshanmugam Oct 7, 2025

I don't think this is the correct way to handle it. Two things:

  1. If the error is of type types.ElasticsearchError, then we need to convert it to a gRPC-specific error and return that. This could happen when the deployment is not found (codes.Internal might be better here).
  2. For all other unknown types, it would be better to return Internal or Unavailable, as we can't reach the ES endpoint for some reason.
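
To make the two cases above concrete, here is a minimal sketch of one possible shape for the classification. It is not the code from this PR: `grpcCode` and `esError` are hypothetical stand-ins for `codes.Code` and `types.ElasticsearchError` so the example builds with the standard library only, and the 401/403 mappings are assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// grpcCode is a local stand-in for google.golang.org/grpc/codes.Code so the
// sketch compiles with the standard library only.
type grpcCode string

const (
	codeUnauthenticated  grpcCode = "Unauthenticated"
	codePermissionDenied grpcCode = "PermissionDenied"
	codeInternal         grpcCode = "Internal"
	codeUnavailable      grpcCode = "Unavailable"
)

// esError is a hypothetical stand-in for types.ElasticsearchError; only the
// HTTP status is modelled here.
type esError struct {
	Status int
}

func (e *esError) Error() string {
	return fmt.Sprintf("elasticsearch error: status %d", e.Status)
}

// errConnRefused is a sample non-ES error, used only to exercise the sketch.
var errConnRefused = errors.New("dial tcp: connect: connection refused")

// classifyPrivilegeError picks a code for an error from the privileges check.
func classifyPrivilegeError(err error) grpcCode {
	var esErr *esError
	if errors.As(err, &esErr) {
		// ES answered with a structured error: map its status explicitly.
		switch esErr.Status {
		case http.StatusUnauthorized: // assumed mapping
			return codeUnauthenticated
		case http.StatusForbidden: // assumed mapping
			return codePermissionDenied
		default:
			// ES responded, but with a status that is not handled explicitly.
			return codeInternal
		}
	}
	// No ES error type was found: the request never produced a parsable ES
	// response, so report the backend as unavailable (retryable).
	return codeUnavailable
}
```

With this shape, `classifyPrivilegeError(&esError{Status: 429})` yields `Internal`, while `classifyPrivilegeError(errConnRefused)` yields `Unavailable`, which keeps the "unexpected ES status" and "no ES error at all" cases separate.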

Author

@vigneshshanmugam updated in a4954a4. I have also updated the PR description with the steps for manual testing and the results.

Let me know if you want to handle any specific ES errors to make the switch more exhaustive.

Member

@carsonip carsonip left a comment

Can we get a unit test for the new branch that you're adding?

return ctx, status.Errorf(codes.Internal, "error checking privileges for API Key %q: %v", id, err)
}
default:
// If no ES error type is found, it implies an error on the TCP connection level.

Member

Not exactly accurate. Reading the implementation of func (r HasPrivileges) Do(providedCtx context.Context) (*Response, error) {, TCP errors are only a subset of the errors falling into this branch. I believe the error is an ElasticsearchError only when the HTTP response can be parsed as an Elasticsearch error.

For example, a broken HTTP proxy can return a body that cannot be parsed by hasPrivileges. That is not necessarily a TCP error.
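
As an illustration of the distinction drawn above, the sketch below separates network-level failures from other non-ES errors using `errors.As` with `*net.OpError` from the standard library. The helper names and the sample errors are hypothetical, and whether the real client preserves the wrapped `*net.OpError` in its error chain is an assumption that would need checking.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isConnectionError reports whether err wraps a *net.OpError, that is, a
// failure in a network operation such as dial, read, or write.
func isConnectionError(err error) bool {
	var opErr *net.OpError
	return errors.As(err, &opErr)
}

// sampleDialError builds an error shaped like the "connection refused"
// failure in the PR description. It is constructed by hand for illustration.
func sampleDialError() error {
	opErr := &net.OpError{
		Op:  "dial",
		Net: "tcp",
		Err: errors.New("connect: connection refused"),
	}
	return fmt.Errorf("an error happened during the HasPrivileges query execution: %w", opErr)
}

// sampleParseError imitates a response body that could not be decoded, for
// example HTML returned by a broken proxy. The message text is made up.
func sampleParseError() error {
	return errors.New("invalid character '<' looking for beginning of value")
}
```

Here `isConnectionError(sampleDialError())` is true and `isConnectionError(sampleParseError())` is false, so a comment or branch that claims "TCP connection level" could be limited to the first case.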

}
default:
// If no ES error type is found, it implies an error on the TCP connection level.
return ctx, errorWithDetails(codes.Unavailable, fmt.Sprintf("retryable server error %q: %v", id, err), nil)

Member

nit: why are we using errorWithDetails here instead of status.Errorf?

Author

@rubvs rubvs Oct 10, 2025

Member

I'm not sure I understand. Then why are we not using errorWithDetails on lines 228 and 290? And what's the point of adding the domain with nil metadata as error info to the error?

Member

@carsonip carsonip left a comment

If we are talking about serverless ES being unavailable, we might get a 502 from the proxy instead. Have you considered handling 502 and surfacing it as a retryable error to the OTel client?
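
A minimal sketch of what such a check could look like, assuming the HTTP status of the failed response is available to the caller. The function name is hypothetical. Only 502 comes from the comment above; treating 503 and 504 the same way is an assumption added for illustration.

```go
package main

import "net/http"

// isRetryableGatewayStatus reports whether an HTTP status from the ES
// endpoint (or a proxy in front of it) could be surfaced as retryable.
func isRetryableGatewayStatus(status int) bool {
	switch status {
	case http.StatusBadGateway: // 502, the case raised in the comment above
		return true
	case http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		// 503 and 504 are included as an assumption, not from the thread.
		return true
	default:
		return false
	}
}
```

A caller could then return an Unavailable code when `isRetryableGatewayStatus` is true and fall back to the existing handling otherwise.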
