update error code if ES dial connection error #797

rubvs · 2025-10-05T15:32:39Z

If ES is down, we want to return an Unavailable error code along with a user-facing error: retryable server error.
If elasticsearchErr is not ok, the ES instance could not be reached, so the call returns no specific ES error, only a codes.Internal error. The error happens at the TCP connection level.

Manual Test

# Console 1: Spin up a cluster
> tilt up

# Console 2: Make ES unavailable
> kubectl scale statefulset elasticsearch-es-default --replicas=0

# Console 3: Ensure a retryable error is returned to the user when trying to auth to an unavailable ES.
> TEST_COUNT=10 make run-otelbench mode=local

2025-10-05T15:56:21.572Z  error  internal/queue_sender.go:51  Exporting failed. Dropping data.
{
  "resource": {
    "service.instance.id": "645a4319-43c4-4f8b-b97b-27f5adf6dc0c",
    "service.name": "/ko-app/loadgen",
    "service.version": "0.0.1"
  },
  "otelcol.component.id": "otlp",
  "otelcol.component.kind": "exporter",
  "otelcol.signal": "logs",
  "error": "interrupted due to shutdown: rpc error: code = Unavailable desc = rpc error: code = Unavailable desc = retryable server error \"o1voipUB6v8S06vJAX5J\": an error happened during the HasPrivileges query execution: dial tcp 10.96.217.236:9200: connect: connection refused",
  "dropped_items": 325
}

extension/apikeyauthextension/authenticator_test.go

isaacaflores2 · 2025-10-07T16:33:31Z

extension/apikeyauthextension/authenticator.go

+		// If no ES error type is found, it implies an error on the TCP connection level.
+		// In this case, we want to return an unavailable code so we have to option of
+		// handling this as a user-defined error later on.
+		return ctx, status.Errorf(codes.Unavailable, "error checking privileges for API Key %q: %v", id, err)


It looks like we can reach this return if the error is not of type ElasticsearchError or if the elasticsearchErr.Status is not unauthorized or forbidden. Are we supposed to return codes.Unavailable for both?

My understanding here is that an ES error cannot be returned if ES is unavailable, since no ES instance is ever "hit" that would return an ES-type error.

I'm not sure if codes.Unavailable is the "correct" type to return, but it's the only one that makes sense to me, and is the closest representation of ES being down or unavailable.

My understanding here is that an ES error cannot be returned if ES is unavailable, since no ES instance is ever "hit" that would return an ES-type error.

Makes sense. My question was about the if statement I mentioned in the other comment. When ES returns a 400 or 429, we will end up returning codes.Unavailable because it is not an expected status, not because no ES error type was returned.

Oh yeah, I was thinking about this also before putting up the PR.

My thinking here was that:

We could make the error check exhaustive with a switch and explicitly handle each error type.

But since this happens at the auth layer, one cannot really get a 429: Too Many Requests by definition. Maybe you can get a 400: Invalid Argument, but in general my thinking was that if a user cannot even auth, they can't send any requests to ES, since they will all be rejected.

This is definitely a nuance that we need team consensus on. I might be completely missing something. Thanks @isaacaflores2, I forgot to outline this in the PR description.

Ah, I see now. If we want to change the default response status code, then I think we also need to update this comment (or just remove it):

// If no ES error type is found, it implies an error on the TCP connection level.
// In this case, we want to return an unavailable code so we have to option of
// handling this as a user-defined error later on.

I don't think this is the correct way to handle it. Two things:

If the error is of type types.ElasticsearchError, then we need to convert it to a gRPC-specific error and return that. This could happen when the deployment is not found (codes.Internal might be better here).

For all other unknown types, it would be better to return an Internal or Unavailable error, as we can't reach the ES endpoint for some reason.
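The two cases above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in this PR: `grpcCode` stands in for `codes.Code`, `esError` is a hypothetical stand-in for `types.ElasticsearchError`, and the 401/403 mappings are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// grpcCode is a stand-in for google.golang.org/grpc/codes.Code so that the
// sketch runs with only the standard library.
type grpcCode string

const (
	codeUnauthenticated  grpcCode = "Unauthenticated"
	codePermissionDenied grpcCode = "PermissionDenied"
	codeInternal         grpcCode = "Internal"
	codeUnavailable      grpcCode = "Unavailable"
)

// esError is a hypothetical stand-in for types.ElasticsearchError.
// Only the HTTP status is modelled here.
type esError struct {
	Status int
}

func (e *esError) Error() string {
	return fmt.Sprintf("elasticsearch error: status %d", e.Status)
}

// errConnRefused is a sample non-ES error used by main below.
var errConnRefused = errors.New("dial tcp: connect: connection refused")

// codeForPrivilegeCheckError picks a code for a non-nil error returned by the
// privileges check. ES-typed errors are mapped by status; everything else is
// treated as "ES could not be reached" and reported as Unavailable.
func codeForPrivilegeCheckError(err error) grpcCode {
	var esErr *esError
	if errors.As(err, &esErr) {
		switch esErr.Status {
		case http.StatusUnauthorized: // assumed mapping
			return codeUnauthenticated
		case http.StatusForbidden: // assumed mapping
			return codePermissionDenied
		default:
			// ES answered, but with a status we do not handle explicitly.
			return codeInternal
		}
	}
	return codeUnavailable
}

func main() {
	fmt.Println(codeForPrivilegeCheckError(&esError{Status: http.StatusBadRequest}))
	fmt.Println(codeForPrivilegeCheckError(errConnRefused))
}
```

With this shape, a 400 from ES no longer falls through to Unavailable, which was the concern raised earlier in the thread.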

@vigneshshanmugam updates are in a4954a4. I have also updated the PR description with the steps for manual testing and the results.

Let me know if you want to handle any specific ES errors to make the switch more exhaustive.

carsonip

Can we get a unit test for the new branch that you're adding?

carsonip · 2025-10-10T13:37:20Z

extension/apikeyauthextension/authenticator.go

+				return ctx, status.Errorf(codes.Internal, "error checking privileges for API Key %q: %v", id, err)
 			}
+		default:
+			// If no ES error type is found, it implies an error on the TCP connection level.


Not exactly accurate. Reading the implementation of func (r HasPrivileges) Do(providedCtx context.Context) (*Response, error) {, TCP errors are only a subset of the errors that fall into this branch. I believe the error is an ElasticsearchError when the HTTP response can be parsed as an Elasticsearch error.

For example, a broken HTTP proxy can return a body that cannot be parsed by hasPrivileges. That is not necessarily a TCP error.
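To make the distinction concrete, here is a small standard-library sketch of telling a network-level failure apart from other non-ES errors. It assumes the client wraps the underlying dial error so that `errors.As` can find a `*net.OpError`; that wrapping has not been verified against the HasPrivileges implementation. The helper and sample errors are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isNetworkError reports whether err, or any error it wraps, is a
// *net.OpError, the type the net package uses for failed dials, reads
// and writes.
func isNetworkError(err error) bool {
	var opErr *net.OpError
	return errors.As(err, &opErr)
}

// sampleDialError builds a wrapped dial failure similar in shape to the one
// in the manual test log. The wrapping message is illustrative only.
func sampleDialError() error {
	dialErr := &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connect: connection refused")}
	return fmt.Errorf("query execution failed: %w", dialErr)
}

// sampleParseError mimics a response body that could not be decoded,
// for example HTML returned by a broken proxy.
func sampleParseError() error {
	return errors.New("invalid character '<' looking for beginning of value")
}

func main() {
	fmt.Println("dial error is network error:", isNetworkError(sampleDialError()))
	fmt.Println("parse error is network error:", isNetworkError(sampleParseError()))
}
```

Both sample errors would land in the default branch, but only the first is a TCP-level failure, so the code comment there should not claim that every error in the branch is one.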

carsonip · 2025-10-10T13:46:49Z

extension/apikeyauthextension/authenticator.go

 			}
+		default:
+			// If no ES error type is found, it implies an error on the TCP connection level.
+			return ctx, errorWithDetails(codes.Unavailable, fmt.Sprintf("retryable server error %q: %v", id, err), nil)


nit: why are we using errorWithDetails here instead of status.Errorf?

This comes from a recommendation from @vigneshshanmugam in https://github.com/elastic/hosted-otel-collector/pull/1529#discussion_r2412175374 for consistency.

I'm not sure I understand. Then why are we not using errorWithDetails on lines 228 and 290? And what's the point of adding the domain with nil metadata as error info to the error?

carsonip

If we are talking about serverless ES being unavailable, we might get a 502 from the proxy instead. Have you considered handling 502 and surfacing it as a retryable error to the OTel client?
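One possible shape for that check, as a hedged sketch only: a helper that decides whether an HTTP status coming back from a proxy should be treated as retryable. The function name is hypothetical, and including 503 and 504 alongside 502 is an assumption of this example, not something decided in this thread.

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryableProxyStatus reports whether an HTTP status returned by a proxy
// in front of ES should be surfaced to the client as a retryable error.
// Only 502 is mentioned in the review; 503 and 504 are assumed here.
func isRetryableProxyStatus(status int) bool {
	switch status {
	case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return true
	default:
		return false
	}
}

func main() {
	for _, s := range []int{http.StatusBadGateway, http.StatusBadRequest} {
		fmt.Printf("%d retryable: %t\n", s, isRetryableProxyStatus(s))
	}
}
```

A caller could map a true result to codes.Unavailable in the same way as the dial-error case, so the OTel client retries instead of dropping data.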

update error code is ES dial connection error

c2da00f

rubvs requested review from a team as code owners October 5, 2025 15:32

rubvs added 2 commits October 5, 2025 11:36

code cleanup

eb54af7

Merge branch 'main' into update-apikey-error-code

d09f84d

rubvs requested a review from vigneshshanmugam October 5, 2025 15:45

isaacaflores2 reviewed Oct 7, 2025

improve error handling

a4954a4

carsonip reviewed Oct 10, 2025
