
Failed to download OTS in US cluster (possibly happens for prebuilds, only) #8096

@csweichel

Bug description

Looking at the logs, we're seeing a surprisingly large number of OTS download failures during workspace content initialisation: https://console.cloud.google.com/logs/query;query=%22cannot%20download%20OTS%22%0A;timeRange=PT24H;cursorTimestamp=2022-02-08T14:43:32Z?project=workspace-clusters

Each of those failures is likely to yield a failed workspace, at least if the repo is private.
Possible contributing factors:

(edited by Sven)

  • As this seems to happen only in prebuilds (only in the US cluster, with lots of these error messages in d_b_prebuild_workspace), the time between creating the OTS and requesting it may for some reason exceed 30 min (the lifetime of a token). Prebuild clusters are sometimes heavily packed, so scaling up and similar delays may simply take too long.
  • We attempt to download the OTS multiple times for some reason. That's most likely a bug in the initializer. Checking the server logs and/or adding metrics would help identify this.

As part of a fix, we should introduce OTS download failure metrics and keep an eye on them.

Steps to reproduce

Check the logs

Metadata

Assignees

No one assigned

    Labels

    meta: stale (This issue/PR is stale and will be closed soon)
    team: workspace (Issue belongs to the Workspace team)
    type: bug (Something isn't working)
