Skip to content

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Jul 8, 2023

  • Nexus now reads the db_metadata table, ensuring that the schema "in-the-DB" matches the schema "in Nexus". At the moment, we're being picky, and we require an exact match.
  • Adds tests validating that we refuse to boot if the schema does not match
  • Add tests validating that we eventually boot Nexus if the schema is corrected

@smklein smklein requested a review from davepacheco July 10, 2023 16:49
@smklein smklein marked this pull request as ready for review July 10, 2023 17:01
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks okay as-is and urgency probably demands we land it soon. I had one substantive comment below about availability and a few typo-level nits.

I also wonder if it would make more sense to put this into DataStore initialization and have the DataStore abstract this -- i.e., it would check this on startup and refuse to give out connections until/unless the schema version matched.

warn!(log, "Cannot read database schema version: {e}");
}
}
tokio::time::sleep(std::time::Duration::from_secs(1)).await;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest sleeping longer. Ideally I think we'd use backoff if we're failing to read the schema at all, and a fixed sleep (like 60 seconds?) if we successfully read the wrong schema.

(The thing I'm most worried about here is if we're recovering from a database outage and hitting the database too hard. Secondarily it seems unnecessary to poll every second for something that's unlikely to change that soon.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about using backoff? I have a test for this behavior, so I kinda want to avoid using a 60-second timeout, unless you think I should plumb this through as a configuration parameter?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah let's just do backoff for now.

The thing that makes me a little uneasy about it is that our max backoff used to be an hour, which I still think is more appropriate for overload conditions than the current max of about 3 minutes. (I understood that we changed it to 3 minutes basically because #3082 was too painful otherwise.) At some point we're going to need to bump that up a lot more again. And an hour is a long time to wait in this particular case. (This condition also doesn't really have the backoff property that the client's own behavior might be affecting its ability to succeed, so backoff is a little weird.)

But in practice I think this will be fine for the foreseeable future so let's go with it.

@smklein
Copy link
Collaborator Author

smklein commented Jul 12, 2023

I also wonder if it would make more sense to put this into DataStore initialization and have the DataStore abstract this -- i.e., it would check this on startup and refuse to give out connections until/unless the schema version matched.

Sure, I went ahead and made this change. The check is now part of DataStore initialization.

@davepacheco
Copy link
Collaborator

I also wonder if it would make more sense to put this into DataStore initialization and have the DataStore abstract this -- i.e., it would check this on startup and refuse to give out connections until/unless the schema version matched.

Sure, I went ahead and made this change. The check is now part of DataStore initialization.

I meant something a little different but I wouldn't do it right now anyway.

(What I meant was more like:

  • There's an enum SchemaVersion { Acceptable, Bad, Unknown }
  • You can still create the datastore synchronously
  • DataStore has a schema_version_acceptable: SchemaVersion that starts out Unknown
  • In the background, on startup, DataStore does this check. On completion, it sets the flag to Acceptable
  • if you try to check out a connection (with DataStore::pool*), if schema_version_acceptable != Acceptable, you immediately get a 500 (if Bad) or 503 (if Unknown) error

The reason is just to make an "on startup" less of a special case. If we decide later to poll on the version to detect if it's changed (e.g., in a background task), we could just set schema_version_acceptable to Bad and stuff should generally do the right thing.

I'm not sure this is better for us right now. It'd certainly create a bunch more noise.)

@smklein smklein merged commit 76d51d8 into main Jul 13, 2023
@smklein smklein deleted the schema branch July 13, 2023 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants