[nexus] Refuse to boot until Nexus finds a matching DB schema #3531

smklein · 2023-07-08T00:01:31Z

Nexus now reads the db_metadata table, ensuring that the schema "in-the-DB" matches the schema "in Nexus". At the moment, we're being picky, and we require an exact match.
Adds tests validating that we refuse to boot if the schema does not match
Add tests validating that we eventually boot Nexus if the schema is corrected

nexus/db-queries/src/db/datastore/db_metadata.rs

nexus/db-model/src/schema.rs

nexus/src/app/mod.rs

davepacheco

I think this looks okay as-is and urgency probably demands we land it soon. I had one substantive comment below about availability and a few typo-level nits.

I also wonder if it would make more sense to put this into DataStore initialization and have the DataStore abstract this -- i.e., it would check this on startup and refuse to give out connections until/unless the schema version matched.

nexus/db-model/src/db_metadata.rs

nexus/src/app/db_metadata.rs

davepacheco · 2023-07-11T16:55:09Z

nexus/src/app/mod.rs

+                    warn!(log, "Cannot read database schema version: {e}");
+                }
+            }
+            tokio::time::sleep(std::time::Duration::from_secs(1)).await;


I'd suggest sleeping longer. Ideally I think we'd use backoff if we're failing to read the schema at all, and a fixed sleep (like 60 seconds?) if we successfully read the wrong schema.

(The thing I'm most worried about here is if we're recovering from a database outage and hitting the database too hard. Secondarily it seems unnecessary to poll every second for something that's unlikely to change that soon.)

WDYT about using backoff? I have a test for this behavior, so I kinda want to avoid using a 60-second timeout, unless you think I should plumb this through as a configuration parameter?

Yeah let's just do backoff for now.

The thing that makes me a little uneasy about it is that our max backoff used to be an hour, which I still think is more appropriate for overload conditions than the current max of about 3 minutes. (I understood that we changed it to 3 minutes basically because #3082 was too painful otherwise.) At some point we're going to need to bump that up a lot more again. And an hour is a long time to wait in this particular case. (This condition also doesn't really have the backoff property that the client's own behavior might be affecting its ability to succeed, so backoff is a little weird.)

But in practice I think this will be fine for the foreseeable future so let's go with it.

smklein · 2023-07-12T01:08:51Z

I also wonder if it would make more sense to put this into DataStore initialization and have the DataStore abstract this -- i.e., it would check this on startup and refuse to give out connections until/unless the schema version matched.

Sure, I went ahead and made this change. The check is now part of DataStore initialization.

davepacheco · 2023-07-12T18:34:00Z

I also wonder if it would make more sense to put this into DataStore initialization and have the DataStore abstract this -- i.e., it would check this on startup and refuse to give out connections until/unless the schema version matched.

Sure, I went ahead and made this change. The check is now part of DataStore initialization.

I meant something a little different but I wouldn't do it right now anyway.

(What I meant was more like:

There's an enum SchemaVersion { Acceptable, Bad, Unknown }
You can still create the datastore synchronously
DataStore has a schema_version_acceptable: SchemaVersion that starts out Unknown
In the background, on startup, DataStore does this check. On completion, it sets the flag to Acceptable
if you try to check out a connection (with DataStore::pool*), if schema_version_acceptable != Acceptable, you immediately get a 500 (if Bad) or 503 (if Unknown) error

The reason is just to make an "on startup" less of a special case. If we decide later to poll on the version to detect if it's changed (e.g., in a background task), we could just set schema_version_acceptable to Bad and stuff should generally do the right thing.

I'm not sure this is better for us right now. It'd certainly create a bunch more noise.)

[nexus] Refuse to boot until we're happy with our schema version

f97961c

smklein commented Jul 8, 2023

View reviewed changes

nexus/db-queries/src/db/datastore/db_metadata.rs Show resolved Hide resolved

smklein commented Jul 8, 2023

View reviewed changes

nexus/db-model/src/schema.rs Show resolved Hide resolved

smklein commented Jul 8, 2023

View reviewed changes

nexus/src/app/mod.rs Outdated Show resolved Hide resolved

smklein added 2 commits July 10, 2023 08:32

Merge branch 'main' into schema

713a762

tests!

f4e3722

smklein requested a review from davepacheco July 10, 2023 16:49

smklein marked this pull request as ready for review July 10, 2023 17:01

davepacheco approved these changes Jul 11, 2023

View reviewed changes

smklein added 3 commits July 11, 2023 12:40

Merge branch 'main' into schema

55f0882

review feedback

a05bfa6

Update timeouts to deal with retry policy

81c43e0

Patch populator test to work with new DataStore construction

869bcc2

davepacheco approved these changes Jul 12, 2023

View reviewed changes

Merge branch 'main' into schema

f1baa52

smklein merged commit 76d51d8 into main Jul 13, 2023

smklein deleted the schema branch July 13, 2023 18:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[nexus] Refuse to boot until Nexus finds a matching DB schema #3531

[nexus] Refuse to boot until Nexus finds a matching DB schema #3531

Uh oh!

smklein commented Jul 8, 2023 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

davepacheco left a comment

Uh oh!

Uh oh!

Uh oh!

davepacheco Jul 11, 2023

Uh oh!

smklein Jul 12, 2023

Uh oh!

davepacheco Jul 12, 2023

Uh oh!

smklein commented Jul 12, 2023

Uh oh!

davepacheco commented Jul 12, 2023

Uh oh!

Uh oh!

[nexus] Refuse to boot until Nexus finds a matching DB schema #3531

[nexus] Refuse to boot until Nexus finds a matching DB schema #3531

Uh oh!

Conversation

smklein commented Jul 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

davepacheco left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

davepacheco Jul 11, 2023

Choose a reason for hiding this comment

Uh oh!

smklein Jul 12, 2023

Choose a reason for hiding this comment

Uh oh!

davepacheco Jul 12, 2023

Choose a reason for hiding this comment

Uh oh!

smklein commented Jul 12, 2023

Uh oh!

davepacheco commented Jul 12, 2023

Uh oh!

Uh oh!

smklein commented Jul 8, 2023 •

edited

Loading