[nexus] Refuse to boot until Nexus finds a matching DB schema #3531
I think this looks okay as-is and urgency probably demands we land it soon. I had one substantive comment below about availability and a few typo-level nits.
I also wonder if it would make more sense to put this into DataStore initialization and have the DataStore abstract this -- i.e., it would check this on startup and refuse to give out connections until/unless the schema version matched.
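One possible reading of this suggestion, as a rough sketch only: the datastore holds a gate that stays closed until the schema version read from the database matches what the binary expects. The names `GatedDataStore`, `observe_version`, `Connection`, and `SchemaNotVerified` are hypothetical and are not taken from the Nexus code; the real DataStore hands out pooled async connections, which this synchronous stand-in does not model.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Error returned while the schema version has not been confirmed (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub struct SchemaNotVerified;

/// Stand-in for a pooled database connection (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub struct Connection;

/// A datastore wrapper that refuses to hand out connections until the
/// schema version found in the database matches the expected one.
pub struct GatedDataStore {
    expected_version: String,
    schema_verified: AtomicBool,
}

impl GatedDataStore {
    /// Creates a datastore whose gate starts closed.
    pub fn new(expected_version: &str) -> Self {
        GatedDataStore {
            expected_version: expected_version.to_string(),
            schema_verified: AtomicBool::new(false),
        }
    }

    /// Records a version read from the database. The gate opens only on an
    /// exact match and closes again if a later read disagrees.
    pub fn observe_version(&self, found: &str) -> bool {
        let ok = found == self.expected_version;
        self.schema_verified.store(ok, Ordering::SeqCst);
        ok
    }

    /// Hands out a connection only while the gate is open.
    pub fn connection(&self) -> Result<Connection, SchemaNotVerified> {
        if self.schema_verified.load(Ordering::SeqCst) {
            Ok(Connection)
        } else {
            Err(SchemaNotVerified)
        }
    }
}
```

With this shape, callers never see a connection to a database whose schema has not been checked, and a later background poll could reuse `observe_version` without a separate startup path.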
nexus/src/app/mod.rs
Outdated
warn!(log, "Cannot read database schema version: {e}");
}
}
tokio::time::sleep(std::time::Duration::from_secs(1)).await;
I'd suggest sleeping longer. Ideally I think we'd use backoff if we're failing to read the schema at all, and a fixed sleep (like 60 seconds?) if we successfully read the wrong schema.
(The thing I'm most worried about here is if we're recovering from a database outage and hitting the database too hard. Secondarily it seems unnecessary to poll every second for something that's unlikely to change that soon.)
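To make the two cases concrete, here is a minimal sketch of the delay policy described above: exponential backoff while the version cannot be read at all, and a fixed delay once a wrong version has been read successfully. The type `SchemaCheck`, the function `next_delay`, and all three constants are assumptions for illustration; only the 60-second figure comes from this comment, and the 180-second cap is a placeholder.

```rust
use std::time::Duration;

/// Outcome of one attempt to read the schema version (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SchemaCheck {
    /// The version was read and matches what this binary expects.
    Matches,
    /// The version was read, but it is not the expected one.
    Mismatch { found: String },
    /// The version could not be read (for example, the database is down).
    ReadError(String),
}

// Hypothetical tuning values, chosen only for this example.
pub const MISMATCH_DELAY: Duration = Duration::from_secs(60);
pub const INITIAL_BACKOFF: Duration = Duration::from_secs(1);
pub const MAX_BACKOFF: Duration = Duration::from_secs(180);

/// Returns how long to wait before the next attempt, or `None` when the
/// schema matches and no further polling is needed.
///
/// `consecutive_read_errors` counts failed reads in a row, including the
/// one that just happened.
pub fn next_delay(outcome: &SchemaCheck, consecutive_read_errors: u32) -> Option<Duration> {
    match outcome {
        SchemaCheck::Matches => None,
        // A successful read of the wrong version is unlikely to change
        // soon, so poll at a slow, fixed rate.
        SchemaCheck::Mismatch { .. } => Some(MISMATCH_DELAY),
        // A failed read may mean the database is recovering: double the
        // delay on each consecutive failure, up to the cap.
        SchemaCheck::ReadError(_) => {
            let exponent = consecutive_read_errors.saturating_sub(1).min(16);
            let delay = INITIAL_BACKOFF.saturating_mul(1u32 << exponent);
            Some(delay.min(MAX_BACKOFF))
        }
    }
}
```

A caller would sleep for the returned duration inside its retry loop and reset the error counter after any successful read. Production backoff usually adds jitter, which this sketch omits.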
WDYT about using backoff? I have a test for this behavior, so I kinda want to avoid using a 60-second timeout, unless you think I should plumb this through as a configuration parameter?
Yeah let's just do backoff for now.
The thing that makes me a little uneasy about it is that our max backoff used to be an hour, which I still think is more appropriate for overload conditions than the current max of about 3 minutes. (I understood that we changed it to 3 minutes basically because #3082 was too painful otherwise.) At some point we're going to need to bump that up a lot more again. And an hour is a long time to wait in this particular case. (This condition also doesn't really have the backoff property that the client's own behavior might be affecting its ability to succeed, so backoff is a little weird.)
But in practice I think this will be fine for the foreseeable future so let's go with it.
Sure, I went ahead and made this change. The check is now part of DataStore initialization.
I meant something a little different but I wouldn't do it right now anyway. (What I meant was more like:
The reason is just to make "on startup" less of a special case. If we decide later to poll on the version to detect if it's changed (e.g., in a background task), we could just set [...] I'm not sure this is better for us right now. It'd certainly create a bunch more noise.)
db_metadata table, ensuring that the schema "in-the-DB" matches the schema "in Nexus". At the moment, we're being picky, and we require an exact match.
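As an illustration of the exact-match rule, the comparison could look like the sketch below. The constant `EXPECTED_SCHEMA_VERSION`, its value, the error type, and the function name are all hypothetical; how the version is actually stored in and read from db_metadata is not shown here.

```rust
/// Schema version this binary was built against (hypothetical value).
pub const EXPECTED_SCHEMA_VERSION: &str = "1.0.0";

/// Error describing a schema version that does not match (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SchemaError {
    Mismatch { expected: String, found: String },
}

/// Accepts only a version identical to the expected one. No compatibility
/// ranges are considered: any other value is rejected.
pub fn check_schema_version(found: &str) -> Result<(), SchemaError> {
    if found == EXPECTED_SCHEMA_VERSION {
        Ok(())
    } else {
        Err(SchemaError::Mismatch {
            expected: EXPECTED_SCHEMA_VERSION.to_string(),
            found: found.to_string(),
        })
    }
}
```

Being this strict means even a newer, backward-compatible schema is refused. That matches the "picky" behavior described above and could be relaxed later.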