Include database status on the /health endpoint or a new endpoint like /db-health #6588


Open
sunshineo opened this issue Apr 9, 2020 · 13 comments
Labels
type:feature New feature or improvement of existing feature

Comments

@sunshineo
Contributor

Is your feature request related to a problem? Please describe.
I deployed Parse Server on Kubernetes and used /parse/health for both the readinessProbe and the livenessProbe. When a new deployment broke the database connection, /parse/health still returned 200 OK, so the old container was killed and the new, broken one was brought online.

Describe the solution you'd like
Only return 200 OK once the database is connected, either on /parse/health or on a new /parse/db-health endpoint.

Describe alternatives you've considered
I tried to create a cloud function that checks the database connection, but calling it requires a POST request with the application id in the header, which a Kubernetes probe cannot do.

Additional context
Related to #4575

@davimacedo added the type:feature (New feature or improvement of existing feature) and good first issue labels Apr 14, 2020
@davimacedo
Member

It sounds reasonable to me. Are you willing to tackle this one?

@mtrezza
Member

mtrezza commented Apr 14, 2020

I think it's a good PR, but this is definitely a breaking change and should be noted clearly in the release notes.

Currently, /parse/health is an endpoint that may already be in use to tell whether Parse Server is reachable via a given domain, e.g. for connection string failover. The information derived from the endpoint's response changes slightly once it also depends on the health of the database connection.

@sunshineo
Contributor Author

New info I learned: in Kubernetes there is a liveness probe and a readiness probe. The liveness probe can stay the current /parse/health; it only indicates whether the container is responsive, and if it is not, Kubernetes restarts it. The readiness probe, however, should also check that there is a working database connection, because only then should Kubernetes send traffic to the container.

So we should have a separate URL like /parse/db-health.

@mtrezza
Member

mtrezza commented Apr 15, 2020

Good point. Is there a scenario in which the DB is not ready immediately and only becomes ready after some time? Does Parse Server support a DB connection that is not ready on start-up?

What do you think about adding a general /parse/ready endpoint? I am thinking of a cache system or other modules that may need to become ready before Parse Server can function. All these modules could be summarized in a single ready endpoint, which is easily extendable.
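For illustration, such an endpoint could look roughly like the sketch below (Express-style; the per-module check functions are placeholders, not existing Parse Server APIs):

    // Hypothetical registry of readiness checks, one per module.
    const readinessChecks = {
      database: checkDatabase, // placeholder: e.g. ping the configured database adapter
      cache: checkCache,       // placeholder: e.g. ping the cache adapter, if one is set up
    };

    app.get('/parse/ready', async (req, res) => {
      const results = {};
      let ready = true;
      for (const [name, check] of Object.entries(readinessChecks)) {
        try {
          await check();
          results[name] = 'ok';
        } catch (e) {
          results[name] = 'error';
          ready = false;
        }
      }
      res.status(ready ? 200 : 503).json({ ready, checks: results });
    });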

@sunshineo
Contributor Author

Thank you @mtrezza. I'll try to create a ready endpoint, starting with just the database check, and we can add more later. I plan to simply do a query on the _User collection with count=0. I think it will work, but let me know if there is a better way.
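Roughly, the check could be a small sketch like this (assuming it runs server-side with the master key available):

    // Sketch: consider the database "ready" if a count query on _User succeeds.
    async function checkDatabase() {
      const query = new Parse.Query(Parse.User);
      await query.count({ useMasterKey: true }); // rejects if the database is unreachable
    }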

@mtrezza
Member

mtrezza commented Apr 15, 2020

It depends on how you define “ready” for the DB, which itself exposes its own status information.

For example:

The driver has an isConnected method to determine whether a connection between the driver and the DB is established.

https://mongodb.github.io/node-mongodb-native/3.5/api/MongoClient.html#isConnected

The server status, which may be a good middle ground for your ready endpoint.

https://mongodb.github.io/node-mongodb-native/3.5/api/Admin.html#serverStatus

The replica set status, where you determine the status of the individual nodes. Just because you can execute a query doesn’t mean the replica set is fully up and running and ready for what you consider your modus operandi. Maybe a separate PR in the future, if someone wants to get into these details.

https://mongodb.github.io/node-mongodb-native/3.5/api/Admin.html#replSetGetStatus
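Roughly, the three checks could look like this with the 3.5 Node driver (connection handling and the returned fields here are just assumptions, not a worked-out proposal):

    const { MongoClient } = require('mongodb');

    async function checkMongo(uri) { // uri is a placeholder connection string
      const client = await MongoClient.connect(uri);
      try {
        // 1. Driver-level check: is a connection established?
        const connected = client.isConnected();

        // 2. Server status: uptime, connections, etc.
        const status = await client.db('admin').admin().serverStatus();

        // 3. Replica set status (throws on a standalone deployment)
        const rs = await client.db('admin').admin().replSetGetStatus();

        return { connected, uptime: status.uptime, replicaSetState: rs.myState };
      } finally {
        await client.close();
      }
    }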


@davimacedo
Member

> Thank you @mtrezza. I'll try to create a ready endpoint, starting with just the database check, and we can add more later. I plan to simply do a query on the _User collection with count=0. I think it will work, but let me know if there is a better way.

I'd do a query in the schema table.
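Something along these lines, assuming direct access to the MongoDB connection and the default _SCHEMA collection name Parse Server uses on MongoDB:

    // Sketch: probe readiness by reading from the schema collection instead of _User.
    // `db` is assumed to be an already-connected mongodb Db instance.
    async function checkSchema(db) {
      await db.collection('_SCHEMA').findOne({}, { projection: { _id: 1 } });
    }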

@mtrezza
Member

mtrezza commented Apr 16, 2020

I did some research and came across two concepts for these endpoints. Maybe it can give you some inspiration.

  1. Accumulative endpoints

/health/live - The application is up and running.
/health/ready - The application is ready to serve requests.
/health - Accumulating all health check procedures in the application.

https://quarkus.io/guides/microprofile-health

  2. Separate endpoints

GET /-/health
Only checks whether the application server is running. It does not verify that the database or other services are running. A successful response will return a 200 status code.

GET /-/readiness
Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don’t want to kill the application, but you don’t want to send it requests either.

GET /-/liveness
Many applications running for long periods of time eventually transition to broken states, and cannot recover except by being restarted.

GET /-/metrics
Node process metrics
Infrastructure to collect internal metrics (like southbound APIs)
API to retrieve the collected metrics and to reset them.

https://www.npmjs.com/package/@ozawa/express-health-check-middleware
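For illustration, a minimal Express sketch of the endpoint layout in option 1 (checkDatabase is a placeholder for whatever readiness checks we decide on):

    // Liveness: the process is up and able to answer.
    app.get('/health/live', (req, res) => res.sendStatus(200));

    // Readiness: dependencies (e.g. the database) are reachable.
    app.get('/health/ready', async (req, res) => {
      try {
        await checkDatabase();
        res.sendStatus(200);
      } catch (e) {
        res.sendStatus(503);
      }
    });

    // Accumulated: summarizes all checks in one response.
    app.get('/health', async (req, res) => {
      try {
        await checkDatabase();
        res.status(200).json({ status: 'UP' });
      } catch (e) {
        res.status(503).json({ status: 'DOWN' });
      }
    });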

@davimacedo
Member

I believe option 1 is good enough for now but option 2 would be really great. I have no idea how we could ensure all those states in option 2 though.

@mtrezza
Member

mtrezza commented Apr 26, 2020

I think we should decide on a long-term option now, because if we want to change this in a few months, we may face a breaking change and things get more complex.

  • Option 1 takes /health as a summary endpoint (everything is running and ready).
  • Option 2 takes /health as a single-aspect endpoint (Parse Server is running, but maybe not ready).

Which approach is more common in third-party systems, to ensure compatibility and practicality for this endpoint?

@mman
Contributor

mman commented Apr 28, 2020

Looking around my k8s cluster, this is what prometheus uses:

    Liveness:     http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6
    Readiness:    http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=120

and this is what ambassador uses:

    Liveness:   http-get http://:8877/ambassador/v0/check_alive delay=30s timeout=1s period=3s #success=1 #failure=3
    Readiness:  http-get http://:8877/ambassador/v0/check_ready delay=30s timeout=1s period=3s #success=1 #failure=3

So IMHO focusing on ready vs. live, where the live probe avoids accidental restarts and the ready probe serves as a signal to start routing traffic, sounds better than focusing on overall health vs. db health.

This is a great resource on where I think we should aim to go: https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-setting-up-health-checks-with-readiness-and-liveness-probes

@mtrezza
Member

mtrezza commented Apr 28, 2020

@mman Great research, makes absolute sense. If I understand it correctly, it’s option 2, right?


@mman
Contributor

mman commented Apr 28, 2020

Yes, option 2, but also potentially changing the names of the endpoints to better indicate their purpose. /healthy + /ready, as Prometheus uses, sounds logical to me, perhaps keeping the legacy /health around as an alias for /healthy for compatibility reasons.
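The compatibility alias could be as simple as this sketch (handler names are hypothetical):

    // Expose the new endpoints and keep the legacy path as an alias.
    app.get('/parse/healthy', livenessHandler);
    app.get('/parse/ready', readinessHandler);
    app.get('/parse/health', livenessHandler); // legacy, kept for compatibility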
