-
Notifications
You must be signed in to change notification settings - Fork 656
Add middleware to prioritize download traffic #2479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
168e47e
to
35734fe
Compare
☔ The latest upstream changes (presumably #2483) made this pull request unmergeable. Please resolve the merge conflicts. |
This new middleware layer will reject some requests as the database pool reaches capacity. At 20% load, the `in_flight_requests` count is added to the log output. At 70% load, all safe requests (`GET`, `HEAD`, `OPTIONS`, `TRACE`) are rejected immediately. This will reject many legitimate frontend requests as well, but should catch all bot traffic (which is unlikely to send `PUT`, `POST`, or `DELETE` requests). This filter also helps avoid rejecting frontend requests that update the database where we don’t always provide good feedback for errors in the UI. Finally, at 80% load all non-download traffic is rejected. In other words, at least 20% of database connections are reserved for handling download traffic. By choosing to drop other requests, there should be sufficient database connections available to avoid queuing and timeouts on download requests. There is some overlap with the LogConnectionPoolStatus middleware. These may eventually be consolidated to avoid some duplicate work and to make smarter decisions regarding the instantaneous spare capacity of individual pools (primary vs read-only replica). The existing middleware does give us some insight into our current in_flight_request counts on production. Looking through the logs, it is rare to have more than a few requests running at the same time. This is because download requests are completed very quickly and other API traffic accounts for about 10 requests per second.
To serve uploaded crates during local development, the middleware must serve the request before the ember rewrite occurs. Additionally, these requests should not affect the tally in the `BalanceCapacity` middleware.
34caeb9
to
944a98d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly looks good!
The numbers seems fine, but I'd prefer to have them as environment variables with default values: if a need to tweak them during an outage arises I don't want to have to look into the source to find where they're defined and do a full deploy.
|
||
fn over_capcity_response() -> AfterResult { | ||
// TODO: Generate an alert so we can investigate | ||
let body = "Service temporarily unavailable"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll want an explicit dropped_due_to_low_capacity=true
or similar item in the log, to know at a glance why that request returned a 503.
... and fix a typo
Thanks for the feedback! I've pushed a new commit to fix the typo, add logging to requests rejected due to capacity, and allow environment variables to override the hard-coded defaults. |
This looks good! @bors r+ |
📌 Commit 3a8bb56 has been approved by |
☀️ Test successful - checks-travis |
In recent months we've had several incidents where bot traffic has sent hundreds of expensive requests per minute, starving database resources and resulting in timeouts on download requests. While cargo will retry download requests, builds are sometimes still affected. For instance, here is an example from our own CI: https://github.com/rust-lang/crates.io/runs/631489355.
This new middleware layer will reject some requests as the database pool reaches capacity. At 20% load, the
in_flight_requests
count is added to the log output. At 70% load, all safe requests (GET
,HEAD
,OPTIONS
,TRACE
) are rejected immediately. This will reject many legitimate frontend requests as well, but should catch all bot traffic (which is unlikely to sendPUT
,POST
, orDELETE
requests). This filter also helps avoid rejecting frontend requests that update the database where we don’t always provide good feedback for errors in the UI.Finally, at 80% load all non-download traffic is rejected. In other words, at least 20% of database connections are reserved for handling download traffic. By choosing to drop other requests, there should be sufficient database connections available to avoid queuing and timeouts on download requests.
There is some overlap with the
LogConnectionPoolStatus
middleware. These may eventually be consolidated to avoid some duplicate work and to make smarter decisions regarding the instantaneous spare capacity of individual pools (primary vs read-only replica). The current heuristics are very simple, but I believe they are sufficient to meet our current needs with a large margin for growth in traffic.The existing middleware does give us some insight into our current in_flight_request counts on production. Looking through the logs, it is rare to have more than a few requests running at the same time. This is because download requests are completed very quickly and other API traffic accounts for only about 10 requests per second.
This middleware layer is added as the last layer in the stack. Requests that are served (such as static
ember
HTML) or blocked by earlier layers will not be processed by this middleware because they do not use a database connection and should not block server threads for long.r? @pietroalbini