Skip to content

Add middleware to prioritize download traffic #2479

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 5, 2020

Conversation

jtgeibel
Copy link
Member

@jtgeibel jtgeibel commented Apr 30, 2020

In recent months we've had several incidents where bot traffic has sent hundreds of expensive requests per minute, starving database resources and resulting in timeouts on download requests. While cargo will retry download requests, builds are sometimes still affected. For instance, here is an example from our own CI: https://github.com/rust-lang/crates.io/runs/631489355.

This new middleware layer will reject some requests as the database pool reaches capacity. At 20% load, the in_flight_requests count is added to the log output. At 70% load, all safe requests (GET, HEAD, OPTIONS, TRACE) are rejected immediately. This will reject many legitimate frontend requests as well, but should catch all bot traffic (which is unlikely to send PUT, POST, or DELETE requests). This filter also helps avoid rejecting frontend requests that update the database where we don’t always provide good feedback for errors in the UI.

Finally, at 80% load all non-download traffic is rejected. In other words, at least 20% of database connections are reserved for handling download traffic. By choosing to drop other requests, there should be sufficient database connections available to avoid queuing and timeouts on download requests.

There is some overlap with the LogConnectionPoolStatus middleware. These may eventually be consolidated to avoid some duplicate work and to make smarter decisions regarding the instantaneous spare capacity of individual pools (primary vs read-only replica). The current heuristics are very simple, but I believe they are sufficient to meet our current needs with a large margin for growth in traffic.

The existing middleware does give us some insight into our current in_flight_request counts on production. Looking through the logs, it is rare to have more than a few requests running at the same time. This is because download requests are completed very quickly and other API traffic accounts for only about 10 requests per second.

This middleware layer is added as the last layer in the stack. Requests that are served (such as static ember HTML) or blocked by earlier layers will not be processed by this middleware because they do not use a database connection and should not block server threads for long.

r? @pietroalbini

@bors
Copy link
Contributor

bors commented May 3, 2020

☔ The latest upstream changes (presumably #2483) made this pull request unmergeable. Please resolve the merge conflicts.

jtgeibel added 3 commits May 3, 2020 11:18
This new middleware layer will reject some requests as the database pool
reaches capacity. At 20% load, the `in_flight_requests` count is added
to the log output. At 70% load, all safe requests (`GET`, `HEAD`,
`OPTIONS`, `TRACE`) are rejected immediately. This will reject many
legitimate frontend requests as well, but should catch all bot traffic
(which is unlikely to send `PUT`, `POST`, or `DELETE` requests). This
filter also helps avoid rejecting frontend requests that update the
database where we don’t always provide good feedback for errors in the
UI.

Finally, at 80% load all non-download traffic is rejected. In other words, at
least 20% of database connections are reserved for handling download
traffic. By choosing to drop other requests, there should be sufficient
database connections available to avoid queuing and timeouts on
download requests.

There is some overlap with the LogConnectionPoolStatus middleware.
These may eventually be consolidated to avoid some duplicate work and to
make smarter decisions regarding the instantaneous spare capacity of
individual pools (primary vs read-only replica).

The existing middleware does give us some insight into our current
in_flight_request counts on production. Looking through the logs, it is
rare to have more than a few requests running at the same time. This is
because download requests are completed very quickly and other API
traffic accounts for about 10 requests per second.
To serve uploaded crates during local development, the middleware must
serve the request before the ember rewrite occurs.  Additionally, these
requests should not affect the tally in the `BalanceCapacity`
middleware.
@jtgeibel jtgeibel force-pushed the balance-db-pool-usage branch from 34caeb9 to 944a98d Compare May 3, 2020 15:20
Copy link
Member

@pietroalbini pietroalbini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly looks good!

The numbers seems fine, but I'd prefer to have them as environment variables with default values: if a need to tweak them during an outage arises I don't want to have to look into the source to find where they're defined and do a full deploy.


fn over_capcity_response() -> AfterResult {
// TODO: Generate an alert so we can investigate
let body = "Service temporarily unavailable";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll want an explicit dropped_due_to_low_capacity=true or similar item in the log, to know at a glance why that request returned a 503.

@jtgeibel
Copy link
Member Author

jtgeibel commented May 5, 2020

Thanks for the feedback! I've pushed a new commit to fix the typo, add logging to requests rejected due to capacity, and allow environment variables to override the hard-coded defaults.

@pietroalbini
Copy link
Member

This looks good!

@bors r+

@bors
Copy link
Contributor

bors commented May 5, 2020

📌 Commit 3a8bb56 has been approved by pietroalbini

@bors
Copy link
Contributor

bors commented May 5, 2020

⌛ Testing commit 3a8bb56 with merge 88b53bd...

@bors
Copy link
Contributor

bors commented May 5, 2020

☀️ Test successful - checks-travis
Approved by: pietroalbini
Pushing 88b53bd to master...

@bors bors merged commit 88b53bd into rust-lang:master May 5, 2020
@jtgeibel jtgeibel deleted the balance-db-pool-usage branch May 11, 2020 23:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants