Add middleware to prioritize download traffic #2479

jtgeibel · 2020-04-30T22:55:29Z

In recent months we've had several incidents where bot traffic has sent hundreds of expensive requests per minute, starving database resources and resulting in timeouts on download requests. While cargo will retry download requests, builds are sometimes still affected. For instance, here is an example from our own CI: https://github.com/rust-lang/crates.io/runs/631489355.

This new middleware layer will reject some requests as the database pool reaches capacity. At 20% load, the in_flight_requests count is added to the log output. At 70% load, all safe requests (GET, HEAD, OPTIONS, TRACE) are rejected immediately. This will reject many legitimate frontend requests as well, but should catch all bot traffic (which is unlikely to send PUT, POST, or DELETE requests). This filter also helps avoid rejecting frontend requests that update the database where we don’t always provide good feedback for errors in the UI.

Finally, at 80% load all non-download traffic is rejected. In other words, at least 20% of database connections are reserved for handling download traffic. By choosing to drop other requests, there should be sufficient database connections available to avoid queuing and timeouts on download requests.
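The thresholds described above can be sketched as a small decision function. This is only an illustration, not the code in this PR: the names `Decision`, `load_percent`, `should_log_in_flight`, and `decide` are hypothetical. The sketch also assumes that load is computed as an integer percentage of the pool size, that thresholds are compared with `>=`, and that download requests are exempt from the 70% rule even though they are `GET` requests.

```rust
/// Outcome for a single request (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Decision {
    Serve,
    Reject,
}

// Thresholds from the description, as percentages of the database pool size.
const LOG_AT_PERCENT: usize = 20;
const THROTTLE_AT_PERCENT: usize = 70;
const DL_ONLY_AT_PERCENT: usize = 80;

/// Safe methods as listed in the description.
fn is_safe_method(method: &str) -> bool {
    matches!(method, "GET" | "HEAD" | "OPTIONS" | "TRACE")
}

/// Load as an integer percentage (assumption: rounded down; an empty pool is
/// treated as size 1 to avoid dividing by zero).
fn load_percent(in_flight: usize, pool_size: usize) -> usize {
    in_flight * 100 / pool_size.max(1)
}

/// At 20% load and above, the in-flight count is added to the log output.
fn should_log_in_flight(load: usize) -> bool {
    load >= LOG_AT_PERCENT
}

/// Decide whether to serve or reject a request at the given load.
fn decide(load: usize, method: &str, is_download: bool) -> Decision {
    // Assumption: download requests are always served.
    if is_download {
        return Decision::Serve;
    }
    // At 80% load, all non-download traffic is rejected.
    if load >= DL_ONLY_AT_PERCENT {
        return Decision::Reject;
    }
    // At 70% load, safe (read-only) requests are rejected.
    if load >= THROTTLE_AT_PERCENT && is_safe_method(method) {
        return Decision::Reject;
    }
    Decision::Serve
}
```

Under these assumptions, a `GET` for a crate page is rejected at 70% load while a `PUT` is still served, and only download requests are served once load reaches 80%.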

There is some overlap with the LogConnectionPoolStatus middleware. These may eventually be consolidated to avoid some duplicate work and to make smarter decisions regarding the instantaneous spare capacity of individual pools (primary vs read-only replica). The current heuristics are very simple, but I believe they are sufficient to meet our current needs with a large margin for growth in traffic.

The existing middleware does give us some insight into our current in_flight_requests counts on production. Looking through the logs, it is rare to have more than a few requests running at the same time. This is because download requests are completed very quickly and other API traffic accounts for only about 10 requests per second.

This middleware layer is added as the last layer in the stack, so requests that are served (such as static Ember HTML) or blocked by earlier layers never reach it. That is fine, because those requests do not use a database connection and should not block server threads for long.

r? @pietroalbini

bors · 2020-05-03T07:09:45Z

☔ The latest upstream changes (presumably #2483) made this pull request unmergeable. Please resolve the merge conflicts.

pietroalbini

This mostly looks good!

The numbers seem fine, but I'd prefer to have them as environment variables with default values: if a need to tweak them during an outage arises, I don't want to have to look into the source to find where they're defined and do a full deploy.

pietroalbini · 2020-05-04T14:24:41Z

src/middleware/balance_capacity.rs

+
+fn over_capcity_response() -> AfterResult {
+    // TODO: Generate an alert so we can investigate
+    let body = "Service temporarily unavailable";


I think we'll want an explicit dropped_due_to_low_capacity=true or similar item in the log, to know at a glance why that request returned a 503.

jtgeibel · 2020-05-05T01:15:52Z

Thanks for the feedback! I've pushed a new commit to fix the typo, add logging to requests rejected due to capacity, and allow environment variables to override the hard-coded defaults.
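As a rough sketch of what "environment variables with default values" can look like, the snippet below reads each limit from the environment and falls back to the hard-coded default when the variable is unset or not a valid number. The variable names (`EXAMPLE_LOG_AT_PERCENT` and so on), the `LoadLimits` struct, and the helper functions are hypothetical and are not taken from the commit.

```rust
use std::env;

/// Parse an optional raw value, falling back to `default` when it is missing
/// or is not a valid unsigned integer.
fn parse_percent(raw: Option<String>, default: usize) -> usize {
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .unwrap_or(default)
}

/// Read a percentage from the environment variable `name`.
fn percent_from_env(name: &str, default: usize) -> usize {
    parse_percent(env::var(name).ok(), default)
}

/// Load limits with the defaults from the PR description
/// (hypothetical struct, for illustration only).
#[derive(Debug, PartialEq, Eq)]
struct LoadLimits {
    log_at_percent: usize,
    throttle_at_percent: usize,
    dl_only_at_percent: usize,
}

impl LoadLimits {
    /// Build the limits, letting the environment override each default.
    /// The variable names below are placeholders, not the real ones.
    fn from_env() -> Self {
        Self {
            log_at_percent: percent_from_env("EXAMPLE_LOG_AT_PERCENT", 20),
            throttle_at_percent: percent_from_env("EXAMPLE_THROTTLE_AT_PERCENT", 70),
            dl_only_at_percent: percent_from_env("EXAMPLE_DL_ONLY_AT_PERCENT", 80),
        }
    }
}
```

With this shape, an operator could change a threshold during an outage by setting a variable and restarting, without a source change. Silently ignoring unparseable values is a design choice of this sketch; a real implementation might prefer to fail loudly at startup.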

pietroalbini · 2020-05-05T13:10:16Z

This looks good!

@bors r+

bors · 2020-05-05T13:10:17Z

📌 Commit 3a8bb56 has been approved by pietroalbini

bors · 2020-05-05T13:10:25Z

⌛ Testing commit 3a8bb56 with merge 88b53bd...

bors · 2020-05-05T13:17:39Z

☀️ Test successful - checks-travis
Approved by: pietroalbini
Pushing 88b53bd to master...

rust-highfive assigned pietroalbini Apr 30, 2020

rust-highfive added the S-waiting-on-review label Apr 30, 2020

jtgeibel force-pushed the balance-db-pool-usage branch from 168e47e to 35734fe Compare April 30, 2020 23:05

jtgeibel added 3 commits May 3, 2020 11:18

Avoid cloning app config where possible

2867d08

Fix serving local_uploads during development

944a98d

To serve uploaded crates during local development, the middleware must serve the request before the ember rewrite occurs. Additionally, these requests should not affect the tally in the `BalanceCapacity` middleware.

jtgeibel force-pushed the balance-db-pool-usage branch from 34caeb9 to 944a98d Compare May 3, 2020 15:20

pietroalbini requested changes May 4, 2020

Allow load limits to be configured from environment

3a8bb56

... and fix a typo

pietroalbini approved these changes May 5, 2020

bors merged commit 88b53bd into rust-lang:master May 5, 2020

jtgeibel deleted the balance-db-pool-usage branch May 11, 2020 23:18
