Optimize queries that need to sort on recent_crate_downloads.downloads #1755


Closed

Conversation

sgrif
Contributor

@sgrif commented May 29, 2019

I'm seeing two queries that are beginning to slow down beyond acceptable
levels; both sort by `recent_crate_downloads.downloads`. The first one
is the "Most Recent Downloads" section on the summary page, which is
solved simply by adding an index on the downloads column.
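
Something like this is enough for the summary page (a sketch only; the
index name here is made up and the actual migration may differ):

```sql
-- Hypothetical migration sketch: a plain btree index on the materialized
-- view's downloads column lets Postgres pull the top N rows straight off
-- the index instead of sorting the whole view.
CREATE INDEX recent_crate_downloads__downloads
    ON recent_crate_downloads (downloads);
```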

The second case is the "recent downloads" sort on search, which is a
bit trickier. When the sort column comes from the right side of a left
join, the index can no longer be used (since there might not be a row
there at all). The only way to fix that is to switch to an inner join,
which would mean that crates that haven't been downloaded in the last
90 days won't show up in search. In practice this only affects very
recently published crates, since crawlers like libs.rs end up
downloading every crate. However, it could still lead to confusion, so
I've redefined the view to make sure there's a row present for every
crate. The crate still won't show up in search until the view gets
refreshed, though, which happens every 10 minutes.
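
Roughly what the redefined view looks like (a sketch only; the table
and column names are assumptions based on the existing schema, and the
real definition may differ):

```sql
-- Sketch: driving the view from `crates` with LEFT JOINs guarantees one
-- row per crate, with a download count of 0 when there were no downloads
-- in the last 90 days.
CREATE MATERIALIZED VIEW recent_crate_downloads (crate_id, downloads) AS
    SELECT crates.id, COALESCE(SUM(version_downloads.downloads), 0)
      FROM crates
      LEFT JOIN versions ON versions.crate_id = crates.id
      LEFT JOIN version_downloads
        ON version_downloads.version_id = versions.id
       AND version_downloads.date > CURRENT_DATE - INTERVAL '90 days'
     GROUP BY crates.id;
```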

This did cause some problems in our tests, though, since a bunch of
tests don't have any recent downloads but assume that the crate is
present in search, etc. Initially I tried always refreshing the view in
`PublishBuilder`, but since refreshing the view grabs an exclusive
lock, it meant our test suite was much less parallel, and it also
exacerbated tests that can deadlock (pretty much every test in
`teams.rs` has a deadlocking issue, but fixing it requires that no two
tests use the same team name, which is non-trivial).
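
For context, the refresh itself boils down to a single statement
(sketch; the exact invocation in the codebase may differ). A plain
`REFRESH` takes an `ACCESS EXCLUSIVE` lock on the view, which is what
serialized the tests; `REFRESH ... CONCURRENTLY` avoids blocking
readers, but requires a unique index on the view and does more work.

```sql
-- Blocking refresh: holds an ACCESS EXCLUSIVE lock on the view for the
-- duration, so concurrent tests touching the view queue up behind it.
REFRESH MATERIALIZED VIEW recent_crate_downloads;

-- Non-blocking alternative: readers aren't blocked, but it requires a
-- unique index on the view (e.g. on crate_id) and is slower.
REFRESH MATERIALIZED VIEW CONCURRENTLY recent_crate_downloads;
```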

Ultimately I opted to only refresh the view on the endpoints that
query it, which means the tests don't need to care about whether
they're depending on this or not, and tests which never hit those
endpoints won't block waiting for other tests that do. The
implementation was a tad tricky. We need to ignore errors that occur
while refreshing it, since we might be in read-only mode. I also had to
put the generic constraints for `C` on the struct itself instead of
just the `Handler` impl; otherwise the closures passed to it in the
unit tests for the router didn't seem to infer the correct closure
type.

There are two other options. The first would be to add the index
without the join change/view definition change, which would mean that
the summary page gets fixed but search stays slow. Summary is hit much
more frequently than search ordered by recent downloads, so this
wouldn't be the end of the world.

The second would be to run our test suite with `--test-threads=2`
(interestingly, two test threads do not actually run tests in parallel,
and our suite blows up with `--test-threads=1`). If we did this, we
could refresh the view from `CrateBuilder` (and the handful of tests
which *actually* hit the publish endpoint), so it wouldn't really
remove any complexity; it would just move it to the tests directory.

Before:

```
                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=5497.84..5497.89 rows=100 width=882) (actual time=52.612..52.628 rows=100 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=1418.25..1418.26 rows=1 width=8) (actual time=8.106..8.107 rows=1 loops=1)
           ->  Aggregate  (cost=1418.25..1418.26 rows=1 width=8) (actual time=8.105..8.105 rows=1 loops=1)
                 ->  Index Only Scan using packages_pkey on crates crates_1  (cost=0.06..1405.16 rows=26180 width=0) (actual time=0.011..6.732 rows=26181 loops=1)
                       Heap Fetches: 7526
   ->  Sort  (cost=4079.58..4092.67 rows=26180 width=882) (actual time=52.610..52.619 rows=100 loops=1)
         Sort Key: recent_crate_downloads.downloads DESC NULLS LAST
         Sort Method: top-N heapsort  Memory: 160kB
         ->  Hash Left Join  (cost=466.18..3879.46 rows=26180 width=882) (actual time=16.495..38.755 rows=26181 loops=1)
               Hash Cond: (crates.id = recent_crate_downloads.crate_id)
               ->  Seq Scan on crates  (cost=0.00..3399.54 rows=26180 width=865) (actual time=0.007..14.165 rows=26181 loops=1)
               ->  Hash  (cost=374.54..374.54 rows=26181 width=12) (actual time=8.338..8.338 rows=26181 loops=1)
                     Buckets: 32768  Batches: 1  Memory Usage: 1381kB
                     ->  Seq Scan on recent_crate_downloads  (cost=0.00..374.54 rows=26181 width=12) (actual time=0.004..4.592 rows=26181 loops=1)
 Planning Time: 0.341 ms
 Execution Time: 52.691 ms
```

After:

```
                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1418.37..1454.69 rows=100 width=882) (actual time=8.131..8.471 rows=100 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=1418.25..1418.26 rows=1 width=8) (actual time=8.095..8.095 rows=1 loops=1)
           ->  Aggregate  (cost=1418.25..1418.26 rows=1 width=8) (actual time=8.094..8.094 rows=1 loops=1)
                 ->  Index Only Scan using packages_pkey on crates crates_1  (cost=0.06..1405.16 rows=26180 width=0) (actual time=0.007..6.677 rows=26181 loops=1)
                       Heap Fetches: 7526
   ->  Nested Loop  (cost=0.12..9509.83 rows=26180 width=882) (actual time=8.131..8.461 rows=100 loops=1)
         ->  Index Scan Backward using sgrif_testing on recent_crate_downloads  (cost=0.06..890.88 rows=26181 width=12) (actual time=0.024..0.101 rows=100 loops=1)
         ->  Index Scan using packages_pkey on crates  (cost=0.06..0.33 rows=1 width=865) (actual time=0.002..0.002 rows=1 loops=100)
               Index Cond: (id = recent_crate_downloads.crate_id)
 Planning Time: 0.321 ms
 Execution Time: 8.521 ms
```

@jtgeibel
Member

> The crate still won't show up in search until the view gets
> refreshed, though, which happens every 10 minutes.

It's probably fairly common for a user to search for their crate shortly after publishing it for the first time. This also reaches a bit further, since the dashboard relies on the search endpoint: it hits `crates?user_id=123` and `crates?following=1`. (The list of crates on team pages is also populated by search results from this endpoint.)

Especially for search results and the dashboard, I'm afraid a delay of up to 10 minutes in results could confuse users.

> There are two other options. The first would be to add the index
> without the join change/view definition change, which would mean that
> the summary page gets fixed but search stays slow. Summary is hit much
> more frequently than search ordered by recent downloads, so this
> wouldn't be the end of the world.

I think this alternative sounds like a solid improvement that avoids the potential for confusing search results in various places in the frontend.

@sgrif
Contributor Author

sgrif commented Aug 19, 2019

I think I'd be fine with leaving this slow on search if we change the default ordering to something else, so crawlers which want this ordering have to explicitly request it. Right now this query on search accounts for 40% of our total DB load, but I suspect that will go down if we exclude all the crawlers which don't care about ordering. I will open a new PR for this.

@sgrif closed this Aug 19, 2019
@sgrif deleted the sg-optimize-download-sorting branch August 19, 2019 18:22
@smarnach
Contributor

Once we provide database dumps, the load from crawlers will likely decrease over time as well.

sgrif added a commit to sgrif/crates.io that referenced this pull request Aug 19, 2019
This changes the behavior of `/api/v1/crates` if no sort option is
provided. This does not change the behavior of the website when accessed
through a browser, as the Ember frontend always specifies a sort field.

The query run by search when sorting by recent downloads currently
accounts for nearly 40% of our total DB load. It takes ~60ms on average,
which means it's extremely high load. Our logs for traffic to this
endpoint with an explicit `sort=recent-downloads` don't match up with
this. We can fix this by making the query faster, but this has drawbacks
that were determined not to be worth it (see rust-lang#1755). So if we can't
decrease execution time, the only other option is to decrease execution
count.

A significant number of crawlers hit us without specifying any sorting,
which should mean they don't care about ordering. I'm sure there's
someone out there who is relying on this behavior, but most crawlers
I've seen don't specify ordering, and none of them care about the order
they get the results in since they're crawling the whole registry
anyway.

What's important is that this lowers the execution time from ~60ms to
~12ms. And sorting on recent_downloads will grow linearly with the table
size, while this can be done on an index alone (which should be
log(log(n)) IIRC). >90% of the query is spent on getting the count,
which will be going away soon as we change pagination strategies,
bringing us to sub-millisecond execution.

I do feel strongly that we either need to make this change, or accept
the drawbacks in rust-lang#1755, as this will soon become a scaling issue for us.
sgrif added a commit to sgrif/crates.io that referenced this pull request Aug 19, 2019
sgrif added a commit to sgrif/crates.io that referenced this pull request Sep 5, 2019
sgrif added a commit to sgrif/crates.io that referenced this pull request Sep 5, 2019
sgrif added a commit to sgrif/crates.io that referenced this pull request Sep 5, 2019
bors added a commit that referenced this pull request Sep 11, 2019

Change the default sorting on search to something less expensive

r? @jtgeibel