Optimize queries that need to sort on recent_crate_downloads.downloads #1755


Closed

Conversation

sgrif
Contributor

@sgrif commented May 29, 2019

I'm seeing two queries that are beginning to slow down beyond acceptable
levels; both sort by `recent_crate_downloads.downloads`. The first one
is the "Most Recent Downloads" section on the summary page, which is
solved simply by adding an index on the downloads column.
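
Something like this is enough for the summary page (a sketch only; the
index name here is made up and the actual migration may differ):

```sql
-- Hypothetical migration sketch: a plain btree index on the materialized
-- view's downloads column lets Postgres pull the top N rows straight off
-- the index instead of sorting the whole view.
CREATE INDEX recent_crate_downloads__downloads
    ON recent_crate_downloads (downloads);
```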

The second case is the "recent downloads" sort on search, which is a
bit trickier. When the sort column comes from the right side of a left
join, the index can no longer be used (since there might not be a row
there at all). The only way to fix that is to switch to an inner join,
which would mean that crates that haven't been downloaded in the last
90 days won't show up in search. In practice this only affects very
recently published crates, since crawlers like libs.rs end up
downloading every crate. However, it could still lead to confusion, so
I've redefined the view to make sure there's a row present for every
crate. The crate still won't show up in search until the view gets
refreshed, though, which happens every 10 minutes.
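
Roughly what the redefined view looks like (a sketch only; the table
and column names are assumptions based on the existing schema, and the
real definition may differ):

```sql
-- Sketch: driving the view from `crates` with LEFT JOINs guarantees one
-- row per crate, with a download count of 0 when there were no downloads
-- in the last 90 days.
CREATE MATERIALIZED VIEW recent_crate_downloads (crate_id, downloads) AS
    SELECT crates.id, COALESCE(SUM(version_downloads.downloads), 0)
      FROM crates
      LEFT JOIN versions ON versions.crate_id = crates.id
      LEFT JOIN version_downloads
        ON version_downloads.version_id = versions.id
       AND version_downloads.date > CURRENT_DATE - INTERVAL '90 days'
     GROUP BY crates.id;
```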

This did cause some problems in our tests, though, since a bunch of
tests don't have any recent downloads but assume that the crate is
present in search, etc. Initially I tried always refreshing the view in
`PublishBuilder`, but since refreshing the view grabs an exclusive
lock, it meant our test suite was much less parallel, and it also
exacerbated tests that can deadlock (pretty much every test in
`teams.rs` has a deadlocking issue, but fixing it requires that no two
tests use the same team name, which is non-trivial).
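
For context, the refresh itself boils down to a single statement
(sketch; the exact invocation in the codebase may differ). A plain
`REFRESH` takes an `ACCESS EXCLUSIVE` lock on the view, which is what
serialized the tests; `REFRESH ... CONCURRENTLY` avoids blocking
readers, but requires a unique index on the view and does more work.

```sql
-- Blocking refresh: holds an ACCESS EXCLUSIVE lock on the view for the
-- duration, so concurrent tests touching the view queue up behind it.
REFRESH MATERIALIZED VIEW recent_crate_downloads;

-- Non-blocking alternative: readers aren't blocked, but it requires a
-- unique index on the view (e.g. on crate_id) and is slower.
REFRESH MATERIALIZED VIEW CONCURRENTLY recent_crate_downloads;
```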

Ultimately I opted to only refresh the view on the endpoints that
query it, which means the tests don't need to care about whether
they're depending on this or not, and tests which never hit those
endpoints won't block waiting for other tests that do. The
implementation was a tad tricky. We need to ignore errors that occur
while refreshing it, since we might be in read-only mode. I also had to
put the generic constraints for `C` on the struct itself instead of
just the `Handler` impl; otherwise the closures passed to it in the
unit tests for the router didn't seem to infer the correct closure
type.

There are two other options. The first would be to add the index
without the join change/view definition change, which would mean that
the summary page gets fixed but search stays slow. Summary is hit much
more frequently than search ordered by recent downloads, so this
wouldn't be the end of the world.

The second would be to run our test suite with `--test-threads=2`
(interestingly, two test threads do not actually run tests in parallel,
and our suite blows up with `--test-threads=1`). If we did this, we
could refresh the view from `CrateBuilder` (and the handful of tests
which *actually* hit the publish endpoint), so it wouldn't really
remove any complexity; it would just move it to the tests directory.

Before:

```
                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=5497.84..5497.89 rows=100 width=882) (actual time=52.612..52.628 rows=100 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=1418.25..1418.26 rows=1 width=8) (actual time=8.106..8.107 rows=1 loops=1)
           ->  Aggregate  (cost=1418.25..1418.26 rows=1 width=8) (actual time=8.105..8.105 rows=1 loops=1)
                 ->  Index Only Scan using packages_pkey on crates crates_1  (cost=0.06..1405.16 rows=26180 width=0) (actual time=0.011..6.732 rows=26181 loops=1)
                       Heap Fetches: 7526
   ->  Sort  (cost=4079.58..4092.67 rows=26180 width=882) (actual time=52.610..52.619 rows=100 loops=1)
         Sort Key: recent_crate_downloads.downloads DESC NULLS LAST
         Sort Method: top-N heapsort  Memory: 160kB
         ->  Hash Left Join  (cost=466.18..3879.46 rows=26180 width=882) (actual time=16.495..38.755 rows=26181 loops=1)
               Hash Cond: (crates.id = recent_crate_downloads.crate_id)
               ->  Seq Scan on crates  (cost=0.00..3399.54 rows=26180 width=865) (actual time=0.007..14.165 rows=26181 loops=1)
               ->  Hash  (cost=374.54..374.54 rows=26181 width=12) (actual time=8.338..8.338 rows=26181 loops=1)
                     Buckets: 32768  Batches: 1  Memory Usage: 1381kB
                     ->  Seq Scan on recent_crate_downloads  (cost=0.00..374.54 rows=26181 width=12) (actual time=0.004..4.592 rows=26181 loops=1)
 Planning Time: 0.341 ms
 Execution Time: 52.691 ms
```

After:

```
                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1418.37..1454.69 rows=100 width=882) (actual time=8.131..8.471 rows=100 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=1418.25..1418.26 rows=1 width=8) (actual time=8.095..8.095 rows=1 loops=1)
           ->  Aggregate  (cost=1418.25..1418.26 rows=1 width=8) (actual time=8.094..8.094 rows=1 loops=1)
                 ->  Index Only Scan using packages_pkey on crates crates_1  (cost=0.06..1405.16 rows=26180 width=0) (actual time=0.007..6.677 rows=26181 loops=1)
                       Heap Fetches: 7526
   ->  Nested Loop  (cost=0.12..9509.83 rows=26180 width=882) (actual time=8.131..8.461 rows=100 loops=1)
         ->  Index Scan Backward using sgrif_testing on recent_crate_downloads  (cost=0.06..890.88 rows=26181 width=12) (actual time=0.024..0.101 rows=100 loops=1)
         ->  Index Scan using packages_pkey on crates  (cost=0.06..0.33 rows=1 width=865) (actual time=0.002..0.002 rows=1 loops=100)
               Index Cond: (id = recent_crate_downloads.crate_id)
 Planning Time: 0.321 ms
 Execution Time: 8.521 ms
```

@jtgeibel
Member

> The crate still won't show up in search until the view gets
> refreshed, though, which happens every 10 minutes.

It's probably fairly common for a user to search for their crate shortly after publishing it for the first time. This also reaches a bit further, since the dashboard relies on the search endpoint: it hits `crates?user_id=123` and `crates?following=1`. (The list of crates on team pages is also populated by search results from this endpoint.)

Especially for search results and the dashboard, I'm afraid a delay of up to 10 minutes in results could confuse users.

> There are two other options. The first would be to add the index
> without the join change/view definition change, which would mean that
> the summary page gets fixed but search stays slow. Summary is hit much
> more frequently than search ordered by recent downloads, so this
> wouldn't be the end of the world.

I think this alternative sounds like a solid improvement that avoids the potential for confusing search results in various places in the frontend.

@sgrif
Contributor Author

sgrif commented Aug 19, 2019

I think I'd be fine with leaving this slow on search if we change the default ordering to something else, so crawlers which want this ordering have to explicitly request it. Right now this query on search accounts for 40% of our total DB load, but I suspect that will go down if we exclude all the crawlers which don't care about ordering. I will open a new PR for this.

@sgrif closed this Aug 19, 2019
@sgrif deleted the sg-optimize-download-sorting branch August 19, 2019 18:22
@smarnach
Contributor

Once we provide database dumps, the load from crawlers will likely decrease over time as well.

sgrif added a commit to sgrif/crates.io that referenced this pull request Aug 19, 2019
This changes the behavior of `/api/v1/crates` if no sort option is
provided. This does not change the behavior of the website when accessed
through a browser, as the Ember frontend always specifies a sort field.

The query run by search when sorting by recent downloads currently
accounts for nearly 40% of our total DB load. It takes ~60ms on average,
which means it's extremely high load. Our logs for traffic to this
endpoint with an explicit `sort=recent-downloads` don't match up with
this. We can fix this by making the query faster, but this has drawbacks
that were determined not to be worth it (see rust-lang#1755). So if we can't
decrease execution time, the only other option is to decrease execution
count.

A significant number of crawlers hit us without specifying any sorting,
which should mean they don't care about ordering. I'm sure there's
someone out there who is relying on this behavior, but most crawlers
I've seen don't specify ordering, and none of them care about the order
they get the results in since they're crawling the whole registry
anyway.

What's important is that this lowers the execution time from ~60ms to
~12ms. And sorting on recent_downloads will grow linearly with the table
size, while this can be done on an index alone (which should be
log(log(n)) IIRC). >90% of the query is spent on getting the count,
which will be going away soon as we change pagination strategies,
bringing us to sub-millisecond execution.

I do feel strongly that we either need to make this change, or accept
the drawbacks in rust-lang#1755, as this will soon become a scaling issue for us.
sgrif added a commit to sgrif/crates.io that referenced this pull request Aug 19, 2019
sgrif added a commit to sgrif/crates.io that referenced this pull request Sep 5, 2019
sgrif added a commit to sgrif/crates.io that referenced this pull request Sep 5, 2019
sgrif added a commit to sgrif/crates.io that referenced this pull request Sep 5, 2019
bors added a commit that referenced this pull request Sep 11, 2019

Change the default sorting on search to something less expensive

r? @jtgeibel