-
Notifications
You must be signed in to change notification settings - Fork 642
Optimize queries that need to sort on recent_crate_downloads.downloads
#1755
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm seeing two queries that are beginning to slow down beyond acceptable levels, both sort by `recent_crate_downloads.downloads`. The first one is the "Most Recent Downloads" section on the summary page, which is just solved by adding an index on the downloads column. The second case is the "recent downloads" sort on search, which is a bit trickier. When the column is being sorted from the right side of a left join, the index can no longer be used (since there might not be a row there at all). The only way to fix this is to switch to using an inner join, which would mean that crates that haven't been downloaded in the last 90 days won't show up in search. In practice this only occurs for very recently uploaded crates, since crawlers like libs.rs end up downloading every crate. However, it could still lead to confusion, so I've redefined the view to make sure there's a row present for every crate. The crate still won't show up in search until the view gets updated though, which occurs every 10 minutes. This did cause some problems in our tests though, since a bunch of tests don't have any recent downloads, but assume that the crate is present in search, etc. Initially I tried just always updating the view in `PublishBuilder`, but since refreshing the view grabs an exclusive lock, it meant our test suite was much less parallel, and it also exacerbated tests that can deadlock (pretty much every test in teams.rs has a deadlocking issue, but fixing it requires none of the tests using the same team name which is non-trivial). Ultimately I opted to only refresh the view on the endpoints that will be querying it, which means the tests don't need to care about whether they're depending on this or not, and tests which never hit those endpoints won't block waiting for other tests that do. The implementation was a tad tricky. We need to ignore errors that occur refreshing it since we might be in read only mode. I also had to put the generic constraints for `C` on the struct itself instead of just the `Handler` impl, otherwise the closures passed to it in the unit tests for the router didn't seem to infer the correct closure type. Two other options would be to add the index without the join change/view definition change, which would mean that the summary page gets fixed, but search is still slow. Summary is much more frequently hit than search ordered by recent downloads, so this wouldn't be the end of the world. The other option would be to run our test suite with `--test-threads=2` (interestingly, 2 test threads does not run tests in parallel, and our suite blows up with `--test-threads=1`). If we did this, we could refresh the view from `CrateBuilder` (and the handful of tests which *actually* hit the publish endpoint), so it wouldn't really remove any complexity, it'd just move it to the tests directory. Before: ``` QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=5497.84..5497.89 rows=100 width=882) (actual time=52.612..52.628 rows=100 loops=1) InitPlan 1 (returns $0) -> Limit (cost=1418.25..1418.26 rows=1 width=8) (actual time=8.106..8.107 rows=1 loops=1) -> Aggregate (cost=1418.25..1418.26 rows=1 width=8) (actual time=8.105..8.105 rows=1 loops=1) -> Index Only Scan using packages_pkey on crates crates_1 (cost=0.06..1405.16 rows=26180 width=0) (actual time=0.011..6.732 rows=26181 loops=1) Heap Fetches: 7526 -> Sort (cost=4079.58..4092.67 rows=26180 width=882) (actual time=52.610..52.619 rows=100 loops=1) Sort Key: recent_crate_downloads.downloads DESC NULLS LAST Sort Method: top-N heapsort Memory: 160kB -> Hash Left Join (cost=466.18..3879.46 rows=26180 width=882) (actual time=16.495..38.755 rows=26181 loops=1) Hash Cond: (crates.id = recent_crate_downloads.crate_id) -> Seq Scan on crates (cost=0.00..3399.54 rows=26180 width=865) (actual time=0.007..14.165 rows=26181 loops=1) -> Hash (cost=374.54..374.54 rows=26181 width=12) (actual time=8.338..8.338 rows=26181 loops=1) Buckets: 32768 Batches: 1 Memory Usage: 1381kB -> Seq Scan on recent_crate_downloads (cost=0.00..374.54 rows=26181 width=12) (actual time=0.004..4.592 rows=26181 loops=1) Planning Time: 0.341 ms Execution Time: 52.691 ms ``` After: ``` QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=5497.84..5497.89 rows=100 width=882) (actual time=57.899..57.917 rows=100 loops=1) InitPlan 1 (returns $0) -> Limit (cost=1418.25..1418.26 rows=1 width=8) (actual time=9.722..9.723 rows=1 loops=1) -> Aggregate (cost=1418.25..1418.26 rows=1 width=8) (actual time=9.721..9.721 rows=1 loops=1) -> Index Only Scan using packages_pkey on crates crates_1 (cost=0.06..1405.16 rows=26180 width=0) (actual time=0.034..8.335 rows=26181 loops=1) Heap Fetches: 7526 -> Sort (cost=4079.58..4092.67 rows=26180 width=882) (actual time=57.898..57.907 rows=100 loops=1) Sort Key: recent_crate_downloads.downloads DESC Sort Method: top-N heapsort Memory: 160kB -> Hash Left Join (cost=466.18..3879.46 rows=26180 width=882) (actual time=19.839..44.014 rows=26181 loops=1) Hash Cond: (crates.id = recent_crate_downloads.crate_id) -> Seq Scan on crates (cost=0.00..3399.54 rows=26180 width=865) (actual time=0.010..15.211 rows=26181 loops=1) -> Hash (cost=374.54..374.54 rows=26181 width=12) (actual time=9.941..9.941 rows=26181 loops=1) Buckets: 32768 Batches: 1 Memory Usage: 1381kB -> Seq Scan on recent_crate_downloads (cost=0.00..374.54 rows=26181 width=12) (actual time=0.008..5.128 rows=26181 loops=1) Planning Time: 1.287 ms Execution Time: 58.018 ms ```
Its probably fairly common for a user to search for their crate shortly after publishing for the first time. This is also a bit further reaching, as the dashboard relies on the search endpoint. It hits Especially for search results and the dashboard, I'm afraid a delay of up to 10 minutes in results could confuse users.
I think this alternative sounds like a solid improvement that avoids the potential for confusing search results in various places in the frontend. |
I think I'd be fine with leaving this slow on search if we change the default ordering to something else, so crawlers which want this have to explicitly request it -- Right now this query on search accounts for 40% of our total DB load, but I suspect that will go down if we exclude all the crawlers which don't care about ordering. I will open a new PR for this |
Once we provide database dumps, the load from crawlers will likely reduce over time as well. |
This changes the behavior of `/api/v1/crates` if no sort option is provided. This does not change the behavior of the website when accessed through a browser, as the Ember frontend always specifies a sort field. The query run by search when sorting by recent downloads currently accounts for nearly 40% of our total DB load. It takes ~60ms on average, which means it's extremely high load. Our logs for traffic to this endpoint with an explicit `sort=recent-downloads` don't match up with this. We can fix this by making the query faster, but this has drawbacks that were determined not to be worth it (see rust-lang#1755). So if we can't decrease execution time, the only other option is to decrease execution count. A significant number of crawlers hit us without specifying any sorting, which should mean they don't care about ordering. I'm sure there's someone out there who is relying on this behavior, but most crawlers I've seen don't specify ordering, and none of them care about the order they get the results in since they're crawling the whole registry anyway. What's important is that this lowers the execution time from ~60ms to ~12ms. And sorting on recent_downloads will grow linearly with the table size, while this can be done on an index alone (which should be log(log(n)) IIRC). >90% of the query is spent on getting the count, which will be going away soon as we change pagination strategies, bringing us to sub-millisecond execution. I do feel strongly that we either need to make this change, or accept the drawbacks in rust-lang#1755, as this will soon become a scaling issue for us.
This changes the behavior of `/api/v1/crates` if no sort option is provided. This does not change the behavior of the website when accessed through a browser, as the Ember frontend always specifies a sort field. The query run by search when sorting by recent downloads currently accounts for nearly 40% of our total DB load. It takes ~60ms on average, which means it's extremely high load. Our logs for traffic to this endpoint with an explicit `sort=recent-downloads` don't match up with this. We can fix this by making the query faster, but this has drawbacks that were determined not to be worth it (see rust-lang#1755). So if we can't decrease execution time, the only other option is to decrease execution count. A significant number of crawlers hit us without specifying any sorting, which should mean they don't care about ordering. I'm sure there's someone out there who is relying on this behavior, but most crawlers I've seen don't specify ordering, and none of them care about the order they get the results in since they're crawling the whole registry anyway. What's important is that this lowers the execution time from ~60ms to ~12ms. And sorting on recent_downloads will grow linearly with the table size, while this can be done on an index alone (which should be log(log(n)) IIRC). >90% of the query is spent on getting the count, which will be going away soon as we change pagination strategies, bringing us to sub-millisecond execution. I do feel strongly that we either need to make this change, or accept the drawbacks in rust-lang#1755, as this will soon become a scaling issue for us.
This changes the behavior of `/api/v1/crates` if no sort option is provided. This does not change the behavior of the website when accessed through a browser, as the Ember frontend always specifies a sort field. The query run by search when sorting by recent downloads currently accounts for nearly 40% of our total DB load. It takes ~60ms on average, which means it's extremely high load. Our logs for traffic to this endpoint with an explicit `sort=recent-downloads` don't match up with this. We can fix this by making the query faster, but this has drawbacks that were determined not to be worth it (see rust-lang#1755). So if we can't decrease execution time, the only other option is to decrease execution count. A significant number of crawlers hit us without specifying any sorting, which should mean they don't care about ordering. I'm sure there's someone out there who is relying on this behavior, but most crawlers I've seen don't specify ordering, and none of them care about the order they get the results in since they're crawling the whole registry anyway. What's important is that this lowers the execution time from ~60ms to ~12ms. And sorting on recent_downloads will grow linearly with the table size, while this can be done on an index alone (which should be log(log(n)) IIRC). >90% of the query is spent on getting the count, which will be going away soon as we change pagination strategies, bringing us to sub-millisecond execution. I do feel strongly that we either need to make this change, or accept the drawbacks in rust-lang#1755, as this will soon become a scaling issue for us.
This changes the behavior of `/api/v1/crates` if no sort option is provided. This does not change the behavior of the website when accessed through a browser, as the Ember frontend always specifies a sort field. The query run by search when sorting by recent downloads currently accounts for nearly 40% of our total DB load. It takes ~60ms on average, which means it's extremely high load. Our logs for traffic to this endpoint with an explicit `sort=recent-downloads` don't match up with this. We can fix this by making the query faster, but this has drawbacks that were determined not to be worth it (see rust-lang#1755). So if we can't decrease execution time, the only other option is to decrease execution count. A significant number of crawlers hit us without specifying any sorting, which should mean they don't care about ordering. I'm sure there's someone out there who is relying on this behavior, but most crawlers I've seen don't specify ordering, and none of them care about the order they get the results in since they're crawling the whole registry anyway. What's important is that this lowers the execution time from ~60ms to ~12ms. And sorting on recent_downloads will grow linearly with the table size, while this can be done on an index alone (which should be log(log(n)) IIRC). >90% of the query is spent on getting the count, which will be going away soon as we change pagination strategies, bringing us to sub-millisecond execution. I do feel strongly that we either need to make this change, or accept the drawbacks in rust-lang#1755, as this will soon become a scaling issue for us.
This changes the behavior of `/api/v1/crates` if no sort option is provided. This does not change the behavior of the website when accessed through a browser, as the Ember frontend always specifies a sort field. The query run by search when sorting by recent downloads currently accounts for nearly 40% of our total DB load. It takes ~60ms on average, which means it's extremely high load. Our logs for traffic to this endpoint with an explicit `sort=recent-downloads` don't match up with this. We can fix this by making the query faster, but this has drawbacks that were determined not to be worth it (see rust-lang#1755). So if we can't decrease execution time, the only other option is to decrease execution count. A significant number of crawlers hit us without specifying any sorting, which should mean they don't care about ordering. I'm sure there's someone out there who is relying on this behavior, but most crawlers I've seen don't specify ordering, and none of them care about the order they get the results in since they're crawling the whole registry anyway. What's important is that this lowers the execution time from ~60ms to ~12ms. And sorting on recent_downloads will grow linearly with the table size, while this can be done on an index alone (which should be log(log(n)) IIRC). >90% of the query is spent on getting the count, which will be going away soon as we change pagination strategies, bringing us to sub-millisecond execution. I do feel strongly that we either need to make this change, or accept the drawbacks in rust-lang#1755, as this will soon become a scaling issue for us.
…ibel Change the default sorting on search to something less expensive This changes the behavior of `/api/v1/crates` if no sort option is provided. This does not change the behavior of the website when accessed through a browser, as the Ember frontend always specifies a sort field. The query run by search when sorting by recent downloads currently accounts for nearly 40% of our total DB load. It takes ~60ms on average, which means it's extremely high load. Our logs for traffic to this endpoint with an explicit `sort=recent-downloads` don't match up with this. We can fix this by making the query faster, but this has drawbacks that were determined not to be worth it (see #1755). So if we can't decrease execution time, the only other option is to decrease execution count. A significant number of crawlers hit us without specifying any sorting, which should mean they don't care about ordering. I'm sure there's someone out there who is relying on this behavior, but most crawlers I've seen don't specify ordering, and none of them care about the order they get the results in since they're crawling the whole registry anyway. What's important is that this lowers the execution time from ~60ms to ~12ms. And sorting on recent_downloads will grow linearly with the table size, while this can be done on an index alone (which should be log(log(n)) IIRC). >90% of the query is spent on getting the count, which will be going away soon as we change pagination strategies, bringing us to sub-millisecond execution. I do feel strongly that we either need to make this change, or accept the drawbacks in #1755, as this will soon become a scaling issue for us. r? @jtgeibel
I'm seeing two queries that are beginning to slow down beyond acceptable
levels, both sort by
recent_crate_downloads.downloads
. The first oneis the "Most Recent Downloads" section on the summary page, which is
just solved by adding an index on the downloads column.
The second case is the "recent downloads" sort on search, which is a bit
trickier. When the column is being sorted from the right side of a left
join, the index can no longer be used (since there might not be a row
there at all). The only way to fix this is to switch to using an inner
join, which would mean that crates that haven't been downloaded in the
last 90 days won't show up in search. In practice this only occurs for
very recently uploaded crates, since crawlers like libs.rs end up
downloading every crate. However, it could still lead to confusion, so
I've redefined the view to make sure there's a row present for every
crate. The crate still won't show up in search until the view gets
updated though, which occurs every 10 minutes.
This did cause some problems in our tests though, since a bunch of tests
don't have any recent downloads, but assume that the crate is present in
search, etc. Initially I tried just always updating the view in
PublishBuilder
, but since refreshing the view grabs an exclusive lock,it meant our test suite was much less parallel, and it also exacerbated
tests that can deadlock (pretty much every test in teams.rs has a
deadlocking issue, but fixing it requires none of the tests using the
same team name which is non-trivial).
Ultimately I opted to only refresh the view on the endpoints that will
be querying it, which means the tests don't need to care about whether
they're depending on this or not, and tests which never hit those
endpoints won't block waiting for other tests that do. The
implementation was a tad tricky. We need to ignore errors that occur
refreshing it since we might be in read only mode. I also had to put the
generic constraints for
C
on the struct itself instead of just theHandler
impl, otherwise the closures passed to it in the unit testsfor the router didn't seem to infer the correct closure type.
Two other options would be to add the index without the join change/view
definition change, which would mean that the summary page gets fixed,
but search is still slow. Summary is much more frequently hit than
search ordered by recent downloads, so this wouldn't be the end of the
world.
The other option would be to run our test suite with
--test-threads=2
(interestingly, 2 test threads does not run tests in parallel, and our
suite blows up with
--test-threads=1
). If we did this, we couldrefresh the view from
CrateBuilder
(and the handful of tests whichactually hit the publish endpoint), so it wouldn't really remove any
complexity, it'd just move it to the tests directory.
Before:
After: