Skip to content

loosen search #1560

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 24, 2019
Merged

loosen search #1560

merged 11 commits into from
Apr 24, 2019

Conversation

FreeMasen
Copy link

To address #159,

It might be worth it to add a new flag for searching with the original behavior.

I have played around with adding a flag to the query params, strict to allow the original behavior. It is very much a viable option, I didn't want to make any UI changes w/o first checking in about what it should look like.

@sgrif
Copy link
Contributor

sgrif commented Jan 24, 2019

Hi @FreeMasen,

Thanks for the PR. This needs some more work before it can be merged. To start, you've got a very large loose_search function which is never used, and should be removed. The comparison should also be case insensitive here, so we should use ilike instead of like. This also doesn't have any tests, which will be needed (in particular we'd like to be sure these are at the bottom of the list when sorted by "relevance" -- we're discussing larger changes to search in general, but in the mean time we don't want results that only came in from like on the title to be higher than those that are ranked highly in the existing search).

Let me know if you're still interested in working on this, or if you need additional help. I'll do a more in depth review once these issues are addressed.

@FreeMasen
Copy link
Author

I'll get started cleaning things up now.

@FreeMasen
Copy link
Author

I added a test with one of the expectations from the original issue (#159). I am not sure what the expectation should be so let me know if you are looking for something specific.

@FreeMasen
Copy link
Author

It appears that I accidentally close this PR, reopening

@FreeMasen FreeMasen reopened this Jan 25, 2019
/// WHERE name like $1
/// ```
pub fn like_name(name: &str) -> LikeName<'_> {
canon_crate_name(crates::name).ilike(canon_crate_name(name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just like is enough here, canon_crate_name already normalizes case

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change this back, the only reason I used ilike was from your previous comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that :(

/// ```sql
/// SELECT *
/// FROM crates
/// WHERE name like $1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the SQL examples here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can be removed.

.description("description")
.keyword("kw1")
.expect_build(conn);
// mkstemp should appear 5th
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put a bit more context in here? Why should this be 5th instead of tempfile?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason that is the case is because it was inserted later. I can remove that one.

@@ -414,6 +414,69 @@ fn exact_match_on_queries_with_sort() {
assert_eq!(json.crates[3].name, "other_sort");
}

#[test]
fn loose_search_order() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is extremely long given how little code is being changed. Do you think this might be retesting things that are already tested elsewhere? Would it be worth trying to narrow the scope a bit to make it more clear what's being tested?

Copy link
Author

@FreeMasen FreeMasen Feb 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General sort order is being addressed in other tests. I could reduce the number of creates to simplify the test. The basic idea is that we are now including crates that are partial matches. One of the previous comments about this PR was that the change should't impact current ordering so I tried to capture a few pieces of information that go into the current ordering like _ working like white space or description matching superseding this new search. I can remove the non-matching crate and the extra partial match to shorten things up?

@carols10cents
Copy link
Member

Hi @FreeMasen! I'm sorry about how long this has taken to review; thank you for your patience!

This looks great to me, aside from one little thing-- because the query with %s added is only used in the like_name function, and is sort of an implementation detail of the substring matching anyway, I moved the format! into like_name. I went ahead and took care of that because of how long you've been waiting, I didn't want to have to make you wait any more to see this merged!

I forget if I have to wait for travis or if I can tell bors to merge now, let's find out and I'll come back later if this isn't valid:

@bors r+

@bors
Copy link
Contributor

bors commented Apr 24, 2019

📌 Commit 346b573 has been approved by carols10cents

@bors
Copy link
Contributor

bors commented Apr 24, 2019

⌛ Testing commit 346b573 with merge 4bc88e9...

bors added a commit that referenced this pull request Apr 24, 2019
loosen search

To address #159,

It might be worth it to add a new flag for searching with the original behavior.

I have played around with adding a flag to the query params, `strict` to allow the original behavior. It is very much a viable option, I didn't want to make any UI changes w/o first checking in about what it should look like.
@bors
Copy link
Contributor

bors commented Apr 24, 2019

☀️ Test successful - checks-travis
Approved by: carols10cents
Pushing 4bc88e9 to master...

@bors bors merged commit 346b573 into rust-lang:master Apr 24, 2019
@FreeMasen
Copy link
Author

@carols10cents Thank you so much for doing all that!

sgrif added a commit to sgrif/crates.io that referenced this pull request May 14, 2019
A major performance regression was introduced by rust-lang#1560, as it didn't
introduce the appropriate index and the query is now falling back to a
sequential scan. This has caused the query to increase from 12ms on
average to 756ms on average. This introduces the appropriate index,
bringing performance back to its previous levels.

Original Query
---

```
                                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=180.26..180.26 rows=1 width=881) (actual time=0.509..0.509 rows=0 loops=1)
   ->  WindowAgg  (cost=180.26..180.26 rows=1 width=881) (actual time=0.508..0.508 rows=0 loops=1)
         ->  Sort  (cost=180.26..180.26 rows=1 width=873) (actual time=0.506..0.506 rows=0 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'crates_io_test'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 25kB
               ->  Nested Loop Left Join  (cost=174.19..180.25 rows=1 width=873) (actual time=0.502..0.502 rows=0 loops=1)
                     ->  Bitmap Heap Scan on crates  (cost=174.13..176.19 rows=1 width=864) (actual time=0.502..0.502 rows=0 loops=1)
                           Recheck Cond: ((plainto_tsquery('crates-io-test'::text) @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) = 'crates_io_test'::text))
                           ->  BitmapOr  (cost=174.13..174.13 rows=1 width=0) (actual time=0.500..0.500 rows=0 loops=1)
                                 ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..172.05 rows=1 width=0) (actual time=0.475..0.475 rows=0 loops=1)
                                       Index Cond: (plainto_tsquery('crates-io-test'::text) @@ textsearchable_index_col)
                                 ->  Bitmap Index Scan on index_crates_name  (cost=0.00..2.08 rows=1 width=0) (actual time=0.025..0.025 rows=0 loops=1)
                                       Index Cond: (replace(lower((name)::text), '-'::text, '_'::text) = 'crates_io_test'::text)
                     ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..4.06 rows=1 width=12) (never executed)
                           Index Cond: (crate_id = crates.id)
 Planning Time: 0.658 ms
 Execution Time: 0.565 ms
```

New Query w/o this commit
---

```
                                                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4614.84..4614.86 rows=3 width=881) (actual time=225.720..225.720 rows=0 loops=1)
   ->  WindowAgg  (cost=4614.84..4614.86 rows=3 width=881) (actual time=225.719..225.719 rows=0 loops=1)
         ->  Sort  (cost=4614.84..4614.84 rows=3 width=873) (actual time=225.717..225.717 rows=0 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'crates_io_test'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 25kB
               ->  Nested Loop Left Join  (cost=0.06..4614.84 rows=3 width=873) (actual time=225.710..225.711 rows=0 loops=1)
                     ->  Seq Scan on crates  (cost=0.00..4602.65 rows=3 width=864) (actual time=225.709..225.709 rows=0 loops=1)
                           Filter: ((plainto_tsquery('crates-io-test'::text) @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) ~~ 'crates_io_test'::text))
                           Rows Removed by Filter: 25739
                     ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..4.06 rows=1 width=12) (never executed)
                           Index Cond: (crate_id = crates.id)
 Planning Time: 0.536 ms
 Execution Time: 225.774 ms
```

New Query after this commit

```
                                                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=80.32..80.34 rows=3 width=881) (actual time=0.466..0.466 rows=0 loops=1)
   ->  WindowAgg  (cost=80.32..80.34 rows=3 width=881) (actual time=0.465..0.465 rows=0 loops=1)
         ->  Sort  (cost=80.32..80.32 rows=3 width=873) (actual time=0.464..0.464 rows=0 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'crates_io_test'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 25kB
               ->  Nested Loop Left Join  (cost=62.11..80.32 rows=3 width=873) (actual time=0.459..0.459 rows=0 loops=1)
                     ->  Bitmap Heap Scan on crates  (cost=62.05..68.13 rows=3 width=864) (actual time=0.458..0.458 rows=0 loops=1)
                           Recheck Cond: ((plainto_tsquery('crates-io-test'::text) @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) ~~ 'crates_io_test'::text))
                           Rows Removed by Index Recheck: 22
                           Heap Blocks: exact=21
                           ->  BitmapOr  (cost=62.05..62.05 rows=3 width=0) (actual time=0.255..0.255 rows=0 loops=1)
                                 ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..28.05 rows=1 width=0) (actual time=0.103..0.103 rows=0 loops=1)
                                       Index Cond: (plainto_tsquery('crates-io-test'::text) @@ textsearchable_index_col)
                                 ->  Bitmap Index Scan on sgrif_testing  (cost=0.00..34.00 rows=3 width=0) (actual time=0.151..0.151 rows=22 loops=1)
                                       Index Cond: (replace(lower((name)::text), '-'::text, '_'::text) ~~ 'crates_io_test'::text)
                     ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..4.06 rows=1 width=12) (never executed)
                           Index Cond: (crate_id = crates.id)
 Planning Time: 0.476 ms
 Execution Time: 0.528 ms
```
bors added a commit that referenced this pull request May 19, 2019
Fix performance regression on crate search

A major performance regression was introduced by #1560, as it didn't
introduce the appropriate index and the query is now falling back to a
sequential scan. This has caused the query to increase from 12ms on
average to 756ms on average. This introduces the appropriate index,
bringing performance back to its previous levels.

Original Query
---

```
                                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=180.26..180.26 rows=1 width=881) (actual time=0.509..0.509 rows=0 loops=1)
   ->  WindowAgg  (cost=180.26..180.26 rows=1 width=881) (actual time=0.508..0.508 rows=0 loops=1)
         ->  Sort  (cost=180.26..180.26 rows=1 width=873) (actual time=0.506..0.506 rows=0 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'crates_io_test'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 25kB
               ->  Nested Loop Left Join  (cost=174.19..180.25 rows=1 width=873) (actual time=0.502..0.502 rows=0 loops=1)
                     ->  Bitmap Heap Scan on crates  (cost=174.13..176.19 rows=1 width=864) (actual time=0.502..0.502 rows=0 loops=1)
                           Recheck Cond: ((plainto_tsquery('crates-io-test'::text) @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) = 'crates_io_test'::text))
                           ->  BitmapOr  (cost=174.13..174.13 rows=1 width=0) (actual time=0.500..0.500 rows=0 loops=1)
                                 ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..172.05 rows=1 width=0) (actual time=0.475..0.475 rows=0 loops=1)
                                       Index Cond: (plainto_tsquery('crates-io-test'::text) @@ textsearchable_index_col)
                                 ->  Bitmap Index Scan on index_crates_name  (cost=0.00..2.08 rows=1 width=0) (actual time=0.025..0.025 rows=0 loops=1)
                                       Index Cond: (replace(lower((name)::text), '-'::text, '_'::text) = 'crates_io_test'::text)
                     ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..4.06 rows=1 width=12) (never executed)
                           Index Cond: (crate_id = crates.id)
 Planning Time: 0.658 ms
 Execution Time: 0.565 ms
```

New Query w/o this commit
---

```
                                                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4614.84..4614.86 rows=3 width=881) (actual time=225.720..225.720 rows=0 loops=1)
   ->  WindowAgg  (cost=4614.84..4614.86 rows=3 width=881) (actual time=225.719..225.719 rows=0 loops=1)
         ->  Sort  (cost=4614.84..4614.84 rows=3 width=873) (actual time=225.717..225.717 rows=0 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'crates_io_test'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 25kB
               ->  Nested Loop Left Join  (cost=0.06..4614.84 rows=3 width=873) (actual time=225.710..225.711 rows=0 loops=1)
                     ->  Seq Scan on crates  (cost=0.00..4602.65 rows=3 width=864) (actual time=225.709..225.709 rows=0 loops=1)
                           Filter: ((plainto_tsquery('crates-io-test'::text) @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) ~~ 'crates_io_test'::text))
                           Rows Removed by Filter: 25739
                     ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..4.06 rows=1 width=12) (never executed)
                           Index Cond: (crate_id = crates.id)
 Planning Time: 0.536 ms
 Execution Time: 225.774 ms
```

New Query after this commit

```
                                                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=80.32..80.34 rows=3 width=881) (actual time=0.466..0.466 rows=0 loops=1)
   ->  WindowAgg  (cost=80.32..80.34 rows=3 width=881) (actual time=0.465..0.465 rows=0 loops=1)
         ->  Sort  (cost=80.32..80.32 rows=3 width=873) (actual time=0.464..0.464 rows=0 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'crates_io_test'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 25kB
               ->  Nested Loop Left Join  (cost=62.11..80.32 rows=3 width=873) (actual time=0.459..0.459 rows=0 loops=1)
                     ->  Bitmap Heap Scan on crates  (cost=62.05..68.13 rows=3 width=864) (actual time=0.458..0.458 rows=0 loops=1)
                           Recheck Cond: ((plainto_tsquery('crates-io-test'::text) @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) ~~ 'crates_io_test'::text))
                           Rows Removed by Index Recheck: 22
                           Heap Blocks: exact=21
                           ->  BitmapOr  (cost=62.05..62.05 rows=3 width=0) (actual time=0.255..0.255 rows=0 loops=1)
                                 ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..28.05 rows=1 width=0) (actual time=0.103..0.103 rows=0 loops=1)
                                       Index Cond: (plainto_tsquery('crates-io-test'::text) @@ textsearchable_index_col)
                                 ->  Bitmap Index Scan on sgrif_testing  (cost=0.00..34.00 rows=3 width=0) (actual time=0.151..0.151 rows=22 loops=1)
                                       Index Cond: (replace(lower((name)::text), '-'::text, '_'::text) ~~ 'crates_io_test'::text)
                     ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..4.06 rows=1 width=12) (never executed)
                           Index Cond: (crate_id = crates.id)
 Planning Time: 0.476 ms
 Execution Time: 0.528 ms
```
sgrif added a commit to sgrif/crates.io that referenced this pull request May 20, 2019
Even after rust-lang#1746, we're still seeing a performance issue with search in
production. Now it's limited to searches that are a single letter, or
short 2 letter words like 'do'. It's caused by any search that would
cause PG to warn that the query contains only stopwords. It appears the
code path taken when `plainto_tsquery` returns an empty query is
substantially slower than it would be otherwise, even if the query
contains stopwords.

The reason this has started causing problems now is that rust-lang#1560 caused
the query to change from performing a nested loop join to a hash join.
Due to what appears to be a bug in PG, `plainto_tsquery` is getting
called once per row when a hash join is performed. When the query is
passed as the only argument, the function is declared as `STABLE`,
meaning that within a single statement it will always return the same
result for the same arguments, so PG should only be calling it once (or
at least only a handful of times).

There's a second form available where you explicitly pass the language
as an argument. This form is marked as `IMMUTABLE`, so the query planner
will just replace the call to the function with its results.

Unfortunately, PG is picky about how we pass the language. It doesn't
consider a cast from `text` to `regconfig` to be `IMMUTABLE`, only
`STABLE` (which is valid, since it's based on a `pg_catalog` lookup --
The fact that it accepts a string literal as `IMMUTABLE` actually seems
wrong). The actual value is the OID of the row in `pg_ts_config`, which
is *not* stable. Since `regconfig('english'::text)` is not considered
`IMMUTABLE`, we just need to embed it as a string literal instead.
bors added a commit that referenced this pull request May 21, 2019
Further address performance regression in search

Even after #1746, we're still seeing a performance issue with search in
production. Now it's limited to searches that are a single letter, or
short 2 letter words like 'do'. It's caused by any search that would
cause PG to warn that the query contains only stopwords. It appears the
code path taken when `plainto_tsquery` returns an empty query is
substantially slower than it would be otherwise, even if the query
contains stopwords.

The reason this has started causing problems now is that #1560 caused
the query to change from performing a nested loop join to a hash join.
Due to what appears to be a bug in PG, `plainto_tsquery` is getting
called once per row when a hash join is performed. When the query is
passed as the only argument, the function is declared as `STABLE`,
meaning that within a single statement it will always return the same
result for the same arguments, so PG should only be calling it once (or
at least only a handful of times).

There's a second form available where you explicitly pass the language
as an argument. This form is marked as `IMMUTABLE`, so the query planner
will just replace the call to the function with its results.

Unfortunately, PG is picky about how we pass the language. It doesn't
consider a cast from `text` to `regconfig` to be `IMMUTABLE`, only
`STABLE` (which is valid, since it's based on a `pg_catalog` lookup --
The fact that it accepts a string literal as `IMMUTABLE` actually seems
wrong). The actual value is the OID of the row in `pg_ts_config`, which
is *not* stable. Since `regconfig('english'::text)` is not considered
`IMMUTABLE`, we just need to embed it as a string literal instead.
sgrif added a commit to sgrif/crates.io that referenced this pull request May 22, 2019
Since rust-lang#1560 we've had various performance problems with search. While
we've addressed most of the biggest issues, we still have poor
performance when the search term is 2 characters or longer (between
100-200ms at the time of writing this commit, though there is a PR being
merged which can regress this another 30%).

This level of performance isn't the end of the world, but the majority
of our 1 or 2 character searches come from a handful of crawlers which
hit us very frequently, and I'd prefer not to spend this much time on
queries for crawlers. (There's quite a few of these, they all do things
like search for 's', then 'se', then 'ser', etc over and over again --
User Agent only identifies the HTTP lib they used, which varies).

The performance issues come from two places. The first problem is that
we can't use our indexes when the search term is 2 characters, due to
how trigrams work. This means that we fall back to doing a sequential
scan of the entire table, which will only get worse as time goes on. For
single letter searches, the second issue comes from the sheer number of
rows we get back, which have to go into an expensive hash join.

If you search for 'a', you get back 13k results. At the end of the day,
getting every crate with 'a' in its name is not useful, so I've tried to
go with a solution that both improves our performance and also return
more useful results. The operator I've used is meant to return whether
any words are sufficiently similar, but since our search term is shorter
than a trigram the behavior is a little different. For 2 letter
searches, it ends up being "any word begins with the term", and for 1
letter searches, it's "any word is equal to the term". Here are some
example results for "do" and "a"

```
 do
 afi_docf
 alice-download
 async_docker
 avocado_derive
 cargo-build-docker
 cargo_crates-io_docs-rs_test
 cargo_crates-io_docs-rs_test2
 cargo-do
 cargo-doc-coverage
 cargo-docgen
 cargo-dock
 cargo-docker
 cargo-docker-builder
 cargo-docserve
 cargo-docserver
 cargo-doctor
 cargo-download
 cargo-external-doc
 cargo-pack-docker
 cargo-serve-doc
 devboat-docker
 doapi
 do-async
 doc
 doc_9_testing_12345
 docbase_io
 doc-cfg
 doc-comment
 doccy
 doc_file
 docker
 docker4rs
```

```
 a
 a-range
 cortex-a
 jacques_a_dit
 magic-number-a
 manish_this_is_a_test
 poke-a-mango
 vmx-just-a-test-001-maincrate
 wasm-bindgen-test-crate-a
```

Drawbacks
---

The original motivation for switching to `LIKE` in search was to make
sure libssh shows up when searching for ssh. This will regress that for
any `lib*` crates with less than 2 letter names. There aren't very many
of these:

- lib
- libc
- libcw
- libdw
- libgo
- libjp
- libm
- libnv
- libpm
- libr
- libs
- libsm
- libxm

I'm less concerned about the single letter cases, as those are already
going to be buried on page 87, but a few of the 2 letter cases you might
legitimately search for. None of these crates have high traffic, and
fixing this generally isn't really possible without introducing some
special case indexes *only* for this case. We could also work around
this by always searching for "lib*" in addition to whatever you searched
for.

This also means that searching for `a` will no longer include the crate
`a1`. I'm not as concerned about this, if you want all crates starting
with the letter a, we already have `/crates?letter=a` for that.

With this change, our performance should be back to reasonable levels
for all search terms.

Before
--

```
                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3895.90..3896.04 rows=20 width=882) (actual time=164.823..164.838 rows=20 loops=1)
   ->  WindowAgg  (cost=3895.90..3985.84 rows=12848 width=882) (actual time=164.821..164.832 rows=20 loops=1)
         ->  Sort  (cost=3895.90..3902.32 rows=12848 width=874) (actual time=155.012..156.425 rows=12599 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'a'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 9996kB
               ->  Hash Right Join  (cost=3410.76..3720.54 rows=12848 width=874) (actual time=95.457..116.592 rows=12599 loops=1)
                     Hash Cond: (recent_crate_downloads.crate_id = crates.id)
                     ->  Seq Scan on recent_crate_downloads  (cost=0.00..276.87 rows=25958 width=12) (actual time=0.012..2.753 rows=25958 loops=1)
                     ->  Hash  (cost=3365.79..3365.79 rows=12848 width=865) (actual time=95.417..95.417 rows=12599 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 6985kB
                           ->  Seq Scan on crates  (cost=0.00..3365.79 rows=12848 width=865) (actual time=0.015..85.416 rows=12599 loops=1)
                                 Filter: ((''::tsquery @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) ~~ '%a%'::text))
                                 Rows Removed by Filter: 13359
 Planning Time: 0.555 ms
 Execution Time: 165.998 ms
```

After
--

```
                                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=159.61..159.75 rows=20 width=882) (actual time=0.321..0.325 rows=9 loops=1)
   ->  WindowAgg  (cost=159.61..159.80 rows=26 width=882) (actual time=0.320..0.324 rows=9 loops=1)
         ->  Sort  (cost=159.61..159.63 rows=26 width=874) (actual time=0.310..0.311 rows=9 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'a'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 30kB
               ->  Nested Loop Left Join  (cost=10.10..159.49 rows=26 width=874) (actual time=0.076..0.288 rows=9 loops=1)
                     ->  Bitmap Heap Scan on crates  (cost=10.04..59.84 rows=26 width=865) (actual time=0.057..0.196 rows=9 loops=1)
                           Recheck Cond: ((''::tsquery @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) %> 'a'::text))
                           Heap Blocks: exact=9
                           ->  BitmapOr  (cost=10.04..10.04 rows=26 width=0) (actual time=0.046..0.046 rows=0 loops=1)
                                 ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..0.00 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
                                       Index Cond: (''::tsquery @@ textsearchable_index_col)
                                 ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..10.04 rows=26 width=0) (actual time=0.044..0.044 rows=9 loops=1)
                                       Index Cond: (replace(lower((name)::text), '-'::text, '_'::text) %> 'a'::text)
                     ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..3.83 rows=1 width=12) (actual time=0.008..0.008 rows=1 loops=9)
                           Index Cond: (crate_id = crates.id)
```
bors added a commit that referenced this pull request May 22, 2019
…tgeibel

Use a stricter form of search for extremely short queries

Since #1560 we've had various performance problems with search. While
we've addressed most of the biggest issues, we still have poor
performance when the search term is 2 characters or longer (between
100-200ms at the time of writing this commit, though there is a PR being
merged which can regress this another 30%).

This level of performance isn't the end of the world, but the majority
of our 1 or 2 character searches come from a handful of crawlers which
hit us very frequently, and I'd prefer not to spend this much time on
queries for crawlers. (There's quite a few of these, they all do things
like search for 's', then 'se', then 'ser', etc over and over again --
User Agent only identifies the HTTP lib they used, which varies).

The performance issues come from two places. The first problem is that
we can't use our indexes when the search term is 2 characters, due to
how trigrams work. This means that we fall back to doing a sequential
scan of the entire table, which will only get worse as time goes on. For
single letter searches, the second issue comes from the sheer number of
rows we get back, which have to go into an expensive hash join.

If you search for 'a', you get back 13k results. At the end of the day,
getting every crate with 'a' in its name is not useful, so I've tried to
go with a solution that both improves our performance and also return
more useful results. The operator I've used is meant to return whether
any words are sufficiently similar, but since our search term is shorter
than a trigram the behavior is a little different. For 2 letter
searches, it ends up being "any word begins with the term", and for 1
letter searches, it's "any word is equal to the term". Here are some
example results for "do" and "a"

```
 do
 afi_docf
 alice-download
 async_docker
 avocado_derive
 cargo-build-docker
 cargo_crates-io_docs-rs_test
 cargo_crates-io_docs-rs_test2
 cargo-do
 cargo-doc-coverage
 cargo-docgen
 cargo-dock
 cargo-docker
 cargo-docker-builder
 cargo-docserve
 cargo-docserver
 cargo-doctor
 cargo-download
 cargo-external-doc
 cargo-pack-docker
 cargo-serve-doc
 devboat-docker
 doapi
 do-async
 doc
 doc_9_testing_12345
 docbase_io
 doc-cfg
 doc-comment
 doccy
 doc_file
 docker
 docker4rs
```

```
 a
 a-range
 cortex-a
 jacques_a_dit
 magic-number-a
 manish_this_is_a_test
 poke-a-mango
 vmx-just-a-test-001-maincrate
 wasm-bindgen-test-crate-a
```

Drawbacks
---

The original motivation for switching to `LIKE` in search was to make
sure libssh shows up when searching for ssh. This will regress that for
any `lib*` crates with less than 2 letter names. There aren't very many
of these:

- lib
- libc
- libcw
- libdw
- libgo
- libjp
- libm
- libnv
- libpm
- libr
- libs
- libsm
- libxm

I'm less concerned about the single letter cases, as those are already
going to be buried on page 87, but a few of the 2 letter cases you might
legitimately search for. None of these crates have high traffic, and
fixing this generally isn't really possible without introducing some
special case indexes *only* for this case. We could also work around
this by always searching for "lib*" in addition to whatever you searched
for.

This also means that searching for `a` will no longer include the crate
`a1`. I'm not as concerned about this, if you want all crates starting
with the letter a, we already have `/crates?letter=a` for that.

With this change, our performance should be back to reasonable levels
for all search terms.

Before
--

```
                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3895.90..3896.04 rows=20 width=882) (actual time=164.823..164.838 rows=20 loops=1)
   ->  WindowAgg  (cost=3895.90..3985.84 rows=12848 width=882) (actual time=164.821..164.832 rows=20 loops=1)
         ->  Sort  (cost=3895.90..3902.32 rows=12848 width=874) (actual time=155.012..156.425 rows=12599 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'a'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 9996kB
               ->  Hash Right Join  (cost=3410.76..3720.54 rows=12848 width=874) (actual time=95.457..116.592 rows=12599 loops=1)
                     Hash Cond: (recent_crate_downloads.crate_id = crates.id)
                     ->  Seq Scan on recent_crate_downloads  (cost=0.00..276.87 rows=25958 width=12) (actual time=0.012..2.753 rows=25958 loops=1)
                     ->  Hash  (cost=3365.79..3365.79 rows=12848 width=865) (actual time=95.417..95.417 rows=12599 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 6985kB
                           ->  Seq Scan on crates  (cost=0.00..3365.79 rows=12848 width=865) (actual time=0.015..85.416 rows=12599 loops=1)
                                 Filter: ((''::tsquery @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) ~~ '%a%'::text))
                                 Rows Removed by Filter: 13359
 Planning Time: 0.555 ms
 Execution Time: 165.998 ms
```

After
--

```
                                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=159.61..159.75 rows=20 width=882) (actual time=0.321..0.325 rows=9 loops=1)
   ->  WindowAgg  (cost=159.61..159.80 rows=26 width=882) (actual time=0.320..0.324 rows=9 loops=1)
         ->  Sort  (cost=159.61..159.63 rows=26 width=874) (actual time=0.310..0.311 rows=9 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'a'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 30kB
               ->  Nested Loop Left Join  (cost=10.10..159.49 rows=26 width=874) (actual time=0.076..0.288 rows=9 loops=1)
                     ->  Bitmap Heap Scan on crates  (cost=10.04..59.84 rows=26 width=865) (actual time=0.057..0.196 rows=9 loops=1)
                           Recheck Cond: ((''::tsquery @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) %> 'a'::text))
                           Heap Blocks: exact=9
                           ->  BitmapOr  (cost=10.04..10.04 rows=26 width=0) (actual time=0.046..0.046 rows=0 loops=1)
                                 ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..0.00 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
                                       Index Cond: (''::tsquery @@ textsearchable_index_col)
                                 ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..10.04 rows=26 width=0) (actual time=0.044..0.044 rows=9 loops=1)
                                       Index Cond: (replace(lower((name)::text), '-'::text, '_'::text) %> 'a'::text)
                     ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..3.83 rows=1 width=12) (actual time=0.008..0.008 rows=1 loops=9)
                           Index Cond: (crate_id = crates.id)
 Planning Time: 0.553 ms
 Execution Time: 0.386 ms
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants