Use a stricter form of search for extremely short queries #1752

Merged
merged 1 commit into from
May 23, 2019

sgrif commented May 22, 2019

Since #1560 we've had various performance problems with search. While
we've addressed most of the biggest issues, we still have poor
performance when the search term is 2 characters or shorter (between
100 and 200 ms at the time of writing this commit, though there is a PR
being merged which can regress this by another 30%).

This level of performance isn't the end of the world, but the majority
of our 1 or 2 character searches come from a handful of crawlers which
hit us very frequently, and I'd prefer not to spend this much time on
queries for crawlers. (There's quite a few of these, they all do things
like search for 's', then 'se', then 'ser', etc over and over again --
User Agent only identifies the HTTP lib they used, which varies).

The performance issues come from two places. The first problem is that
we can't use our indexes when the search term is shorter than 3
characters, due to how trigrams work. This means that we fall back to
doing a sequential scan of the entire table, which will only get worse
as time goes on. For single letter searches, the second issue comes from
the sheer number of rows we get back, which have to go into an expensive
hash join.
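
To make the trigram limitation concrete, here is a minimal Python sketch. It assumes a simplified model in which an index lookup for `LIKE '%term%'` can only use three-character substrings that are guaranteed to appear in every match. `like_trigrams` is a hypothetical helper written for illustration; it is not part of crates.io or pg_trgm.

```python
def like_trigrams(term: str) -> set[str]:
    """Three-character substrings that every match of LIKE '%term%' must contain.

    Simplified model (an assumption of this sketch): the wildcards on both
    sides mean no padding can be added, so only complete trigrams inside
    the term itself are usable for an index lookup.
    """
    return {term[i:i + 3] for i in range(len(term) - 2)}


for term in ["s", "se", "ser", "serde"]:
    found = like_trigrams(term)
    plan = "index lookup possible" if found else "nothing to look up, sequential scan"
    print(f"{term!r}: {sorted(found)} -> {plan}")
```

Under this model a 1 or 2 character term yields no trigrams at all, so there is nothing for the index to narrow down.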

If you search for 'a', you get back 13k results. At the end of the day,
getting every crate with 'a' in its name is not useful, so I've tried to
go with a solution that both improves our performance and also returns
more useful results. The operator I've used is meant to return whether
any words are sufficiently similar, but since our search term is shorter
than a trigram, the behavior is a little different. For 2 letter
searches, it ends up being "any word begins with the term", and for 1
letter searches, it's "any word is equal to the term". Here are some
example results for "do" and "a":

 do
 afi_docf
 alice-download
 async_docker
 avocado_derive
 cargo-build-docker
 cargo_crates-io_docs-rs_test
 cargo_crates-io_docs-rs_test2
 cargo-do
 cargo-doc-coverage
 cargo-docgen
 cargo-dock
 cargo-docker
 cargo-docker-builder
 cargo-docserve
 cargo-docserver
 cargo-doctor
 cargo-download
 cargo-external-doc
 cargo-pack-docker
 cargo-serve-doc
 devboat-docker
 doapi
 do-async
 doc
 doc_9_testing_12345
 docbase_io
 doc-cfg
 doc-comment
 doccy
 doc_file
 docker
 docker4rs

 a
 a-range
 cortex-a
 jacques_a_dit
 magic-number-a
 manish_this_is_a_test
 poke-a-mango
 vmx-just-a-test-001-maincrate
 wasm-bindgen-test-crate-a
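
The prefix and equality behavior described above can be reproduced with a little trigram arithmetic. The sketch below is a simplified, per-word model and not pg_trgm's actual algorithm: it assumes each word is padded with two leading spaces and one trailing space before trigrams are taken, and that a match needs at least 60% of the term's trigrams to be present (pg_trgm's default word similarity threshold). `word_trigrams` and `matches_short_term` are hypothetical names.

```python
import re


def word_trigrams(word: str) -> set[str]:
    """Trigrams of one word, padded with two spaces before and one after."""
    padded = "  " + word + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def matches_short_term(term: str, name: str, threshold: float = 0.6) -> bool:
    """True if some word in `name` shares enough trigrams with `term`."""
    query = word_trigrams(term.lower())
    # Words are runs of letters and digits, so '-' and '_' both act as separators.
    for word in re.findall(r"[a-z0-9]+", name.lower()):
        shared = query & word_trigrams(word)
        if len(shared) / len(query) >= threshold:
            return True
    return False


# "do" has trigrams "  d", " do", "do ": a word starting with "do" shares 2 of 3.
print(matches_short_term("do", "cargo-docker"))  # True
print(matches_short_term("do", "avocado"))       # False, only "do " is shared
# "a" has trigrams "  a", " a ": sharing 1 of 2 is below 0.6, so only the exact word matches.
print(matches_short_term("a", "cortex-a"))       # True
print(matches_short_term("a", "a1"))             # False
```

The real operator scores runs of trigrams that may straddle word boundaries, which is presumably why a name like avocado_derive appears in the "do" results even though neither of its words starts with "do". Treat the per-word rule as a rough model only.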

Drawbacks

The original motivation for switching to LIKE in search was to make
sure libssh shows up when searching for ssh. This will regress that for
any lib* crate whose name after the lib prefix is 2 letters or fewer.
There aren't very many of these:

  • lib
  • libc
  • libcw
  • libdw
  • libgo
  • libjp
  • libm
  • libnv
  • libpm
  • libr
  • libs
  • libsm
  • libxm

I'm less concerned about the single letter cases, as those are already
going to be buried on page 87, but a few of the 2 letter cases you might
legitimately search for. None of these crates have high traffic, and
fixing this generally isn't really possible without introducing some
special case indexes only for this case. We could also work around
this by always searching for "lib*" in addition to whatever you searched
for.

This also means that searching for a will no longer include the crate
a1. I'm not as concerned about this: if you want all crates starting
with the letter a, we already have /crates?letter=a for that.

With this change, our performance should be back to reasonable levels
for all search terms.

Before

                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3895.90..3896.04 rows=20 width=882) (actual time=164.823..164.838 rows=20 loops=1)
   ->  WindowAgg  (cost=3895.90..3985.84 rows=12848 width=882) (actual time=164.821..164.832 rows=20 loops=1)
         ->  Sort  (cost=3895.90..3902.32 rows=12848 width=874) (actual time=155.012..156.425 rows=12599 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'a'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 9996kB
               ->  Hash Right Join  (cost=3410.76..3720.54 rows=12848 width=874) (actual time=95.457..116.592 rows=12599 loops=1)
                     Hash Cond: (recent_crate_downloads.crate_id = crates.id)
                     ->  Seq Scan on recent_crate_downloads  (cost=0.00..276.87 rows=25958 width=12) (actual time=0.012..2.753 rows=25958 loops=1)
                     ->  Hash  (cost=3365.79..3365.79 rows=12848 width=865) (actual time=95.417..95.417 rows=12599 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 6985kB
                           ->  Seq Scan on crates  (cost=0.00..3365.79 rows=12848 width=865) (actual time=0.015..85.416 rows=12599 loops=1)
                                 Filter: ((''::tsquery @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) ~~ '%a%'::text))
                                 Rows Removed by Filter: 13359
 Planning Time: 0.555 ms
 Execution Time: 165.998 ms

After

                                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=159.61..159.75 rows=20 width=882) (actual time=0.321..0.325 rows=9 loops=1)
   ->  WindowAgg  (cost=159.61..159.80 rows=26 width=882) (actual time=0.320..0.324 rows=9 loops=1)
         ->  Sort  (cost=159.61..159.63 rows=26 width=874) (actual time=0.310..0.311 rows=9 loops=1)
               Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'a'::text)) DESC, crates.name
               Sort Method: quicksort  Memory: 30kB
               ->  Nested Loop Left Join  (cost=10.10..159.49 rows=26 width=874) (actual time=0.076..0.288 rows=9 loops=1)
                     ->  Bitmap Heap Scan on crates  (cost=10.04..59.84 rows=26 width=865) (actual time=0.057..0.196 rows=9 loops=1)
                           Recheck Cond: ((''::tsquery @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) %> 'a'::text))
                           Heap Blocks: exact=9
                           ->  BitmapOr  (cost=10.04..10.04 rows=26 width=0) (actual time=0.046..0.046 rows=0 loops=1)
                                 ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..0.00 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
                                       Index Cond: (''::tsquery @@ textsearchable_index_col)
                                 ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..10.04 rows=26 width=0) (actual time=0.044..0.044 rows=9 loops=1)
                                       Index Cond: (replace(lower((name)::text), '-'::text, '_'::text) %> 'a'::text)
                     ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..3.83 rows=1 width=12) (actual time=0.008..0.008 rows=1 loops=9)
                           Index Cond: (crate_id = crates.id)
 Planning Time: 0.553 ms
 Execution Time: 0.386 ms

sgrif commented May 22, 2019

I've finally tracked down where the weird short search queries are coming from: IDE autocompletion. I think these results are fine for that, or we can switch to prefix matching, which will still hit our index. We should probably look into adding a more optimized endpoint for this use case.

@jtgeibel

> It's from IDE autocompletion.

That's interesting. I hadn't noticed before that the RLS did autocompletion of crate names in Cargo.toml. It should be possible for the RLS to do this locally from the index, but these changes look good to me.

@bors r+

bors commented May 22, 2019

📌 Commit 36893f6 has been approved by jtgeibel

bors commented May 22, 2019

⌛ Testing commit 36893f6 with merge 727788d...

bors added a commit that referenced this pull request May 22, 2019
…tgeibel

Use a stricter form of search for extremely short queries

bors commented May 23, 2019

☀️ Test successful - checks-travis
Approved by: jtgeibel
Pushing 727788d to master...

bors merged commit 36893f6 into rust-lang:master May 23, 2019

sgrif commented May 23, 2019 via email

@jtgeibel

> I opened a pr so they send a UA at least

Awesome!

sgrif deleted the sg-stricter-search-for-short-queries branch August 14, 2019 17:28