Skip to content

Commit 36893f6

Browse files
committed
Use a stricter form of search for extremely short queries
Since #1560 we've had various performance problems with search. While we've addressed most of the biggest issues, we still have poor performance when the search term is 2 characters or longer (between 100-200ms at the time of writing this commit, though there is a PR being merged which can regress this another 30%). This level of performance isn't the end of the world, but the majority of our 1 or 2 character searches come from a handful of crawlers which hit us very frequently, and I'd prefer not to spend this much time on queries for crawlers. (There's quite a few of these, they all do things like search for 's', then 'se', then 'ser', etc over and over again -- User Agent only identifies the HTTP lib they used, which varies). The performance issues come from two places. The first problem is that we can't use our indexes when the search term is 2 characters, due to how trigrams work. This means that we fall back to doing a sequential scan of the entire table, which will only get worse as time goes on. For single letter searches, the second issue comes from the sheer number of rows we get back, which have to go into an expensive hash join. If you search for 'a', you get back 13k results. At the end of the day, getting every crate with 'a' in its name is not useful, so I've tried to go with a solution that both improves our performance and also return more useful results. The operator I've used is meant to return whether any words are sufficiently similar, but since our search term is shorter than a trigram the behavior is a little different. For 2 letter searches, it ends up being "any word begins with the term", and for 1 letter searches, it's "any word is equal to the term". Here are some example results for "do" and "a" ``` do afi_docf alice-download async_docker avocado_derive cargo-build-docker cargo_crates-io_docs-rs_test cargo_crates-io_docs-rs_test2 cargo-do cargo-doc-coverage cargo-docgen cargo-dock cargo-docker cargo-docker-builder cargo-docserve cargo-docserver cargo-doctor cargo-download cargo-external-doc cargo-pack-docker cargo-serve-doc devboat-docker doapi do-async doc doc_9_testing_12345 docbase_io doc-cfg doc-comment doccy doc_file docker docker4rs ``` ``` a a-range cortex-a jacques_a_dit magic-number-a manish_this_is_a_test poke-a-mango vmx-just-a-test-001-maincrate wasm-bindgen-test-crate-a ``` Drawbacks --- The original motivation for switching to `LIKE` in search was to make sure libssh shows up when searching for ssh. This will regress that for any `lib*` crates with less than 2 letter names. There aren't very many of these: - lib - libc - libcw - libdw - libgo - libjp - libm - libnv - libpm - libr - libs - libsm - libxm I'm less concerned about the single letter cases, as those are already going to be buried on page 87, but a few of the 2 letter cases you might legitimately search for. None of these crates have high traffic, and fixing this generally isn't really possible without introducing some special case indexes *only* for this case. We could also work around this by always searching for "lib*" in addition to whatever you searched for. This also means that searching for `a` will no longer include the crate `a1`. I'm not as concerned about this, if you want all crates starting with the letter a, we already have `/crates?letter=a` for that. With this change, our performance should be back to reasonable levels for all search terms. Before -- ``` QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=3895.90..3896.04 rows=20 width=882) (actual time=164.823..164.838 rows=20 loops=1) -> WindowAgg (cost=3895.90..3985.84 rows=12848 width=882) (actual time=164.821..164.832 rows=20 loops=1) -> Sort (cost=3895.90..3902.32 rows=12848 width=874) (actual time=155.012..156.425 rows=12599 loops=1) Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'a'::text)) DESC, crates.name Sort Method: quicksort Memory: 9996kB -> Hash Right Join (cost=3410.76..3720.54 rows=12848 width=874) (actual time=95.457..116.592 rows=12599 loops=1) Hash Cond: (recent_crate_downloads.crate_id = crates.id) -> Seq Scan on recent_crate_downloads (cost=0.00..276.87 rows=25958 width=12) (actual time=0.012..2.753 rows=25958 loops=1) -> Hash (cost=3365.79..3365.79 rows=12848 width=865) (actual time=95.417..95.417 rows=12599 loops=1) Buckets: 16384 Batches: 1 Memory Usage: 6985kB -> Seq Scan on crates (cost=0.00..3365.79 rows=12848 width=865) (actual time=0.015..85.416 rows=12599 loops=1) Filter: ((''::tsquery @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) ~~ '%a%'::text)) Rows Removed by Filter: 13359 Planning Time: 0.555 ms Execution Time: 165.998 ms ``` After -- ``` QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=159.61..159.75 rows=20 width=882) (actual time=0.321..0.325 rows=9 loops=1) -> WindowAgg (cost=159.61..159.80 rows=26 width=882) (actual time=0.320..0.324 rows=9 loops=1) -> Sort (cost=159.61..159.63 rows=26 width=874) (actual time=0.310..0.311 rows=9 loops=1) Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'a'::text)) DESC, crates.name Sort Method: quicksort Memory: 30kB -> Nested Loop Left Join (cost=10.10..159.49 rows=26 width=874) (actual time=0.076..0.288 rows=9 loops=1) -> Bitmap Heap Scan on crates (cost=10.04..59.84 rows=26 width=865) (actual time=0.057..0.196 rows=9 loops=1) Recheck Cond: ((''::tsquery @@ textsearchable_index_col) OR (replace(lower((name)::text), '-'::text, '_'::text) %> 'a'::text)) Heap Blocks: exact=9 -> BitmapOr (cost=10.04..10.04 rows=26 width=0) (actual time=0.046..0.046 rows=0 loops=1) -> Bitmap Index Scan on index_crates_name_search (cost=0.00..0.00 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1) Index Cond: (''::tsquery @@ textsearchable_index_col) -> Bitmap Index Scan on index_crates_name_tgrm (cost=0.00..10.04 rows=26 width=0) (actual time=0.044..0.044 rows=9 loops=1) Index Cond: (replace(lower((name)::text), '-'::text, '_'::text) %> 'a'::text) -> Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads (cost=0.06..3.83 rows=1 width=12) (actual time=0.008..0.008 rows=1 loops=9) Index Cond: (crate_id = crates.id) ```
1 parent 16e8d71 commit 36893f6

File tree

3 files changed

+27
-8
lines changed

3 files changed

+27
-8
lines changed

src/controllers/krate/search.rs

+1-1
Original file line numberDiff line numberDiff line change
@@ -66,7 +66,7 @@ pub fn search(req: &mut dyn Request) -> CargoResult<Response> {
6666
query = query.filter(
6767
q.clone()
6868
.matches(crates::textsearchable_index_col)
69-
.or(Crate::like_name(&q_string)),
69+
.or(Crate::loosly_matches_name(&q_string)),
7070
);
7171

7272
query = query.select((

src/models/krate.rs

+22-7
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@ use chrono::NaiveDateTime;
22
use diesel::associations::Identifiable;
33
use diesel::pg::Pg;
44
use diesel::prelude::*;
5+
use diesel::sql_types::Bool;
56
use url::Url;
67

78
use crate::app::App;
@@ -86,8 +87,6 @@ pub const MAX_NAME_LENGTH: usize = 64;
8687
type CanonCrateName<T> = self::canon_crate_name::HelperType<T>;
8788
type All = diesel::dsl::Select<crates::table, AllColumns>;
8889
type WithName<'a> = diesel::dsl::Eq<CanonCrateName<crates::name>, CanonCrateName<&'a str>>;
89-
/// The result of a loose search
90-
type LikeName = diesel::dsl::Like<CanonCrateName<crates::name>, CanonCrateName<String>>;
9190
type ByName<'a> = diesel::dsl::Filter<All, WithName<'a>>;
9291
type ByExactName<'a> = diesel::dsl::Filter<All, diesel::dsl::Eq<crates::name, &'a str>>;
9392

@@ -236,12 +235,28 @@ impl<'a> NewCrate<'a> {
236235
}
237236

238237
impl Crate {
239-
/// SQL filter with the `like` binary operator. Adds wildcards to the beginning and end to get
240-
/// substring matches.
241-
pub fn like_name(name: &str) -> LikeName {
242-
let wildcard_name = format!("%{}%", name);
243-
canon_crate_name(crates::name).like(canon_crate_name(wildcard_name))
238+
/// SQL filter based on whether the crate's name loosly matches the given
239+
/// string.
240+
///
241+
/// The operator used varies based on the input.
242+
pub fn loosly_matches_name<QS>(
243+
name: &str,
244+
) -> Box<dyn BoxableExpression<QS, Pg, SqlType = Bool> + '_>
245+
where
246+
crates::name: SelectableExpression<QS>,
247+
{
248+
if name.len() > 2 {
249+
let wildcard_name = format!("%{}%", name);
250+
Box::new(canon_crate_name(crates::name).like(canon_crate_name(wildcard_name)))
251+
} else {
252+
diesel_infix_operator!(MatchesWord, "%>");
253+
Box::new(MatchesWord::new(
254+
canon_crate_name(crates::name),
255+
name.into_sql::<Text>(),
256+
))
257+
}
244258
}
259+
245260
/// SQL filter with the = binary operator
246261
pub fn with_name(name: &str) -> WithName<'_> {
247262
canon_crate_name(crates::name).eq(canon_crate_name(name))

src/tests/krate.rs

+4
Original file line numberDiff line numberDiff line change
@@ -452,6 +452,10 @@ fn loose_search_order() {
452452
for (lhs, rhs) in search_temp.crates.iter().zip(ordered) {
453453
assert_eq!(lhs.name, rhs.name);
454454
}
455+
456+
let search_temp = anon.search("q=te");
457+
assert_eq!(search_temp.meta.total, 3);
458+
assert_eq!(search_temp.crates.len(), 3);
455459
}
456460

457461
#[test]

0 commit comments

Comments
 (0)