
Search should split crate names on Underscore when indexing #1549


Closed
behnam opened this issue Nov 4, 2018 · 3 comments

Comments

@behnam
Contributor

behnam commented Nov 4, 2018

Looking at the search results for static (https://crates.io/search?q=static), you can see that the term is not matching lazy_static's crate name, which down-ranks the crate to the end of the page, although it's by far the most popular static-related crate.

Looking at the search results for lazy_static (https://crates.io/search?q=lazy_static), it looks like we already split the search terms on underscores.

I think we should do the same, splitting on underscores, when indexing crate names, so that the query static better matches lazy_static.

WDYT?

@kzys
Contributor

kzys commented Apr 16, 2019

@sgrif @FreeMasen Wouldn't this be covered by #1560?

@carols10cents
Member

I think this will be covered by #1560, which I just queued up to be merged. I'll verify this is fixed when #1560 gets deployed, just in case, and I'll close this issue then.

@sgrif
Contributor

sgrif commented May 13, 2019

#1560 doesn't really affect this, as it just affects what's included in the results at all -- lazy_static was already in the results.

> I think we should do the same, splitting on Underscore, when indexing crate names, so static query better matches lazy_static.

Splitting by underscores also won't do anything here; PG already handles underscores the way you'd expect.


Sorting by relevance uses the PG full text search ranking functions with the weights we provided (name is A, keywords is B, description is C, readme is D). We don't override the numerical weights used, so A, B, C, and D get weighted as 1.0, 0.4, 0.2, and 0.1 respectively. You can read the details of how the matching works at https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING, but the short version is that it will rank based on how frequently the search term appears, and how far apart the appearances of those terms are.
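To make the weighting concrete, here is a rough Python sketch. It is not crates.io code and it is not PostgreSQL's actual ranking formula; it only sums the default weight of each hit. The names `FIELD_LABEL`, `DEFAULT_WEIGHT`, and `naive_score` are made up for illustration.

```python
# Illustration only: a naive "sum of weights" score, NOT PostgreSQL's ts_rank.
# Which field gets which label follows the thread; the numbers are PG's defaults.
FIELD_LABEL = {"name": "A", "keywords": "B", "description": "C", "readme": "D"}
DEFAULT_WEIGHT = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}


def naive_score(hits_per_field):
    """Sum the default weight of every hit.

    hits_per_field maps a field name to how often the term appears in it.
    """
    return sum(
        DEFAULT_WEIGHT[FIELD_LABEL[field]] * count
        for field, count in hits_per_field.items()
    )


# Hypothetical hit counts, chosen only to show the effect of the weights.
one_name_hit = naive_score({"name": 1})
many_readme_hits = naive_score({"readme": 20})
print(one_name_hit, many_readme_hits)
```

In this naive model a name hit counts ten times as much as a README hit, but a README that repeats the term often enough can still come out ahead.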

The issue here isn't that we're doing anything to make the lazy_static crate match the term static poorly. It's properly indexed as 'static':2A,5B,12C,22,33,190,211,217,218,223 -- once in the name, once in the keywords, once in the description, 7 times in the readme (if you ctrl+f "static" you'll see 9 results in the README, but the 2 lazy-static.rs appearances are considered their own word since it's a URL).

Compare this to the first result, which has this in its index: 'static':3A,7B,13C,18C,31,47,52,103,108,158,163,168,173,207,212,228,235,237,243,269,276,278,313,318,326,329,338,341,356,363,365,372,393,398,423,430,432,441,452,490,495,530,537,539,577,584,586,660,667,669,674,704,730,742,851. If you look at its README, you'll see that it's just much more densely packed with the term static throughout than lazy_static is.
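The two index entries above can be compared mechanically. The following Python sketch parses the entry text exactly as quoted in this comment and counts positions per weight label; `count_by_weight` is a hypothetical helper, not something crates.io or PostgreSQL provides. Positions without a letter are counted as D, since D is the default weight.

```python
import re
from collections import Counter


def count_by_weight(entry):
    """Count positions per weight label in a string like "'static':2A,5B,22"."""
    _, _, positions = entry.partition(":")
    counts = Counter()
    for token in positions.split(","):
        match = re.fullmatch(r"(\d+)([ABCD]?)", token.strip())
        if match is None:
            raise ValueError(f"unexpected position token: {token!r}")
        # Positions without a letter carry the default weight D.
        counts[match.group(2) or "D"] += 1
    return counts


lazy_static = "'static':2A,5B,12C,22,33,190,211,217,218,223"
first_result = (
    "'static':3A,7B,13C,18C,31,47,52,103,108,158,163,168,173,207,212,228,"
    "235,237,243,269,276,278,313,318,326,329,338,341,356,363,365,372,393,"
    "398,423,430,432,441,452,490,495,530,537,539,577,584,586,660,667,669,"
    "674,704,730,742,851"
)

for label, entry in (("lazy_static", lazy_static), ("first result", first_result)):
    print(label, dict(count_by_weight(entry)))
```

Both crates have one hit in the name and one in the keywords; the difference is almost entirely in the unlabelled README positions.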

We could play with the normalization options, but most of them just push lazy_static back further, since they generally cause crates with short READMEs like https://crates.io/crates/graphy_static to rank higher. Ultimately the root problem here is that lazy_static isn't really well optimized for the term static, but other crates are.
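As a rough illustration of why length normalization favors short READMEs, here is a Python sketch of two of the options as the PostgreSQL docs describe them (1 divides the rank by 1 + the logarithm of the document length, 2 divides it by the document length). The `normalize` helper, the use of the natural logarithm, and all ranks and lengths below are assumptions invented for the example, not measurements from crates.io.

```python
import math


def normalize(rank, doc_length, option):
    """Apply a length normalization to a raw rank (simplified sketch)."""
    if option == 0:
        return rank  # ignore document length
    if option == 1:
        # Assumption: natural log; the exact base PostgreSQL uses may differ.
        return rank / (1 + math.log(doc_length))
    if option == 2:
        return rank / doc_length
    raise ValueError(f"option {option} is not covered by this sketch")


# Hypothetical crates: (raw rank, document length in words).
short_readme = (0.3, 40)
long_readme = (0.6, 900)

for option in (0, 1, 2):
    short_score = normalize(*short_readme, option)
    long_score = normalize(*long_readme, option)
    winner = "short" if short_score > long_score else "long"
    print(option, winner)
```

With these made-up numbers the long document wins when length is ignored, while dividing by document length flips the order in favor of the short one.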

> down-ranking the crate to the end of the page, although it's the most popular static-related crate by far.

Sorting by "popularity" is a completely different topic, and a much more difficult one to solve. I won't go into all the ideas/problems again here, but ultimately the "relevance" sorting seems to be working as intended. At the very least there's nothing to be gained by splitting on underscores, so I'm going to close this.

@sgrif sgrif closed this as completed May 13, 2019
4 participants