Use a more sane tokenizer for source code search #32220
Comments
Hey guys, an update on this issue. At my workplace, we run a Gitea instance with about 3K repositories. Our L2 and L3 support teams (about 200 people) rely heavily on Gitea's code search feature. They asked me whether I could make the code search case insensitive. I was thinking of allowing this kind of search; in that case, case-insensitive matches would rank lower than matches where the case matches exactly. What do you guys think? Is this worthwhile to have in Gitea's main line? If you're cool with it, I'll change PR #32261 to handle it as well.
I think it's already case insensitive, at least for the bleve engine. Maybe you mean case sensitive? If so, maybe we could have an option or filter for that. It looks like GitHub code search doesn't support case-sensitive search.
Yeah, as a matter of fact, in bleve's case it is indeed already case insensitive (as the screenshot below shows). However, in the ES backend the content field is not normalized, so for ES (which is our case) it is indeed case sensitive (as shown below). If it is OK with you guys, I'll update #32261 to make the ES content field case insensitive, like bleve's. What do you think? P.S. Another option would be to create another PR just for this fix, but I think that rebuilding the index might be too much for a patch version.
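For what it's worth, here is a minimal sketch of what a lowercased content mapping could look like on the ES side. The analyzer and field names are illustrative, not necessarily the ones Gitea actually uses:

```go
// Illustrative only: an ES index mapping with a custom analyzer that adds
// a lowercase filter, so queries against the "content" field match
// case-insensitively.
const contentMapping = `{
  "settings": {
    "analysis": {
      "analyzer": {
        "content_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "content_analyzer"
      }
    }
  }
}`
```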
…matching when search for code snippets (go-gitea#32261) This PR improves the accuracy of Gitea's code search. Currently, Gitea does not consider statements such as `console.log("hello")` as hits when the user searches for `log`. The culprit is how both ES and Bleve tokenize the file contents (in both cases, `console.log` is a single token). In ES's case, we changed the tokenizer to [simple_pattern_split](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html); with it, tokens are runs of digits and letters. In Bleve's case, it now employs a [letter](https://blevesearch.com/docs/Tokenizers/) tokenizer. Resolves go-gitea#32220 --------- Signed-off-by: Bruno Sofiato <[email protected]>
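On the Bleve side, a custom analyzer along these lines can be registered through Bleve's standard mapping API. This is a sketch; the analyzer name `code_analyzer` is illustrative:

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/letter"
)

func main() {
	// Sketch: a custom analyzer that tokenizes on runs of letters and
	// lowercases the result, so "console.log" yields the tokens
	// "console" and "log".
	indexMapping := bleve.NewIndexMapping()
	err := indexMapping.AddCustomAnalyzer("code_analyzer", map[string]interface{}{
		"type":          custom.Name,
		"tokenizer":     letter.Name,
		"token_filters": []string{lowercase.Name},
	})
	if err != nil {
		log.Fatal(err)
	}
	indexMapping.DefaultAnalyzer = "code_analyzer"
}
```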
Feature Description
As of today, the Elasticsearch backend uses the default analyzer when indexing source code contents. This implementation uses whitespace to break the tokens.
I feel this approach is not particularly suitable for source code search. To illustrate the issue, let us consider the code snippet below:
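```go
// Illustrative snippet (the original was not preserved): a search for
// "bar" ought to match this file.
func example() {
	foo.bar()
}
```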
It is fair to expect that searching for `bar` returns the code above. As of today, however, this is not the case: ES treats `foo.bar()` as a single token, so it will not match the criterion `bar`.

I suggest we use the pattern tokenizer instead. It uses a regular expression to separate tokens; by default, any non-word character acts as a token separator. With it, the snippet `foo.bar()` would yield two tokens, `foo` and `bar`, and the second token would match the given criterion.

What do you guys think?
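To make the expected tokenization concrete, here is a small sketch in plain Go (not ES configuration) that mimics the pattern tokenizer's default behavior of splitting on non-word characters:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Mimic the ES pattern tokenizer's default: split on any run of
	// non-word characters (\W+).
	nonWord := regexp.MustCompile(`\W+`)
	raw := nonWord.Split("foo.bar()", -1)

	// Discard the empty strings that trailing separators leave behind.
	tokens := raw[:0]
	for _, t := range raw {
		if t != "" {
			tokens = append(tokens, t)
		}
	}
	fmt.Println(tokens) // Output: [foo bar]
}
```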