Use a more sane tokenizer for source code search #32220

Closed
bsofiato opened this issue Oct 8, 2024 · 4 comments · Fixed by #32261
Labels
type/proposal The new feature has not been accepted yet but needs to be discussed first.

Comments

@bsofiato
Contributor

bsofiato commented Oct 8, 2024

Feature Description

As of today, the Elasticsearch search backend uses the default analyzer when indexing source code contents. This analyzer breaks tokens on whitespace.

I feel this approach is not particularly suitable for source code search. To illustrate the issue, let us consider the code snippet below:

```java
public Object baz(Foo foo) {
   return foo.bar();
}
```

It is fair to expect that searching for `bar` would return the code above. As of today, however, this is not the case: ES treats `foo.bar()` as a single token, so it will not match the query `bar`.

I suggest we use the pattern tokenizer instead. It uses a regular expression to split tokens; by default, any run of non-word characters acts as a token separator. With it, the snippet `foo.bar()` would yield two tokens, `foo` and `bar`, and the second token would match the query.
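For illustration, here is a minimal sketch of what such index settings could look like. The analyzer and tokenizer names (`code_analyzer`, `code_tokenizer`) are made up for this example, and this is not Gitea's actual ES mapping:

```go
package main

import "fmt"

// Hypothetical ES index settings illustrating the proposal: a "pattern"
// tokenizer splits on runs of non-word characters (\W+ is its default
// pattern), so `foo.bar()` is indexed as the tokens "foo" and "bar".
const proposedCodeSettings = `{
  "settings": {
    "analysis": {
      "tokenizer": {
        "code_tokenizer": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "code_analyzer": {
          "type": "custom",
          "tokenizer": "code_tokenizer"
        }
      }
    }
  }
}`

func main() {
	// Print the settings; in practice they would be sent to ES when the
	// code-search index is created.
	fmt.Println(proposedCodeSettings)
}
```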

What do you guys think?

Screenshots

No response

@bsofiato bsofiato added the type/proposal label Oct 8, 2024
@bsofiato bsofiato changed the title from "Use a more sane tokenizer for content" to "Use a more sane tokenizer for source code" Oct 8, 2024
@bsofiato bsofiato changed the title from "Use a more sane tokenizer for source code" to "Use a more sane tokenizer for source code search" Oct 9, 2024
@bsofiato
Contributor Author

Hey guys, an update on this issue.

At my workplace, we run a Gitea instance with about 3K repositories. Our L2 and L3 support teams (about 200 people) rely heavily on the code search feature on Gitea. They asked me if I could make the code search case insensitive.

I was thinking of allowing this kind of search. In such a case, case-insensitive matches would be ranked as less relevant than matches where the case matches exactly.

What do you guys think? Do you think it is worthwhile to have this in Gitea's main line? If you guys are cool with it, I'll change the PR #32261 to handle it as well.

@lunny
Member

lunny commented Oct 22, 2024

I think it's currently already case insensitive, at least for the bleve engine. Maybe you mean case sensitive? If so, maybe we can have an option or filter for that. It looks like GitHub code search doesn't support case-sensitive search.

@bsofiato
Contributor Author

Yeah, as a matter of fact, in bleve's case it is indeed already case insensitive (as the screenshot below shows).

[screenshot]

However, in the ES backend the content field is not normalized, so for ES (which is our case) it is indeed case sensitive (as shown below).

[screenshot]

If it is OK with you guys, I'll update #32261 to make the ES content field case insensitive like bleve's (see the sketch below). What do you think?

P.S. Another option would be to create another PR just for this fix. But I think that rebuilding the index might be too much for a patch version.
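For concreteness, here is a minimal sketch of the kind of analyzer change I have in mind on the ES side: adding a `lowercase` token filter to the content analyzer. The analyzer and tokenizer names are illustrative, not the actual mapping in #32261:

```go
package main

import "fmt"

// Hypothetical analyzer definition: same tokenization as before, plus a
// lowercase token filter so that "Log", "LOG" and "log" are all indexed
// (and matched) as the same token, making the ES content search case
// insensitive.
const caseInsensitiveSettings = `{
  "settings": {
    "analysis": {
      "analyzer": {
        "code_analyzer": {
          "type": "custom",
          "tokenizer": "code_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}`

func main() {
	fmt.Println(caseInsensitiveSettings)
}
```

Because ES analyzes match-query terms with the same analyzer at search time by default, no query-side change should be needed.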

@bsofiato
Contributor Author

@lunny I've pushed some changes to #32261 to make ES' search case insensitive :)

@lunny lunny closed this as completed in f64fbd9 Nov 6, 2024
matera-bs pushed a commit to matera-ar/gitea that referenced this issue Nov 14, 2024
…matching when search for code snippets (go-gitea#32261)

This PR improves the accuracy of Gitea's code search.

Currently, Gitea does not consider statements such as `console.log("hello")` as hits when the user searches for `log`. The culprit is how both ES and Bleve tokenize the file contents (in both cases, `console.log` is a whole token).

In ES' case, we changed the tokenizer to [simple_pattern_split](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html), so tokens are runs of letters and digits. In Bleve's case, it now employs a [letter](https://blevesearch.com/docs/Tokenizers/) tokenizer.

Resolves go-gitea#32220

---------

Signed-off-by: Bruno Sofiato <[email protected]>
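The commit above moves Bleve to the letter tokenizer. For reference, here is a self-contained sketch of what that side of the change can look like. This is illustrative only, not the actual Gitea indexer code; the analyzer name "code", the field name "content", and the sample document are made up:

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/letter"
)

func main() {
	// Custom analyzer: the letter tokenizer keeps runs of letters as tokens,
	// so `console.log("Hello")` becomes [console, log, Hello]; the lowercase
	// filter then makes matching case insensitive.
	m := bleve.NewIndexMapping()
	if err := m.AddCustomAnalyzer("code", map[string]interface{}{
		"type":          custom.Name,
		"tokenizer":     letter.Name,
		"token_filters": []string{lowercase.Name},
	}); err != nil {
		log.Fatal(err)
	}
	m.DefaultAnalyzer = "code"

	idx, err := bleve.NewMemOnly(m)
	if err != nil {
		log.Fatal(err)
	}
	defer idx.Close()

	if err := idx.Index("doc1", map[string]string{
		"content": `console.log("Hello")`,
	}); err != nil {
		log.Fatal(err)
	}

	// Searching for "log" now hits the document above.
	q := bleve.NewMatchQuery("log")
	q.SetField("content")
	res, err := idx.Search(bleve.NewSearchRequest(q))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("hits:", res.Total)
}
```

The actual Gitea mapping may apply additional token filters; this sketch only shows the letter tokenizer plus lowercasing.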
matera-bs pushed a commit to matera-ar/gitea that referenced this issue Dec 17, 2024
@go-gitea go-gitea locked as resolved and limited conversation to collaborators Feb 5, 2025