Use a more sane tokenizer for source code search #32220

Closed
bsofiato opened this issue Oct 8, 2024 · 4 comments · Fixed by #32261
Labels
type/proposal The new feature has not been accepted yet but needs to be discussed first.

Comments

@bsofiato
Contributor

bsofiato commented Oct 8, 2024

Feature Description

As of today, the Elasticsearch search backend uses the default analyzer when indexing source code contents. This analyzer breaks tokens on whitespace.

I feel this approach is not particularly suitable for source code search. To illustrate the issue, let us consider the code snippet below:

```java
public Object baz(Foo foo) {
   return foo.bar();
}
```

It is fair to expect that searching for `bar` would return the code above. As of today, however, this is not the case: ES treats `foo.bar()` as a single token, so it will not match the query `bar`.

I suggest we use the pattern tokenizer instead. It uses a regular expression to split tokens; by default, any run of non-word characters acts as a token separator. With it, the snippet `foo.bar()` would yield two tokens, `foo` and `bar`, and the second token would match the query.
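For illustration, here is a minimal sketch of what such index settings could look like. The analyzer and tokenizer names (`code_analyzer`, `code_tokenizer`) are made up for this example, and this is not Gitea's actual ES mapping:

```go
package main

import "fmt"

// Hypothetical ES index settings illustrating the proposal: a "pattern"
// tokenizer splits on runs of non-word characters (\W+ is its default
// pattern), so `foo.bar()` is indexed as the tokens "foo" and "bar".
const proposedCodeSettings = `{
  "settings": {
    "analysis": {
      "tokenizer": {
        "code_tokenizer": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "code_analyzer": {
          "type": "custom",
          "tokenizer": "code_tokenizer"
        }
      }
    }
  }
}`

func main() {
	// Print the settings; in practice they would be sent to ES when the
	// code-search index is created.
	fmt.Println(proposedCodeSettings)
}
```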

What do you guys think?

Screenshots

No response

@bsofiato bsofiato added the type/proposal label Oct 8, 2024
@bsofiato bsofiato changed the title from "Use a more sane tokenizer for content" to "Use a more sane tokenizer for source code" Oct 8, 2024
@bsofiato bsofiato changed the title from "Use a more sane tokenizer for source code" to "Use a more sane tokenizer for source code search" Oct 9, 2024
@bsofiato
Contributor Author

Hey guys, an update on this issue.

At my workplace, we run a Gitea instance with about 3K repositories. Our L2 and L3 support teams (about 200 people) rely heavily on the code search feature on Gitea. They asked me if I could make the code search case insensitive.

I was thinking of allowing this kind of search. In such a case, case-insensitive matches would be ranked as less relevant than matches where the case matches exactly.

What do you guys think? Do you think it is worthwhile to have this in Gitea's main line? If you guys are cool with it, I'll change the PR #32261 to handle it as well.

@lunny
Member

lunny commented Oct 22, 2024

I think it's currently already case insensitive, at least for the bleve engine. Maybe you mean case sensitive? If so, maybe we can have an option or filter for that. It looks like GitHub code search doesn't support case-sensitive search.

@bsofiato
Contributor Author

Yeah, as a matter of fact, in bleve's case it is indeed already case insensitive (as the screenshot below shows).

[screenshot]

However, in the ES backend the content field is not normalized, so for ES (which is our case) it is indeed case sensitive (as shown below).

[screenshot]

If it is OK with you guys, I'll update #32261 to make the ES content field case insensitive like bleve's (see the sketch below). What do you think?

P.S. Another option would be to create another PR just for this fix. But I think that rebuilding the index might be too much for a patch version.
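For concreteness, here is a minimal sketch of the kind of analyzer change I have in mind on the ES side: adding a `lowercase` token filter to the content analyzer. The analyzer and tokenizer names are illustrative, not the actual mapping in #32261:

```go
package main

import "fmt"

// Hypothetical analyzer definition: same tokenization as before, plus a
// lowercase token filter so that "Log", "LOG" and "log" are all indexed
// (and matched) as the same token, making the ES content search case
// insensitive.
const caseInsensitiveSettings = `{
  "settings": {
    "analysis": {
      "analyzer": {
        "code_analyzer": {
          "type": "custom",
          "tokenizer": "code_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}`

func main() {
	fmt.Println(caseInsensitiveSettings)
}
```

Because ES analyzes match-query terms with the same analyzer at search time by default, no query-side change should be needed.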

@bsofiato
Contributor Author

@lunny I've pushed some changes to #32261 to make ES' search case insensitive :)

@lunny lunny closed this as completed in f64fbd9 Nov 6, 2024
matera-bs pushed a commit to matera-ar/gitea that referenced this issue Nov 14, 2024
…matching when search for code snippets (go-gitea#32261)

This PR improves the accuracy of Gitea's code search.

Currently, Gitea does not consider statements such as `console.log("hello")` as hits when the user searches for `log`. The culprit is how both ES and Bleve tokenize the file contents (in both cases, `console.log` is a whole token).

In ES' case, we changed the tokenizer to [simple_pattern_split](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html), so tokens are runs of letters and digits. In Bleve's case, it now employs a [letter](https://blevesearch.com/docs/Tokenizers/) tokenizer.

Resolves go-gitea#32220

---------

Signed-off-by: Bruno Sofiato <[email protected]>
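The commit above moves Bleve to the letter tokenizer. For reference, here is a self-contained sketch of what that side of the change can look like. This is illustrative only, not the actual Gitea indexer code; the analyzer name "code", the field name "content", and the sample document are made up:

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/letter"
)

func main() {
	// Custom analyzer: the letter tokenizer keeps runs of letters as tokens,
	// so `console.log("Hello")` becomes [console, log, Hello]; the lowercase
	// filter then makes matching case insensitive.
	m := bleve.NewIndexMapping()
	if err := m.AddCustomAnalyzer("code", map[string]interface{}{
		"type":          custom.Name,
		"tokenizer":     letter.Name,
		"token_filters": []string{lowercase.Name},
	}); err != nil {
		log.Fatal(err)
	}
	m.DefaultAnalyzer = "code"

	idx, err := bleve.NewMemOnly(m)
	if err != nil {
		log.Fatal(err)
	}
	defer idx.Close()

	if err := idx.Index("doc1", map[string]string{
		"content": `console.log("Hello")`,
	}); err != nil {
		log.Fatal(err)
	}

	// Searching for "log" now hits the document above.
	q := bleve.NewMatchQuery("log")
	q.SetField("content")
	res, err := idx.Search(bleve.NewSearchRequest(q))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("hits:", res.Total)
}
```

The actual Gitea mapping may apply additional token filters; this sketch only shows the letter tokenizer plus lowercasing.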
matera-bs pushed a commit to matera-ar/gitea that referenced this issue Dec 17, 2024
@go-gitea go-gitea locked as resolved and limited conversation to collaborators Feb 5, 2025