
Conversation

@dranikpg (Contributor) commented Dec 1, 2025

Intro:
Only Stephan worked on this code, with me as the reviewer, so a quick recap:

The range tree is essentially a binary search tree that stores double lower_bound -> sorted_vector<pair<double, DocId>> nodes, so we use lower_bound to determine into which sorted vector a new value should be inserted. When those vectors (blocks) grow too large, we split them. See the header comment for more details.
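A minimal sketch of that structure, assuming illustrative names (RangeTreeSketch, kMaxBlockSize) that are not the actual Dragonfly types:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <limits>
#include <map>
#include <utility>
#include <vector>

using DocId = uint64_t;

// One block: a sorted vector of (value, doc id) pairs.
using Block = std::vector<std::pair<double, DocId>>;

class RangeTreeSketch {
 public:
  void Add(double value, DocId id) {
    // Pick the block whose lower bound is the largest key <= value
    // (upper_bound on the keys, then step one node back).
    auto it = std::prev(entries_.upper_bound(value));
    Block& block = it->second;

    // Keep the block sorted by (value, id).
    std::pair<double, DocId> entry{value, id};
    block.insert(std::lower_bound(block.begin(), block.end(), entry), entry);

    // When a block grows past the max size it is split by its median value
    // into a left and a right block -- the operation discussed below.
  }

 private:
  // Lower bound of each block -> its contents; start with one block covering all doubles.
  std::map<double, Block> entries_{{-std::numeric_limits<double>::infinity(), Block{}}};
};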

Cause:
Currently, if a block grows beyond a defined max size, we split it by its median value into a left and a right part. Documents with the same value have to stay inside one block, so the split can be uneven. If we keep adding the same value to the tree, we end up with a single block that is "split" over and over, yet the split never makes it smaller: everything lands back in the same block. This creates a doom loop in which every insertion of that value triggers another fruitless split, making insertion hugely expensive.
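To make the doom loop concrete, here is a hedged sketch of a median split that keeps equal values together (illustrative code, not the PR's actual Split()):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using DocId = uint64_t;
using Block = std::vector<std::pair<double, DocId>>;  // sorted by (value, id)

// Split `block` around its median value, keeping runs of equal values intact.
// The right part is returned; the left part stays in `block`.
Block SplitByMedian(Block& block) {
  double median = block[block.size() / 2].first;
  auto mid = std::lower_bound(
      block.begin(), block.end(), median,
      [](const std::pair<double, DocId>& e, double v) { return e.first < v; });
  Block right(mid, block.end());
  block.erase(mid, block.end());
  return right;
}

// If every entry holds the same value, `mid` is block.begin(): the left part
// becomes empty and the right part is the entire original block, so the split
// achieves nothing and is re-triggered by the very next insertion of that value.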

Solution:

  1. I added a max_seen parameter to detect cases where a block surely consists of only one value.
  2. If a block's lower bound and max_seen bound are equal, it surely contains only one distinct value, so we can avoid splitting it; we call such a large block a "monoblock" (see the sketch after this list).
  3. If we try to insert a different value into a monoblock, we don't do it: we already know how it would split (into N:1), so we just create a new block instead.
  4. Yes, max_seen is only an estimate and, like the lower bound, is not updated on deletes, so the optimization isn't exact. However, Split() now acts as a "purifier" by recomputing the lower/upper bounds, so after a few inefficient splits the bounds become tight enough to separate monovalue blocks.
  5. Added "block collapsing" to avoid keeping around small, almost empty blocks.

Relevant to #6120

Comment on lines 112 to 124
// TODO: consider merging blocks when they become too small.
// The problem is that every mutable operation does a Remove followed by an Add,
// so a single operation could end up doing both a merge and a split.
// Then again, in common cases users may not remove many documents.
// Merge with the left block if both are relatively small and won't be forced to split soon.
if (block.Size() < max_range_block_size_ / 4 && it != entries_.begin()) {
  auto lit = std::prev(it);  // keep `it` pointing at the block being merged away
  auto& lblock = lit->second;
  if (block.Size() + lblock.Size() < max_range_block_size_ / 2) {
    // Move all entries into the left block and drop the now-empty one.
    for (auto e : block)
      lblock.Insert(e);
    entries_.erase(it);
  }
}
}
@dranikpg (Contributor, Author):

Not having any kind of compaction is also problematic, especially for long-running workloads.

@dranikpg dranikpg marked this pull request as ready for review December 2, 2025 12:04