
Conversation

@dranikpg (Contributor) commented Dec 1, 2025

Intro:
Only Stephan worked on this code, with me as the reviewer, so a quick recap:

The range tree is essentially a binary search tree that stores double lower_bound -> sorted_vector<pair<double, DocId>> nodes, so we use lower_bound to determine into which sorted vector a new value should be inserted. When those vectors (blocks) grow too large, we split them. See the header comment for more details.
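A minimal sketch of that structure, assuming illustrative names (RangeTreeSketch, kMaxBlockSize) that are not the actual Dragonfly types:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <limits>
#include <map>
#include <utility>
#include <vector>

using DocId = uint64_t;

// One block: a sorted vector of (value, doc id) pairs.
using Block = std::vector<std::pair<double, DocId>>;

class RangeTreeSketch {
 public:
  void Add(double value, DocId id) {
    // Pick the block whose lower bound is the largest key <= value
    // (upper_bound on the keys, then step one node back).
    auto it = std::prev(entries_.upper_bound(value));
    Block& block = it->second;

    // Keep the block sorted by (value, id).
    std::pair<double, DocId> entry{value, id};
    block.insert(std::lower_bound(block.begin(), block.end(), entry), entry);

    // When a block grows past the max size it is split by its median value
    // into a left and a right block -- the operation discussed below.
  }

 private:
  // Lower bound of each block -> its contents; start with one block covering all doubles.
  std::map<double, Block> entries_{{-std::numeric_limits<double>::infinity(), Block{}}};
};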

Cause:
Currently, if a block grows beyond a defined max size, we split it by its median value into a left and a right part. Documents with the same value have to stay inside one block, so the split can be uneven. If we keep adding the same value to the tree, we end up with a single block that is "split" over and over, yet the split never makes it smaller: everything lands back in the same block. This creates a doom loop in which every insertion of that value triggers another fruitless split, making insertion hugely expensive.
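To make the doom loop concrete, here is a hedged sketch of a median split that keeps equal values together (illustrative code, not the PR's actual Split()):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using DocId = uint64_t;
using Block = std::vector<std::pair<double, DocId>>;  // sorted by (value, id)

// Split `block` around its median value, keeping runs of equal values intact.
// The right part is returned; the left part stays in `block`.
Block SplitByMedian(Block& block) {
  double median = block[block.size() / 2].first;
  auto mid = std::lower_bound(
      block.begin(), block.end(), median,
      [](const std::pair<double, DocId>& e, double v) { return e.first < v; });
  Block right(mid, block.end());
  block.erase(mid, block.end());
  return right;
}

// If every entry holds the same value, `mid` is block.begin(): the left part
// becomes empty and the right part is the entire original block, so the split
// achieves nothing and is re-triggered by the very next insertion of that value.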

Solution:

  1. I added a max_seen parameter to detect cases where a block surely consists of only one value.
  2. If a block's lower bound and max_seen bound are equal, it surely contains only one distinct value, so we can avoid splitting it; we call such a large block a "monoblock" (see the sketch after this list).
  3. If we try to insert a different value into a monoblock, we don't do it: we already know how it would split (into N:1), so we just create a new block instead.
  4. Yes, max_seen is only an estimate and, like the lower bound, is not updated on deletes, so the optimization isn't exact. However, Split() now acts as a "purifier" by recomputing the lower/upper bounds, so after a few inefficient splits the bounds become tight enough to separate monovalue blocks.
  5. Added "block collapsing" to avoid keeping around small, almost empty blocks.

Relevant to #6120

Comment on lines 112 to 124
// TODO: consider merging blocks when they become too small.
// The problem is that every mutable operation does a Remove followed by an Add,
// so a single operation could end up doing both a merge and a split.
// Then again, in common cases users may not remove many documents.
// Merge with the left block if both are relatively small and won't be forced to split soon.
if (block.Size() < max_range_block_size_ / 4 && it != entries_.begin()) {
  auto lit = std::prev(it);  // keep `it` pointing at the block being merged away
  auto& lblock = lit->second;
  if (block.Size() + lblock.Size() < max_range_block_size_ / 2) {
    // Move all entries into the left block and drop the now-empty one.
    for (auto e : block)
      lblock.Insert(e);
    entries_.erase(it);
  }
}
}
@dranikpg (Contributor, Author):

Not having any kind of compaction is also problematic, especially for long-running workloads.

@dranikpg dranikpg marked this pull request as ready for review December 2, 2025 12:04