Skip to content

Bug found in RecursiveDocumentSplitter #9311

@sjtuldl

Description

@sjtuldl

Describe the bug
In the RecursiveDocumentSplitter implementation, there's a bug in the _chunk_text method where the current_length is incorrectly updated after a recursive call to _chunk_text. This happens inside the condition where a split_text is found to be longer than split_length, and recursive chunking is applied. The line:
[current_length += self._chunk_length(split_text)](

current_length += self._chunk_length(split_text)
)
should be removed. This incorrectly updates the current_length even though the split_text is already recursively chunked and appended to chunks. It leads to inaccurate current_length tracking, which can affect how further splits are grouped into chunks.

Error message
No exception is raised, but this results in incorrect chunking logic — chunks may be combined incorrectly or suboptimal chunk sizes may be produced.

Expected behavior
After recursively chunking split_text, we expect the resulting chunks to be appended, and current_length should be reset. Updating current_length in that case is misleading and causes improper chunk merging logic later on.

To Reproduce

from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

def test_recursive_splitter_bug():
    # Sample text crafted to trigger recursive splitting
    long_text = (
        "A" * 150 + "\n\n" +  # triggers first-level split on \n\n
        "B" * 100 + "\n"+ 'B'*105+ "\n\n" +  # this chunk exceeds split_length and goes through recursion
        "C" * 100 + "\n\n" +             # short chunk1
        "D" * 50                         # short chunk2
    )
    #print(long_text)
    doc = Document(content=long_text)
    splitter = NewRecursiveDocumentSplitter(
        split_length=200,
        split_overlap=0,
        split_unit='char',
        separators=["\n\n", "\n", " "]  # no "sentence" to keep it simple
    )
    #splitter.warm_up()

    result = splitter.run([doc])
    chunks = result["documents"]

    for idx, chunk in enumerate(chunks):
        print(f"Chunk {idx + 1} (length: {len(chunk.content)}):")
        print(repr(chunk.content))
        print("---")


if __name__ == "__main__":
    test_recursive_splitter_bug()

current_output:

Chunk 1 (length: 152):
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\n\n'
---
Chunk 2 (length: 101):
'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n'
---
Chunk 3 (length: 107):
'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n\n'
---
Chunk 4 (length: 102):
'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\n\n'
---
Chunk 5 (length: 50):
'DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD'
---

the Chunk4 and Chunk5 should be together after splitting

FAQ Check

System:

  • OS:Rocky linux 9.5
  • GPU/CPU: NVIDIA 3090/AMD 7950X
  • Haystack version (commit or version number):2.12.0
  • DocumentStore:Qdrant
  • Reader:
  • Retriever:

Metadata

Metadata

Assignees

Labels

P1High priority, add to the next sprinttype:bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions