-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Description
Describe the bug
In the RecursiveDocumentSplitter implementation, there's a bug in the _chunk_text method where the current_length is incorrectly updated after a recursive call to _chunk_text. This happens inside the condition where a split_text is found to be longer than split_length, and recursive chunking is applied. The line:
[current_length += self._chunk_length(split_text)](
current_length += self._chunk_length(split_text) |
should be removed. This incorrectly updates the current_length even though the split_text is already recursively chunked and appended to chunks. It leads to inaccurate current_length tracking, which can affect how further splits are grouped into chunks.
Error message
No exception is raised, but this results in incorrect chunking logic — chunks may be combined incorrectly or suboptimal chunk sizes may be produced.
Expected behavior
After recursively chunking split_text, we expect the resulting chunks to be appended, and current_length should be reset. Updating current_length in that case is misleading and causes improper chunk merging logic later on.
To Reproduce
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter
def test_recursive_splitter_bug():
# Sample text crafted to trigger recursive splitting
long_text = (
"A" * 150 + "\n\n" + # triggers first-level split on \n\n
"B" * 100 + "\n"+ 'B'*105+ "\n\n" + # this chunk exceeds split_length and goes through recursion
"C" * 100 + "\n\n" + # short chunk1
"D" * 50 # short chunk2
)
#print(long_text)
doc = Document(content=long_text)
splitter = NewRecursiveDocumentSplitter(
split_length=200,
split_overlap=0,
split_unit='char',
separators=["\n\n", "\n", " "] # no "sentence" to keep it simple
)
#splitter.warm_up()
result = splitter.run([doc])
chunks = result["documents"]
for idx, chunk in enumerate(chunks):
print(f"Chunk {idx + 1} (length: {len(chunk.content)}):")
print(repr(chunk.content))
print("---")
if __name__ == "__main__":
test_recursive_splitter_bug()
current_output:
Chunk 1 (length: 152):
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\n\n'
---
Chunk 2 (length: 101):
'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n'
---
Chunk 3 (length: 107):
'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\n\n'
---
Chunk 4 (length: 102):
'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC\n\n'
---
Chunk 5 (length: 50):
'DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD'
---
the Chunk4 and Chunk5 should be together after splitting
FAQ Check
- Have you had a look at our new FAQ page?
System:
- OS:Rocky linux 9.5
- GPU/CPU: NVIDIA 3090/AMD 7950X
- Haystack version (commit or version number):2.12.0
- DocumentStore:Qdrant
- Reader:
- Retriever: