Fix flaky SpillPool channel test by synchronizing reader and writer tasks #19110
Which issue does this PR close?
`spill::spill_pool::channel` #19058.
Rationale for this change
The `spill_pool` channel test `test_reader_catches_up_to_writer` was flaky due to non-deterministic coordination between the reader and writer tasks. The test used time-based sleeps and polling on shared state to infer when the reader had started and when it had processed a batch. Under varying scheduler timing, this could cause the reader to miss events or observe them in a different order, leading to intermittent failures where the recorded event sequence did not match expectations (for example, observing `3` instead of `5` reads).
Since this test verifies the correctness and wakeup behavior of the spill channel used by the spill pool, flakiness here undermines confidence in the spill mechanism and can cause spurious CI failures.
This PR makes the test coordination explicit and deterministic using `oneshot` channels, and also improves the usage example for the spill channel to show how to run writer and reader concurrently in a robust way.
What changes are included in this PR?
Example: concurrent writer and reader usage
Update the `spill_pool::channel` usage example to:
- Use `writer.push_batch(&batch)?` so the example returns a `Result` and propagates errors correctly.
- Call `drop(writer)` at the end of the writer task to finalize the spill file and wake the reader.
- Use `tokio::join!` to await both tasks and map join errors into `DataFusionError::Execution`.
- Check that the reader consumed every batch (`batches_read == 5`).
The updated example better demonstrates the intended concurrent usage pattern of the spill channel and ensures the reader is correctly woken when the writer finishes.
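The shape of that example can be sketched without DataFusion or tokio. The following is only an analogue under stated assumptions: `std::thread` stands in for spawned tokio tasks, `std::sync::mpsc` stands in for the spill channel, `String` stands in for `DataFusionError`, and `Batch` and `run_writer_and_reader` are hypothetical names that do not appear in this PR.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a record batch.
type Batch = Vec<i32>;

/// Runs a writer and a reader concurrently and returns how many batches the
/// reader received. A panicked thread (the analogue of a join error) is
/// mapped to an error string instead of being unwrapped.
fn run_writer_and_reader(num_batches: usize) -> Result<usize, String> {
    let (writer, reader) = mpsc::channel::<Batch>();

    let writer_task = thread::spawn(move || -> Result<(), String> {
        for i in 0..num_batches {
            // `?` propagates a failed write to the caller of the task.
            writer
                .send(vec![i as i32; 3])
                .map_err(|e| format!("write failed: {}", e))?;
        }
        // Dropping the writer closes the channel, which ends the reader loop.
        drop(writer);
        Ok(())
    });

    let reader_task = thread::spawn(move || -> Result<usize, String> {
        let mut batches_read = 0;
        // `recv` returns Err once the writer is dropped and the queue is empty.
        while let Ok(_batch) = reader.recv() {
            batches_read += 1;
        }
        Ok(batches_read)
    });

    // Wait for both tasks. `map_err` handles a panicked thread; the first `?`
    // propagates that, the second `?` propagates the task's own error.
    writer_task
        .join()
        .map_err(|_| "writer task panicked".to_string())??;
    let batches_read = reader_task
        .join()
        .map_err(|_| "reader task panicked".to_string())??;
    Ok(batches_read)
}
```

Calling `run_writer_and_reader(5)` plays the role of the `batches_read == 5` check. If the writer were never dropped, the reader loop in this sketch would block forever, which is why the explicit `drop` matters.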
Test: make `test_reader_catches_up_to_writer` deterministic
Introduce two `oneshot` channels:
- `reader_waiting_tx/rx` to signal when the reader has started and is pending on its first `next()` call.
- `first_read_done_tx/rx` to signal when the reader has completed processing the first batch.
In the reader task:
- Record `ReadStart` and send on `reader_waiting_tx` before awaiting `reader.next()`.
- After processing the first batch, send on `first_read_done_tx`.
In the test body:
- Wait on `reader_waiting_rx` instead of sleeping for a fixed duration, ensuring the reader is actually pending before writing the first batch.
- Wait on `first_read_done_rx` before issuing the second write.
This establishes a precise and documented sequence of events: the reader records `ReadStart` and signals that it is pending on its first `next()` call, the test writes the first batch, the reader signals that it has processed that batch, and the test then issues the second write.
With this explicit synchronization, the event ordering in the test is stable and no longer depends on scheduler timing or arbitrary sleeps, eliminating the flakiness.
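The handshake can be illustrated with a std-only sketch. This is not the test code from the PR: threads replace tokio tasks, `std::sync::mpsc` channels used for a single message replace `oneshot`, a plain `u32` channel replaces the spill channel, and every event name other than `ReadStart` is a hypothetical label chosen for this illustration.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Runs the two-signal handshake and returns the recorded event log.
fn run_handshake() -> Vec<String> {
    let events = Arc::new(Mutex::new(Vec::new()));
    let (data_tx, data_rx) = mpsc::channel::<u32>();
    // Each of these channels carries exactly one message, like a oneshot.
    let (reader_waiting_tx, reader_waiting_rx) = mpsc::channel::<()>();
    let (first_read_done_tx, first_read_done_rx) = mpsc::channel::<()>();

    let reader_events = Arc::clone(&events);
    let reader = thread::spawn(move || {
        // Record the start and signal it before blocking on the first read.
        reader_events.lock().unwrap().push("ReadStart".to_string());
        reader_waiting_tx.send(()).unwrap();

        let mut first = true;
        while let Ok(batch) = data_rx.recv() {
            reader_events.lock().unwrap().push(format!("Read({})", batch));
            if first {
                // Tell the test body that the first batch has been processed.
                first_read_done_tx.send(()).unwrap();
                first = false;
            }
        }
    });

    // Wait for the reader's signal instead of sleeping for a fixed duration.
    reader_waiting_rx.recv().unwrap();
    events.lock().unwrap().push("Write(1)".to_string());
    data_tx.send(1).unwrap();

    // Only issue the second write once the first read is confirmed.
    first_read_done_rx.recv().unwrap();
    events.lock().unwrap().push("Write(2)".to_string());
    data_tx.send(2).unwrap();

    // Closing the data channel lets the reader loop finish.
    drop(data_tx);
    reader.join().unwrap();

    let log = events.lock().unwrap().clone();
    log
}
```

Because every write is gated on a signal from the reader, the log in this sketch is always `ReadStart`, `Write(1)`, `Read(1)`, `Write(2)`, `Read(2)`, regardless of how the threads are scheduled.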
Are these changes tested?
Yes.
The modified test `test_reader_catches_up_to_writer` continues to run as part of the existing `spill_pool` test suite, but now uses explicit synchronization instead of timing-based assumptions.
The test has been exercised repeatedly to confirm that the intermittent read-count mismatches (`3` vs `5`) no longer occur.
The updated example code compiles and type-checks by returning `datafusion_common::Result` from both spawned tasks and from the combined `tokio::join!` result.
Are there any user-facing changes?
There are no behavior changes to the public API or spill pool semantics.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.