
Conversation


@adriangb adriangb commented Oct 10, 2025

Addresses #17334 (comment)

I ran into this using datafusion-distributed, which I think makes the issue of partition execution time skew even more likely to happen. As noted in that issue, it can also happen with non-distributed queries, e.g. if one partition's sort spills and others don't.

Due to the nature of RepartitionExec I don't think we can bound the channels; that could lead to deadlocks. So what I did was at least make queries that would previously have failed continue forward with disk spilling. I did not account for memory usage when reading batches back from disk, since DataFusion does not generally account for "in-flight" batches.
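
In rough pseudocode, the idea looks something like the sketch below (simplified, hypothetical types rather than the actual DataFusion APIs): before a batch is queued on an output partition's unbounded channel we try to grow a memory reservation, and if that fails we write the batch to its own spill file and queue only the path.

```rust
use std::collections::VecDeque;
use std::path::PathBuf;

/// Stand-in for an Arrow RecordBatch.
struct Batch {
    bytes: usize,
}

/// What each output partition's (unbounded) channel holds: either an in-memory
/// batch or the path of a spill file containing that single batch.
enum RepartitionBatch {
    Memory(Batch),
    Spilled(PathBuf),
}

/// Stand-in for a memory reservation with a fixed budget.
struct Reservation {
    used: usize,
    limit: usize,
}

impl Reservation {
    fn try_grow(&mut self, bytes: usize) -> Result<(), ()> {
        if self.used + bytes <= self.limit {
            self.used += bytes;
            Ok(())
        } else {
            Err(())
        }
    }
}

/// Send path: keep the channel unbounded, but only keep batches in memory while
/// the reservation has room; otherwise spill the batch to its own file and queue
/// the path so memory usage stays within the budget.
fn push_batch(
    batch: Batch,
    queue: &mut VecDeque<RepartitionBatch>,
    reservation: &mut Reservation,
    spill_to_disk: impl Fn(&Batch) -> PathBuf,
) {
    match reservation.try_grow(batch.bytes) {
        Ok(()) => queue.push_back(RepartitionBatch::Memory(batch)),
        Err(()) => {
            // Memory limited: write this single batch to disk and enqueue the path.
            let path = spill_to_disk(&batch);
            queue.push_back(RepartitionBatch::Spilled(path));
        }
    }
}
```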

Written with help from Claude

@github-actions github-actions bot added the physical-plan (Changes to the physical-plan crate) label Oct 10, 2025
@adriangb adriangb changed the title from "Spilling repartition" to "Add spilling to RepartitionExec" Oct 10, 2025
Contributor

@alamb alamb left a comment

Thanks @adriangb -- can you please add some "end to end" tests (i.e. ones that run a real query / LogicalPlan with a memory limit) and show that it still works?

@adriangb
Contributor Author

Good call. I'll have to think about how we can force RepartitionExec specifically to spill; I've only experienced it with large datasets, lopsided hash keys, etc.

@comphead
Contributor

Should we set this PR to draft for now until spilling tests are added?

```rust
(RepartitionBatch::Memory(batch), true)
}
Err(_) => {
    // We're memory limited - spill this single batch to its own file
```
Contributor

would we avoid the out-of-memory error but risk "too many open files" with this approach, or am I missing something?

Contributor Author

Do you mean because of spilling in general, or because of the choice to spill each batch to its own file? In general DataFusion doesn't track the number of files. I'm not sure if we keep the files open between writing and reading; I'd guess not. If we close them we only need as many file descriptors as files we spill concurrently, which should not be many (~ the number of partitions).

Contributor

yes, having a file per batch can get the process killed (by the OS) due to too many open files; also there are about 4 system calls per batch involved.

closing and then reopening files will bring at least 6 system calls per batch.

Ideally, if we could keep the file open and then share offsets after each write, that would save a lot of syscalls, plus we could keep a file per partition. 🤔
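
Roughly something like this std-only sketch (names are made up; the real spill path would go through Arrow IPC, and the writer and reader would likely need separate file handles):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

/// One spill file per output partition, kept open; the writer appends encoded
/// batches and hands (offset, len) pairs to the reader.
fn append_batch(file: &mut File, encoded: &[u8]) -> std::io::Result<(u64, usize)> {
    // Seek to the end and record where this batch starts.
    let offset = file.seek(SeekFrom::End(0))?;
    file.write_all(encoded)?;
    Ok((offset, encoded.len()))
}

/// Read a previously written batch back by its (offset, len).
fn read_batch(file: &mut File, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    file.seek(SeekFrom::Start(offset))?;
    file.read_exact(&mut buf)?;
    Ok(buf)
}
```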

Contributor Author

yes, having a file per batch can get the process killed (by the OS) due to too many open files

I checked and indeed we close the files after writing, so this is not going to happen. The only thing we accumulate is PathBufs, not open file descriptors.

closing and then reopening files will bring at least 6 system calls per batch
Ideally, if we could keep the file open and then share offsets after each write, that would save a lot of syscalls

I think that would be nice, but I can't think of a scheme that would make it work: reading from the end of the file isn't compatible with our use case because we need it to be FIFO, and using a single file only really works if you do LIFO. Do you have any ideas on how we could do this in a relatively simple way?

Given that any query that would spill here would previously have errored out, I'm not too worried about performance. And frankly, if batch sizes are reasonable, I think a couple of extra sys calls won't be measurable.

That said, since how the spilling happens is all very internal / private, there's no reason we can't merge this and then come back and improve it later.

Contributor

Why would the file be LIFO if you send the offset? I'm not sure I understand. Writing a single batch to a file, then opening and closing it, still seems excessive to me.

Contributor

I'm not sure that keeping old batches on disk wouldn't be problematic. That ticket isn't for RepartitionExec but it does show that there is a tradeoff between trying to save on system calls and overall disk usage.

Contributor

@milenkovicm milenkovicm Oct 11, 2025

Shuffle exchange in Spark or Ballista will spill the whole repartition to disk, if I'm not mistaken. I guess spilling in this case would not be different, apart from the fact that it is triggered by memory constraints.

Maybe we could learn something from the new real-time mode in Spark: https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing

Contributor Author

One alternative could be to do a spill file per channel and have some sort of GC process where we say "if the spill file exceeds X GB and/or we have more than Y GB of junk space, we pay the price of copying over into a new spill file to keep disk usage from blowing up". That might be a general approach to handle cases like #18011 as well. But I'm not sure the complexity is worth it: if that compaction happens even once, I feel its cost will vastly exceed the cost of the extra sys calls to create and delete files.
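
A very rough sketch of what that per-channel file plus compaction could look like (hypothetical structure and thresholds, assuming the tempfile crate; not code from this PR):

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};

/// One spill file per channel: batches are appended at `write_offset` and read
/// back starting at `read_offset`; everything before `read_offset` is junk.
struct ChannelSpill {
    file: File,
    read_offset: u64,
    write_offset: u64,
}

impl ChannelSpill {
    /// Once the junk prefix passes a threshold, copy the still-live tail into a
    /// fresh file so disk usage doesn't grow without bound.
    fn maybe_compact(&mut self, junk_threshold: u64) -> io::Result<()> {
        if self.read_offset < junk_threshold {
            return Ok(());
        }
        let live_len = (self.write_offset - self.read_offset) as usize;
        let mut live = vec![0u8; live_len];
        self.file.seek(SeekFrom::Start(self.read_offset))?;
        self.file.read_exact(&mut live)?;
        let mut new_file = tempfile::tempfile()?;
        new_file.write_all(&live)?;
        // Swap in the compacted file; the old one is closed (and deleted) on drop.
        self.file = new_file;
        self.read_offset = 0;
        self.write_offset = live_len as u64;
        Ok(())
    }
}
```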

Contributor

Just one idea: would it be possible to spill before the actual repartition? Try to acquire memory equal to the received batch; if that fails, do not repartition it, spill it instead. That would produce just N files (where N is the number of partitions).

@adriangb
Contributor Author

Should we set this PR to draft for now until spilling tests are added?

Sure thing. Note that there are tests, just not e2e SLT tests.

Not sure how to do it from mobile. Will do next time I'm at my computer.

@adriangb adriangb marked this pull request as draft October 10, 2025 22:10
@github-actions github-actions bot added the core (Core DataFusion crate) label Oct 11, 2025

@adriangb adriangb marked this pull request as ready for review October 13, 2025 15:34
@adriangb adriangb force-pushed the spilling-repartition branch from f9296b6 to 4091979 Compare October 13, 2025 16:23
@adriangb adriangb force-pushed the spilling-repartition branch from 4091979 to bcfd64a Compare October 15, 2025 11:20
@2010YOUY01
Contributor

I have a question: are we assuming that it's not possible to make RepartitionExec memory-constant (i.e., O(n_partitions * batch_size) memory)? Is this due to an engineering limitation, or is it theoretically impossible?
If we can make RepartitionExec use only constant memory, is this spilling RepartitionExec still needed for the datafusion-distributed use case?

@adriangb
Contributor Author

I have a question: are we assuming that it's not possible to make RepartitionExec memory-constant (i.e., O(n_partitions * batch_size) memory)? Is this due to an engineering limitation, or is it theoretically impossible?

If we can make RepartitionExec use only constant memory, is this spilling RepartitionExec still needed for the datafusion-distributed use case?

I think it's theoretically impossible. If one partition is faster than another, the data flowing into the slow partition has to accumulate somewhere. I considered the possibility of making the channels bounded, but I'm afraid that would result in deadlocks.
