Description
The P2P shuffling extension currently has a hard requirement on disk: effectively, the entire dataset is written to and read from disk at least once.
This works roughly as follows (a code sketch follows the list):
- Serialize a batch of shards on the transferring side
- Deserialize the batch of shards on the receiving side
- Group those shards by output partition
- Serialize those output partitions again
- Write the output partitions to disk
- Once required, read the output partitions back from disk and deserialize them
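For illustration, here is a minimal, self-contained sketch of that flow. Every name in it (`serialize`, `transfer`, `receive`, `read`) is a hypothetical stand-in modelled with pickle and plain dicts, not the actual distributed API:

```python
import pickle
from collections import defaultdict

def serialize(obj):
    # Stand-in for the real shard serialization.
    return pickle.dumps(obj)

def deserialize(blob):
    return pickle.loads(blob)

def transfer(batch):
    # Transferring side: serialize a batch of (partition_id, shard) pairs.
    return serialize(batch)

def receive(blob, disk):
    # Receiving side: deserialize the batch, group the shards by output
    # partition, serialize each group again, and "write" it to disk
    # (a dict stands in for the file system here).
    shards = deserialize(blob)
    groups = defaultdict(list)
    for partition_id, shard in shards:
        groups[partition_id].append(shard)
    for partition_id, group in groups.items():
        disk.setdefault(partition_id, []).append(serialize(group))

def read(partition_id, disk):
    # Once required: read the output partition back and deserialize.
    return [shard for blob in disk[partition_id] for shard in deserialize(blob)]
```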
This naturally allows for larger-than-memory shuffles but introduces an unnecessary de-/serialization and disk IO step if the data fits (mostly) into memory.
The current implementation of the disk buffer batches writes to disk so that we do not suffer excessive disk IO overhead, but the buffer has to be flushed before we can read back any data.
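The following is a toy model of that batching and flush behaviour, assuming a simple length-prefixed, append-only file per partition; the real `DiskShardsBuffer` differs in the details:

```python
import os

class BatchedDiskBuffer:
    # Toy model: writes are batched to keep per-write IO overhead low,
    # but everything must be flushed before any read-back.
    def __init__(self, directory, batch_size=16):
        self.directory = directory
        self.batch_size = batch_size
        self.pending = []  # (partition_id, serialized blob) pairs
        os.makedirs(directory, exist_ok=True)

    def write(self, partition_id, blob):
        self.pending.append((partition_id, blob))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Append each pending blob to its partition's file, length-prefixed.
        for partition_id, blob in self.pending:
            path = os.path.join(self.directory, str(partition_id))
            with open(path, "ab") as f:
                f.write(len(blob).to_bytes(8, "big") + blob)
        self.pending.clear()

    def read(self, partition_id):
        self.flush()  # must flush before we can read anything back
        blobs = []
        path = os.path.join(self.directory, str(partition_id))
        with open(path, "rb") as f:
            while header := f.read(8):
                blobs.append(f.read(int.from_bytes(header, "big")))
        return blobs
```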
To allow more efficient shuffling for datasets that fit mostly into memory, we would like to avoid writing all of the data to disk.
Additional notes
As an intermediate step, it is very likely easier to keep the "redundant" serialization step as is and instead allow DiskShardsBuffer.read
to consolidate data that lives on disk with data that is still in memory.
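Building on the toy buffer above, a consolidating read might look like this; only the name `DiskShardsBuffer.read` comes from the actual code base, the body is purely illustrative:

```python
class ConsolidatingDiskBuffer(BatchedDiskBuffer):
    # Hypothetical intermediate step: keep the redundant serialization,
    # but let read() merge what is already on disk with what is still
    # sitting in the in-memory batch, instead of forcing a flush.
    def read(self, partition_id):
        path = os.path.join(self.directory, str(partition_id))
        blobs = []
        if os.path.exists(path):
            with open(path, "rb") as f:
                while header := f.read(8):
                    blobs.append(f.read(int.from_bytes(header, "big")))
        # Append shards that were written but never flushed to disk.
        blobs.extend(blob for pid, blob in self.pending if pid == partition_id)
        return blobs
```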
Avoiding the serialization step entirely for the in-memory case would require us to delay serialization until we are actually writing to disk. This is not impossible, but it needs to be implemented with some care to avoid blocking the event loop.
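One way to exercise that care would be to offload the deferred serialization to a thread via asyncio's default executor, so the CPU-bound pickling does not block the event loop. A rough sketch of the idea, with all names hypothetical:

```python
import asyncio
import pickle

def _append(path, blob):
    # Synchronous, length-prefixed append; runs in a worker thread.
    with open(path, "ab") as f:
        f.write(len(blob).to_bytes(8, "big") + blob)

async def spill(shards, path):
    # Shards stay unserialized in memory; only when we actually spill to
    # disk do we serialize, and both the pickling and the file IO are
    # offloaded to a thread so the event loop stays responsive.
    loop = asyncio.get_running_loop()
    blob = await loop.run_in_executor(None, pickle.dumps, shards)
    await loop.run_in_executor(None, _append, path, blob)
```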