
Optional Disk for P2P shuffling #7572

@fjetter

Description


The P2P shuffling extension currently has a hard requirement on disk: effectively, the entire dataset is written to and read from disk at least once.

This works roughly as follows:

  1. Serialize a batch of shards on the transferring side
  2. Deserialize the batch of shards on the receiving side
  3. Group those shards by output partition
  4. Serialize those output partitions again
  5. Write the output partitions to disk
  6. Once required, read the output partition from disk again and deserialize it

This naturally allows for larger-than-memory shuffles but introduces unnecessary de-/serialization and disk I/O steps if the data fits (mostly) into memory.
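
For illustration, here is a minimal sketch of that pipeline. It uses `pickle` and hypothetical helper names (`transfer`, `receive`, `read_partition`) purely as stand-ins; the real implementation lives in `distributed.shuffle` and uses its own serialization. The point is the two serialization passes and the disk round-trip:

```python
import pickle
from collections import defaultdict
from pathlib import Path

def transfer(shards: list[tuple[int, object]]) -> bytes:
    # (1) Serialize a batch of shards on the transferring side.
    return pickle.dumps(shards)

def receive(payload: bytes, directory: Path) -> None:
    # (2) Deserialize the batch of shards on the receiving side.
    shards = pickle.loads(payload)
    # (3) Group those shards by output partition.
    groups = defaultdict(list)
    for partition_id, shard in shards:
        groups[partition_id].append(shard)
    # (4) + (5) Serialize each group again and append it to that
    # partition's file on disk.
    for partition_id, group in groups.items():
        with open(directory / f"{partition_id}.bin", "ab") as f:
            f.write(pickle.dumps(group))

def read_partition(directory: Path, partition_id: int) -> list[object]:
    # (6) Once required, read the partition back from disk and deserialize.
    shards: list[object] = []
    with open(directory / f"{partition_id}.bin", "rb") as f:
        while True:
            try:
                shards.extend(pickle.load(f))
            except EOFError:
                break
    return shards
```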

The current implementation of the disk buffer batches writes to disk so that we do not suffer excessive disk I/O overhead, but the buffer has to be flushed before we can read back any data.
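
A simplified sketch of that batching behavior, assuming a hypothetical `BatchedDiskBuffer` rather than the actual `DiskShardsBuffer` API: writes accumulate in memory, hit disk in large batches, and nothing can be read back while unflushed shards remain.

```python
import pickle
from collections import defaultdict
from pathlib import Path

class BatchedDiskBuffer:
    """Accumulate shards in memory and write them to disk in large batches."""

    def __init__(self, directory: Path, batch_bytes: int = 8 * 2**20):
        self.directory = directory
        self.batch_bytes = batch_bytes
        self.pending = defaultdict(list)  # output partition -> unflushed shards
        self.pending_size = 0

    def write(self, partition_id: int, shard: bytes) -> None:
        self.pending[partition_id].append(shard)
        self.pending_size += len(shard)
        if self.pending_size >= self.batch_bytes:
            self.flush()

    def flush(self) -> None:
        # One sequential append per partition instead of many tiny writes.
        for partition_id, shards in self.pending.items():
            with open(self.directory / str(partition_id), "ab") as f:
                f.write(pickle.dumps(shards))
        self.pending.clear()
        self.pending_size = 0

    def read(self, partition_id: int) -> list[bytes]:
        # The constraint described above: everything must be flushed to
        # disk before any data can be read back.
        if self.pending_size:
            raise RuntimeError("flush() the buffer before reading")
        shards: list[bytes] = []
        with open(self.directory / str(partition_id), "rb") as f:
            while True:
                try:
                    shards.extend(pickle.load(f))
                except EOFError:
                    break
        return shards
```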

To allow more efficient shuffling for datasets that fit mostly into memory, we would like to avoid writing all of the data to disk.

Additional notes

As an intermediate step, it is very likely easier to keep the "redundant" serialization step as is and instead allow DiskShardsBuffer.read to consolidate data that is on disk with data that is still in memory (see the sketch below).
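
One possible shape for such a consolidating read, written as a standalone sketch rather than against the actual `DiskShardsBuffer` interface (all names here are hypothetical): the read path concatenates whatever has already landed on disk with the shards still sitting in the in-memory write buffer, so a full flush is no longer a precondition for reading.

```python
import pickle
from pathlib import Path

def read_consolidated(
    directory: Path, pending: dict[int, list[object]], partition_id: int
) -> list[object]:
    """Return all shards of one output partition, flushed or not."""
    shards: list[object] = []
    path = directory / str(partition_id)
    if path.exists():
        # Shards that were already flushed: one pickle per flush batch.
        with open(path, "rb") as f:
            while True:
                try:
                    shards.extend(pickle.load(f))
                except EOFError:
                    break
    # Shards still sitting in memory: no flush required before reading.
    shards.extend(pending.pop(partition_id, []))
    return shards
```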
Avoiding the serialization step entirely for the in-memory case would require us to delay serialization until we are actually writing to disk. This is not impossible, but it needs to be implemented with some care to avoid blocking the event loop.
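
For that second variant, a minimal sketch of what "delay serialization until the disk write and keep it off the event loop" could look like; `asyncio.to_thread` stands in here for distributed's own offloading machinery, and the names are again hypothetical:

```python
import asyncio
import pickle
from pathlib import Path

async def flush_unserialized(
    directory: Path, pending: dict[int, list[object]]
) -> None:
    """Serialize and write pending shards without blocking the event loop."""
    # Snapshot and clear the buffer on the event-loop thread so new writes
    # can keep accumulating while the flush is in flight.
    batch = dict(pending)
    pending.clear()

    def _serialize_and_write() -> None:
        # Serialization happens only here, at the moment we actually touch
        # disk, inside a worker thread so the event loop stays responsive.
        for partition_id, shards in batch.items():
            with open(directory / str(partition_id), "ab") as f:
                f.write(pickle.dumps(shards))

    await asyncio.to_thread(_serialize_and_write)
```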

cc @wence- @hendrikmakait
