fcoll/vulcan: add two_phase read_all #12894
Merged
Add an implementation of the read_all operation that uses the two-phase I/O algorithm with even partitioning, i.e. the same basic idea that is used by the write_all operation of the vulcan component. Until now, all components have been using the same 'generic' read_all code, which was based on the approach of the fcoll/dynamic module.
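For readers unfamiliar with the term, even partitioning can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the type `file_domain_t` and the function `even_partition` are hypothetical names, and the sketch assumes the accessed file range is simply split into contiguous, nearly equal byte ranges, one per aggregator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: the contiguous file range [start, end) that one
 * aggregator is responsible for during the I/O phase. */
typedef struct {
    size_t start;
    size_t end;
} file_domain_t;

/* Split [range_start, range_end) into num_aggr contiguous domains whose
 * sizes differ by at most one byte. The first (total % num_aggr) domains
 * receive the extra byte. Assumption: plain byte-granular splitting, with
 * no alignment to stripe or block boundaries. */
file_domain_t even_partition(size_t range_start, size_t range_end,
                             int num_aggr, int aggr_index)
{
    assert(num_aggr > 0 && aggr_index >= 0 && aggr_index < num_aggr);
    assert(range_end >= range_start);

    size_t total = range_end - range_start;
    size_t base  = total / (size_t)num_aggr;   /* minimum domain size */
    size_t rem   = total % (size_t)num_aggr;   /* domains with one extra byte */
    size_t idx   = (size_t)aggr_index;

    /* Number of "extra" bytes handed out to domains before this one. */
    size_t extra_before = idx < rem ? idx : rem;

    file_domain_t d;
    d.start = range_start + idx * base + extra_before;
    d.end   = d.start + base + (idx < rem ? 1 : 0);
    return d;
}
```

With a range of `[100, 110)` and three aggregators, this sketch yields the domains `[100, 104)`, `[104, 107)` and `[107, 110)`. Each aggregator would then read only its own domain and redistribute the data to the processes that requested it.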
In addition to using the 'correct' data partitioning approach for the component, the vulcan read_all implementation also adds some features that were already available for the write_all operation, but not in the generic read_all algorithm used by all components so far. Specifically, it can overlap the execution of the I/O phase and the communication phase, and it can use GPU buffers for aggregation.
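The overlap of the two phases can be pictured as a double-buffered pipeline. The sketch below only illustrates the ordering of operations and is not the code from this PR: `start_read`, `shuffle` and `pipelined_read_all` are hypothetical names, and both phases are synchronous stand-ins that copy bytes between arrays. In a real implementation the read and the data exchange would presumably be non-blocking operations that are completed with a wait before the buffer is reused.

```c
#include <assert.h>
#include <string.h>

#define NUM_BUFFERS 2   /* one buffer being read into, one being shuffled */
#define CHUNK 4         /* bytes handled per cycle in this toy example */

/* Stand-in for the I/O phase: copy the chunk for `cycle` from the
 * simulated file into an aggregation buffer. */
void start_read(const char *file, int cycle, char *buf)
{
    memcpy(buf, file + (size_t)cycle * CHUNK, CHUNK);
}

/* Stand-in for the communication phase: deliver the aggregated chunk
 * to its destination. */
void shuffle(const char *buf, int cycle, char *dest)
{
    memcpy(dest + (size_t)cycle * CHUNK, buf, CHUNK);
}

/* Run num_cycles cycles; returns the number of shuffle steps performed. */
int pipelined_read_all(const char *file, int num_cycles, char *dest)
{
    char bufs[NUM_BUFFERS][CHUNK];
    int steps = 0;

    if (num_cycles <= 0) {
        return 0;
    }

    /* Prologue: fetch the data for cycle 0 before the loop starts. */
    start_read(file, 0, bufs[0]);

    for (int c = 0; c < num_cycles; c++) {
        /* Issue the read for the next cycle into the other buffer ... */
        if (c + 1 < num_cycles) {
            start_read(file, c + 1, bufs[(c + 1) % NUM_BUFFERS]);
        }
        /* ... while the data of the current cycle is being distributed. */
        shuffle(bufs[c % NUM_BUFFERS], c, dest);
        steps++;
    }
    return steps;
}
```

The point of the two buffers is that the read for cycle c+1 never touches the buffer that the communication for cycle c is still using.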
The code has been tested with:
The PR looks complicated, but it's actually not that bad. The first two commits perform some code cleanup and reorganization of the fcoll_vulcan_file_write_all function, with the specific goal of simplifying the code and allowing big chunks of it to be reused for the read_all operation. The third commit then adds the actual new algorithm code for read_all.
As a side note, I noticed one more issue in the code base regarding the registration/deregistration of the ompio progress function, but I will fix that after this PR is merged.