[Parquet] ArrowWriter flush does not work #8534

@PiotrSrebrny

Description

When working with ArrowWriter, I would like to flush buffered rows to disk. However, calling ArrowWriter<W>::flush() flushes only part of the data. The reason is that parquet::file::writer::TrackedWrite, which ArrowWriter uses, wraps the user-supplied writer W in a BufWriter, and this BufWriter is not flushed when ArrowWriter<W>::flush() is called.
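The effect can be reproduced with the standard library alone. The sketch below is not the parquet crate's code: `bytes_reaching_sink` and `PAYLOAD` are hypothetical names, and a `Vec<u8>` stands in for the user-supplied writer W. It only illustrates that a small write stays inside `std::io::BufWriter` until the `BufWriter` itself is flushed.

```rust
use std::io::{BufWriter, Write};

// Hypothetical payload standing in for a small amount of encoded row data.
const PAYLOAD: &[u8] = b"row group bytes";

/// Writes PAYLOAD through a BufWriter and reports how many bytes have
/// reached the underlying sink (a Vec<u8> standing in for the user's writer).
/// `flush_inner` controls whether the BufWriter is flushed first.
fn bytes_reaching_sink(flush_inner: bool) -> usize {
    let mut writer = BufWriter::new(Vec::new());
    // The payload is far smaller than BufWriter's default capacity,
    // so it is held in the internal buffer instead of being forwarded.
    writer.write_all(PAYLOAD).unwrap();
    if flush_inner {
        // Flushing the BufWriter pushes the buffered bytes to the sink.
        writer.flush().unwrap();
    }
    // get_ref() exposes the wrapped sink without flushing anything.
    writer.get_ref().len()
}
```

Calling `bytes_reaching_sink(false)` models a flush that never reaches the BufWriter: the sink is still empty. With `bytes_reaching_sink(true)` the whole payload arrives. Writes larger than the buffer capacity bypass the buffer, which would be consistent with only part of the data being flushed.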

The best solution would be to remove the BufWriter from TrackedWrite and use the user-supplied writer directly. The BufWriter is supposed to batch small writes, but that is not needed when writing to memory, and most operating systems already buffer file writes, so it is redundant. A BufWriter might be beneficial on a bare-metal system, but in that case the user could wrap their writer in a BufWriter before passing it to ArrowWriter. In any case, I guess that DataFusion is not often run on bare metal.
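A minimal sketch of what this proposal could look like, using only the standard library. `CountingWrite` is a hypothetical stand-in for a pass-through TrackedWrite, not the actual parquet type, and the two helper functions are illustrative. The wrapper forwards writes and flushes straight to the user's writer, so buffering becomes an opt-in choice made by the caller.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical pass-through wrapper: counts bytes, adds no buffering.
struct CountingWrite<W: Write> {
    inner: W,
    bytes_written: usize,
}

impl<W: Write> Write for CountingWrite<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Forward directly to the user's writer and record what it accepted.
        let n = self.inner.write(buf)?;
        self.bytes_written += n;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        // Flush propagates to whatever the user supplied.
        self.inner.flush()
    }
}

/// Without any BufWriter, bytes reach the sink as soon as they are written.
fn unbuffered_sink_len(payload: &[u8]) -> usize {
    let mut w = CountingWrite { inner: Vec::new(), bytes_written: 0 };
    w.write_all(payload).unwrap();
    w.inner.len()
}

/// A caller who wants buffering wraps the sink themselves; one flush on the
/// wrapper is then enough. Returns sink length (before flush, after flush).
fn opt_in_buffered_sink_len(payload: &[u8]) -> (usize, usize) {
    let mut w = CountingWrite { inner: BufWriter::new(Vec::new()), bytes_written: 0 };
    w.write_all(payload).unwrap();
    let before = w.inner.get_ref().len();
    w.flush().unwrap();
    let after = w.inner.get_ref().len();
    (before, after)
}
```

For a small payload, `unbuffered_sink_len` reports the full length immediately, while `opt_in_buffered_sink_len` shows an empty sink before the flush and the full payload after it. The point of the design is that flush semantics stay under the caller's control either way.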
