Add option for output tables to be written as parquet files.  #762

@stefancoe

Is your feature request related to a problem? Please describe.
The current output options are csv and h5. Parquet is a better file-based option than csv for both size and speed. PSRC's trip table is 2,100,000 KB as a csv compared to 459,700 KB as a parquet file. Loading the csv into a pandas DataFrame on my laptop takes 20 seconds, compared to 3 seconds for the parquet file. ActivitySim already uses parquet to store pipeline files.

Describe the solution you'd like
Currently, there is a config setting called 'h5_store' that writes outputs to h5 when set to True and to csv when set to False or omitted, so csv is the default. I propose adding a setting called 'file_type' that accepts one of three values: 'csv', 'h5', or 'parquet'. Its default would also be 'csv'. The 'h5_store' setting would remain, and its current behavior would be unchanged. The two settings would interact as follows:

  • When h5_store is set to True, outputs are written to h5.
  • When h5_store is set to False (default) and file_type is not specified, outputs are written as csv.
  • When h5_store is set to False (default) and file_type is specified, outputs are written in the format that file_type names: csv, parquet, or h5.
  • file_type is validated against the allowed values (csv, parquet, h5) using pydantic. If the setting is included with an invalid value, ActivitySim will fail almost immediately with a useful error message.
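
The rules above can be sketched as a small function. This is only an illustration of the proposed precedence, not the code on the fork: resolve_output_format and ALLOWED_FILE_TYPES are hypothetical names, and a plain membership check stands in for the pydantic validation so the snippet has no dependencies. It also assumes that h5_store set to True wins even when file_type is given, which is how the first rule reads.

```python
from typing import Optional

# Allowed values for the proposed 'file_type' setting (from the list above).
ALLOWED_FILE_TYPES = ("csv", "parquet", "h5")


def resolve_output_format(h5_store: bool = False, file_type: Optional[str] = None) -> str:
    """Return the output format implied by the two settings (hypothetical helper)."""
    # Validate first, so an invalid value fails fast even when h5_store is True.
    # The issue proposes pydantic for this; a plain check is used here instead.
    if file_type is not None and file_type not in ALLOWED_FILE_TYPES:
        raise ValueError(
            f"file_type must be one of {ALLOWED_FILE_TYPES}, got {file_type!r}"
        )
    # Assumption: h5_store keeps its existing behavior and takes precedence.
    if h5_store:
        return "h5"
    # With no file_type given, fall back to the current default.
    return file_type or "csv"
```

Under these assumptions, resolve_output_format() gives 'csv', resolve_output_format(file_type="parquet") gives 'parquet', and resolve_output_format(h5_store=True) gives 'h5'. A value such as 'feather' raises a ValueError before any output is written.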

Describe alternatives you've considered
Another option would be to add a boolean setting like use_parquet, but conflicts would arise if both it and h5_store were set to True in a config file. If this request is accepted and we go with file_type, it may make sense to deprecate the h5_store setting at some point, especially if more file types are supported in the future.

Additional context
I have made these changes on a fork and will issue a pull request.

Labels: Feature (New feature or request)