[ENH] enable large data use cases - decouple data input from pandas, allow polars, dask, and/or spark #1685

@fkiraly

Description

A key limitation of the current architecture seems to be the reliance on pandas for input, which limits usability in large-data use cases.

While torch with appropriate backends should be able to handle large data, pandas as the container choice, in particular the current in-memory instantiation, will prove to be the bottleneck.

We should therefore consider and implement support for data backends that scale better, such as polars, dask, or spark, and evaluate how easily the pandas/pyarrow integration could be made to work.

Architecturally, I think we should:

  • build a more abstract data loader layer (see the sketch after this list)
  • make pandas one of multiple potential data soft dependencies
  • try to prioritize the solution that would provide us with the quickest "impact for time invested"
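
As a rough illustration of the first point, below is a minimal sketch of what a backend-agnostic adapter layer could look like. All names here (`DataBackendAdapter`, `PolarsAdapter`, and the method names) are hypothetical and not existing pytorch-forecasting API; the idea is that pandas, polars, dask, or spark adapters would each implement the same small interface, and the dataset layer would only ever talk to that interface.

```python
from typing import Protocol

import numpy as np
import polars as pl


class DataBackendAdapter(Protocol):
    """Hypothetical minimal interface the dataset layer would query, independent of backend."""

    def column_names(self) -> list[str]: ...

    def group_frame(self, group_col: str, value) -> object: ...

    def to_numpy(self, columns: list[str]) -> np.ndarray: ...


class PolarsAdapter:
    """Illustrative polars implementation of the adapter interface."""

    def __init__(self, frame: pl.DataFrame) -> None:
        self.frame = frame

    def column_names(self) -> list[str]:
        return self.frame.columns

    def group_frame(self, group_col: str, value) -> pl.DataFrame:
        # restrict to a single time series without leaving the polars backend
        return self.frame.filter(pl.col(group_col) == value)

    def to_numpy(self, columns: list[str]) -> np.ndarray:
        # materialize only the columns needed for tensor conversion
        return self.frame.select(columns).to_numpy()
```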

The key entry point for this extension or refactor is TimeSeriesDataSet, which currently requires a pandas DataFrame to be passed.
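
For context, a minimal sketch of the current situation (the parquet path and column names are placeholders): a non-pandas frame has to be fully materialized into pandas before it can reach TimeSeriesDataSet, which is exactly the in-memory bottleneck described above.

```python
import polars as pl

from pytorch_forecasting import TimeSeriesDataSet

# polars can scan the data lazily / out-of-core ...
lazy = pl.scan_parquet("large_timeseries.parquet")  # placeholder path

# ... but the current entry point forces full materialization into a pandas DataFrame
dataset = TimeSeriesDataSet(
    lazy.collect().to_pandas(),  # the in-memory bottleneck
    time_idx="time_idx",         # placeholder column names
    target="value",
    group_ids=["series_id"],
    max_encoder_length=24,
    max_prediction_length=6,
)
```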
