A key limitation of the current architecture is its reliance on pandas for input data, which limits usability on large datasets.
While torch with appropriate backends should be able to handle large data, pandas as the container choice, in particular the current fully in-memory instantiation, will become the bottleneck.
We should therefore consider and implement support for data backends that scale better, such as polars, dask, or spark, and investigate how easily the pandas–pyarrow integration can be made to work.
Architecturally, I think we should:
- build a more abstract data loader layer
- make pandas one of several optional (soft) data backend dependencies
- prioritize the solution with the best "impact for time invested"
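As a sketch of what the abstract data loader layer could look like (all names below are hypothetical, not existing pytorch-forecasting API), a structural `Protocol` defines the minimal surface the dataset layer codes against, and each backend (pandas, polars, pyarrow) would ship an adapter implementing it:

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class TableBackend(Protocol):
    """Hypothetical minimal contract the dataset layer would depend on.

    pandas, polars, and pyarrow adapters would each implement this, making
    pandas one optional backend among several rather than a hard dependency.
    """

    def column(self, name: str) -> list: ...

    def num_rows(self) -> int: ...


class InMemoryBackend:
    """Trivial dict-of-lists backend used only to illustrate the contract."""

    def __init__(self, data: dict[str, list]) -> None:
        self._data = data

    def column(self, name: str) -> list:
        return list(self._data[name])

    def num_rows(self) -> int:
        return len(next(iter(self._data.values()), []))


backend = InMemoryBackend({"time_idx": [0, 1, 2], "target": [1.0, 2.0, 3.0]})
assert isinstance(backend, TableBackend)  # structural check via runtime_checkable
```

Coding the dataset internals against such a protocol, rather than against `pd.DataFrame` directly, is what would make pandas a soft dependency.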
The key entry point for this extension or refactor is `TimeSeriesDataSet`, which currently requires pandas objects to be passed.