A key limitation of the current architecture is its reliance on pandas for input data, which limits usability on large datasets.
While torch with appropriate backends should be able to handle large data, pandas as the container choice, in particular the current fully in-memory instantiation, will become the bottleneck.
We should therefore consider and implement support for data backends that scale better, such as polars, dask, or spark, and investigate how easily the pandas–pyarrow integration can be made to work.
Architecturally, I think we should:
- build a more abstract data loader layer
- make pandas one of several optional (soft) data backend dependencies
- prioritize the solution with the best "impact for time invested"
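As a sketch of what the abstract data loader layer could look like (all names below are hypothetical, not existing pytorch-forecasting API), a structural `Protocol` defines the minimal surface the dataset layer codes against, and each backend (pandas, polars, pyarrow) would ship an adapter implementing it:

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class TableBackend(Protocol):
    """Hypothetical minimal contract the dataset layer would depend on.

    pandas, polars, and pyarrow adapters would each implement this, making
    pandas one optional backend among several rather than a hard dependency.
    """

    def column(self, name: str) -> list: ...

    def num_rows(self) -> int: ...


class InMemoryBackend:
    """Trivial dict-of-lists backend used only to illustrate the contract."""

    def __init__(self, data: dict[str, list]) -> None:
        self._data = data

    def column(self, name: str) -> list:
        return list(self._data[name])

    def num_rows(self) -> int:
        return len(next(iter(self._data.values()), []))


backend = InMemoryBackend({"time_idx": [0, 1, 2], "target": [1.0, 2.0, 3.0]})
assert isinstance(backend, TableBackend)  # structural check via runtime_checkable
```

Coding the dataset internals against such a protocol, rather than against `pd.DataFrame` directly, is what would make pandas a soft dependency.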
The key entry point for this extension or refactor is `TimeSeriesDataSet`, which currently requires pandas objects to be passed.