In the past, nowcasting_dataset was designed to feed data on-the-fly into a PyTorch model. This meant that, as the data flowed through nowcasting_dataset, it would change type. For example, satellite data would start as an xr.DataArray, then get turned into a numpy array (because PyTorch doesn't know what to do with an xr.DataArray), and then get turned into a torch.Tensor.
But, I think we can safely say now that nowcasting_dataset is just for pre-preparing batches (not for loading data on-the-fly). As such, we can probably simplify the code by keeping data in a single container type per modality. For example, satellite data could always live in an xr.DataArray for its entire life while flowing through nowcasting_dataset.
Sorry, I really should've thought of this earlier! But, yeah, I think this could simplify the code quite a lot.
I haven't fully thought through the implications of this, but some changes might be:
- In the Pydantic models, each field can be just a single type (instead of a Union of types). So, for example, instead of sat_data: Array = Field(... we can just do sat_data: xr.DataArray = Field(...
- We can get rid of the to_numpy function.
- For all the modalities which use xarray or pandas data types, we can use dimension names instead of indexes. e.g. seq_length = len(sat_data[-4]) becomes seq_length = len(sat_data.time)
- Saving and loading data to/from disk becomes super-simple.
- We'd no longer need the to_xr_dataset and from_xr_dataset methods (which are quite fiddly).
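
To make the dimension-name point a bit more concrete, here's a rough sketch. The dimension names ("time", "y", "x", "channels"), the array shape, and the timestamps are all made up for illustration; they are not taken from the actual nowcasting_dataset code. The sketch uses .shape[-4] for the positional lookup, which is just one way of reading a length by axis position.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical satellite batch: 6 timesteps, 4x4 pixels, 2 channels.
# Dimension names and sizes are assumptions for this sketch only.
time = pd.date_range("2021-01-01 12:00", periods=6, freq="5min")
sat_data = xr.DataArray(
    np.zeros((6, 4, 4, 2), dtype=np.float32),
    dims=("time", "y", "x", "channels"),
    coords={"time": time},
)

# Positional lookup: only correct while "time" is the 4th-from-last axis.
seq_length_by_position = sat_data.shape[-4]

# Name-based lookup: independent of axis order.
seq_length_by_name = len(sat_data.time)

# If the axes get reordered, the positional lookup silently returns the
# length of a different dimension, while the name-based one still works.
transposed = sat_data.transpose("channels", "time", "y", "x")
wrong_length = transposed.shape[-4]
right_length = len(transposed.time)

# A plain numpy array is still one attribute access away if some
# downstream code needs it.
as_numpy = sat_data.values
```

Whatever consumes the pre-prepared batches can do the conversion to numpy (or to a torch.Tensor) at the very last step, so nowcasting_dataset itself wouldn't have to juggle multiple container types.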