Implement a thin "data loading" layer to help ML training #97

JackKelly · 2021-09-07T19:34:13Z

The bulk of nowcasting_dataset is about saving pre-prepared batches of data to disk. But it feels like there's perhaps a need for a set of simple tools to help load those pre-prepared batches into ML models during training. This has come up a few times in other issues:

- As mentioned in Use NamedTensors to name dimensions #25, perhaps write an xarray accessor function so we can go straight from xarray to PyTorch named tensors.
- Train on subsets of the pre-prepared batch. e.g.
  - Option to change the length of history data to show to the model. e.g. always save 1 hour of history to disk. But then have a simple param to reduce that to, say, half an hour for some ML experiments (suggested by @jacobbieker in Add Rainfall Radar data #80)
  - Always save large satellite images (see Use 'large crop' of satellite image to provide a wider geographical context #87) with every example, but provide the option to pick just the centre crop during ML training. Or to down-sample the large image and provide native-res centre crop. (Also see https://github.com/openclimatefix/predict_pv_yield/issues/65)
- Maybe explore spatial reprojection on-the-fly (almost certainly too expensive; but might be worth a try). See Spatially reproject satellite data within nowcasting_dataset #92
- Standard way to load batches into a PyTorch DataLoader.
- Data augmentation (like flipping the image, see #78)
- Position encoding (e.g. converting the latitude & longitude of each PV system to an encoding that's ready for use in a fully-attentional model). See Encode "real world" positions of the input data perceiver-pytorch#9
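
To make the "train on subsets" idea concrete, here is a minimal sketch of the two subsetting options (shorter history, centre crop). It is not the nowcasting_dataset API: the function names `subset_history` and `centre_crop`, the `(example, time, y, x)` array layout, and the 5-minute timestep are all assumptions made for illustration.

```python
import numpy as np


def subset_history(batch: np.ndarray, n_history_steps: int, n_keep: int) -> np.ndarray:
    """Keep only the most recent `n_keep` history timesteps, plus every forecast step.

    Assumes `batch` has shape (example, time, y, x) and that the first
    `n_history_steps` entries along the time axis are history.
    """
    if not 0 < n_keep <= n_history_steps:
        raise ValueError("n_keep must be between 1 and n_history_steps")
    return batch[:, n_history_steps - n_keep:]


def centre_crop(batch: np.ndarray, size: int) -> np.ndarray:
    """Return the central `size` x `size` pixels of every image in the batch."""
    height, width = batch.shape[-2:]
    if size > min(height, width):
        raise ValueError("crop size is larger than the image")
    top = (height - size) // 2
    left = (width - size) // 2
    return batch[..., top:top + size, left:left + size]


# Hypothetical batch: 2 examples, 12 history + 6 forecast timesteps
# (1 hour of history, assuming 5-minute data), 64 x 64 pixel images.
batch = np.arange(2 * 18 * 64 * 64, dtype=np.float32).reshape(2, 18, 64, 64)

# Reduce history to half an hour (6 steps) and keep the central 32 x 32 pixels.
small = centre_crop(subset_history(batch, n_history_steps=12, n_keep=6), size=32)
print(small.shape)  # (2, 12, 32, 32)
```

Both functions return views rather than copies, so they are cheap to apply to a batch that is already in memory.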

Tasks like subsetting the data should be done as upstream as possible, so we only load from disk the data we want.
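
One way to picture "only load from disk the data we want" is a memory-mapped read, where slicing happens before any pixel data is pulled into memory. This is only a sketch: the `.npy` file stands in for whatever on-disk format the batches actually use, and the array shape and slice bounds are made up for the example.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "batch.npy"

    # Stand-in for a pre-prepared batch: (example, time, y, x).
    full = np.random.default_rng(0).random((4, 18, 64, 64), dtype=np.float32)
    np.save(path, full)

    # mmap_mode="r" maps the file rather than reading it all into memory.
    lazy = np.load(path, mmap_mode="r")

    # Slice first, then materialise: only the selected region is copied into RAM.
    subset = np.array(lazy[:, 6:, 16:48, 16:48])

    # Release the memory map before the temporary directory is removed.
    del lazy

print(subset.shape)  # (4, 12, 32, 32)
```

Opening a dataset lazily with xarray and indexing it before loading follows a similar pattern.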

Tasks like data augmentation could perhaps be done in PyTorch 'transforms'?
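
A transform in this style is just a callable that takes an example and returns a modified one. Below is a hedged sketch of a random flip written with NumPy only; the class name `RandomHorizontalFlip`, its parameters, and the assumption that the last axis is the x axis are illustrative and not taken from any existing code in this repo.

```python
import numpy as np


class RandomHorizontalFlip:
    """Flip an image along its last (x) axis with probability `p`."""

    def __init__(self, p: float = 0.5, seed=None):
        self.p = p
        self.rng = np.random.default_rng(seed)

    def __call__(self, image: np.ndarray) -> np.ndarray:
        if self.rng.random() < self.p:
            # Copy so the result is contiguous rather than a negative-stride view.
            return image[..., ::-1].copy()
        return image


image = np.arange(12, dtype=np.float32).reshape(1, 3, 4)  # (channel, y, x)
always_flip = RandomHorizontalFlip(p=1.0)
never_flip = RandomHorizontalFlip(p=0.0)

flipped = always_flip(image)
unchanged = never_flip(image)
print(flipped[0, 0])  # [3. 2. 1. 0.]
```

A callable like this could be applied to each example inside a Dataset's `__getitem__`. Any per-pixel coordinates stored alongside the image would need the same flip.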


* Add customizable required keys
* Add customizable required keys
* Add TODO ideas relating to #97
* Start on subsetting data
* Add customizable required keys
* Add customizable required keys
* Add TODO ideas relating to #97
* Start on subsetting data
* Readd required keys
* Run black formatting
* Add init
* Add subsetting temporal data
* Rename to better reflect current index
* Add some higher imports
* Add Example
* Fix circular import, subset time sin/cos etc.
* Remove duplicated file
* Update constants
* Run black
* Remove todo
* Split out subselecting into own function
* Move datetime feature names to required_keys
* Add check for required keys
* Add unittest for subselect_data
* Update docstring
* Add docstring
* Update version
* Import more constants
* Add 30 second explanation
* Remove extra checks, PR comment
* Update subselect_data for xarray
* Update with simpler version
* Passing test
* Make test shorter, add test file
* Update nowcasting_dataset/dataset/datasets.py (Co-authored-by: Jack Kelly)
* Update nowcasting_dataset/dataset/datasets.py (Co-authored-by: Jack Kelly)
* Update nowcasting_dataset/dataset/datasets.py (Co-authored-by: Jack Kelly)
* Reduce code duplication in subselect
* Change how Datetimes selected
* Fix positional arg
* Simplify a bit further
* Simplify a bit further
* Fix error (Co-authored-by: Jack Kelly)

JackKelly · 2021-10-22T11:15:35Z

This is implemented by nowcasting_dataloader so I'll close this issue for now (unless I've misunderstood?!)

JackKelly added the enhancement (New feature or request) and data (New data source or feature; or modification of existing data source) labels Sep 7, 2021

JackKelly added this to the WP1 essential tasks milestone Sep 7, 2021

This was referenced Sep 7, 2021

Encode "real world" positions of the input data openclimatefix/perceiver-pytorch#9

Closed

Add Rainfall Radar data #80

Open

jacobbieker added a commit that referenced this issue Sep 14, 2021

Add TODO ideas relating to #97

60ac61f

jacobbieker added a commit that referenced this issue Sep 14, 2021

Add TODO ideas relating to #97

83c7fc5

peterdudfield removed this from the WP1 essential tasks milestone Sep 24, 2021

JackKelly mentioned this issue Oct 8, 2021

"Big new design" for nowcasting_dataset #213

Closed

38 tasks

flowirtz added this to Nowcasting Oct 15, 2021

flowirtz moved this to Todo in Nowcasting Oct 15, 2021

JackKelly closed this as completed Oct 22, 2021

Repository owner moved this from Todo to Done in Nowcasting Oct 22, 2021
