This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Implement a thin "data loading" layer to help ML training #97

Closed
Tracked by #213
JackKelly opened this issue Sep 7, 2021 · 1 comment
Labels
data: New data source or feature; or modification of existing data source
enhancement: New feature or request

Comments

@JackKelly
Member

JackKelly commented Sep 7, 2021

The bulk of nowcasting_dataset is about saving pre-prepared batches of data to disk. But it feels like there's perhaps a need for a set of simple tools to help load those pre-prepared batches into ML models during training. This has come up a few times in other issues:

Tasks like subsetting the data should be done as upstream as possible, so we only load from disk the data we want.

Tasks like data augmentation could perhaps be done in PyTorch 'transforms'?
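
To make the two ideas above concrete, here is a minimal sketch, not the nowcasting_dataset implementation. It assumes a batch is saved as a single `.npy` array with axes (example, time, channel, y, x); the file layout, `load_subset`, and `RandomHorizontalFlip` are all hypothetical names chosen for illustration. The loader memory-maps the file so that only the requested timesteps and channels are read from disk, and the augmentation is a plain callable in the style of a PyTorch transform (the flip itself is just a placeholder augmentation).

```python
import tempfile
from pathlib import Path

import numpy as np


def load_subset(path, n_timesteps, channels):
    """Load only the requested timesteps and channels of one saved batch.

    Assumed (hypothetical) layout: (example, time, channel, y, x).
    """
    # mmap_mode="r" maps the file rather than reading it all into memory.
    batch = np.load(path, mmap_mode="r")
    # Subset before copying: only the selected data ends up in RAM.
    return np.array(batch[:, :n_timesteps, channels])


class RandomHorizontalFlip:
    """Flip the last (x) axis of the whole batch with probability p."""

    def __init__(self, p=0.5, seed=None):
        self.p = p
        self.rng = np.random.default_rng(seed)

    def __call__(self, batch):
        if self.rng.random() < self.p:
            return batch[..., ::-1].copy()
        return batch


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "batch_000.npy"
        # Fake pre-prepared batch: 2 examples, 6 timesteps, 3 channels, 4x4 pixels.
        full = np.arange(2 * 6 * 3 * 4 * 4, dtype=np.float32).reshape(2, 6, 3, 4, 4)
        np.save(path, full)
        subset = load_subset(path, n_timesteps=4, channels=[0, 2])
    # p=1.0 forces the flip so the effect is deterministic here.
    flipped = RandomHorizontalFlip(p=1.0)(subset)
    return full, subset, flipped
```

Because the transform is just a callable, it could be handed to a dataset class as an optional `transform` argument and applied after loading, which keeps augmentation separate from the on-disk subsetting step.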

@JackKelly JackKelly added the enhancement and data labels Sep 7, 2021
@JackKelly JackKelly added this to the WP1 essential tasks milestone Sep 7, 2021
jacobbieker added a commit that referenced this issue Sep 14, 2021
jacobbieker added a commit that referenced this issue Sep 14, 2021
jacobbieker added a commit that referenced this issue Sep 17, 2021
* Add customizable required keys

* Add customizable required keys

* Add TODO ideas relating to #97

* Start on subsetting data

* Add customizable required keys

* Add customizable required keys

* Add TODO ideas relating to #97

* Start on subsetting data

* Readd required keys

* Run black formatting

* Add init

* Add subsetting temporal data

* Rename to better reflect current index

* Add some higher imports

* Add Example

* Fix circular import, subset time sin/cos etc.

* Remove duplicated file

* Update constants

* Run black

* Remove todo

* Split out subselecting into own function

* Move datetime feature names to required_keys

* Add check for required keys

* Add unittest for subselect_data

* Update docstring

* Add docstring

* Update version

* Import more constants

* Add 30 second explanation

* Remove extra checks, PR comment

* Update subselect_data for xarray

* Update with simpler version

* Passing test

* Make test shorter, add test file

* Update nowcasting_dataset/dataset/datasets.py

Co-authored-by: Jack Kelly <[email protected]>

* Update nowcasting_dataset/dataset/datasets.py

Co-authored-by: Jack Kelly <[email protected]>

* Update nowcasting_dataset/dataset/datasets.py

Co-authored-by: Jack Kelly <[email protected]>

* Reduce code duplication in subselect

* Change how Datetimes selected

* Fix positional arg

* Simplify a bit further

* Simplify a bit further

* Fix error

Co-authored-by: Jack Kelly <[email protected]>
@peterdudfield peterdudfield removed this from the WP1 essential tasks milestone Sep 24, 2021
@flowirtz flowirtz moved this to Todo in Nowcasting Oct 15, 2021
@JackKelly
Member Author

This is implemented by nowcasting_dataloader so I'll close this issue for now (unless I've misunderstood?!)

Repository owner moved this from Todo to Done in Nowcasting Oct 22, 2021