This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Discussion: For testing, should we use "fake" data or a small amount of real data? #512

@JackKelly

Description


(Let's not worry about this now... just making a note to discuss in early 2022!)

For "fake" data to be useful for testing, it needs to accurately capture almost all of the structure of the real data. Otherwise the fake data can drive us to incorrect conclusions when debugging and testing our code (as happened when debugging the OpticalFlowDatasource tests).

Creating genuinely realistic fake data is probably quite a lot of effort (for example, see issue #511).

I suppose I'm curious whether it might actually be less work to use a small amount of real data for testing, instead of maintaining code that creates "fake" data on the fly, and to include this sample of real data in the nowcasting_dataset/tests/data/ folder?
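If we went that route, loading the committed sample in a test could look something like this. This is only a minimal sketch: the `tests/data` path and the helper name are illustrative assumptions, not the repo's actual layout.

```python
# Sketch of reading a small committed sample of real data in tests, instead of
# generating fake data on the fly. File names and paths here are hypothetical.
from pathlib import Path

# In the repo this would live at nowcasting_dataset/tests/data/; here we just
# build the path relative to the working directory for illustration.
TEST_DATA_DIR = Path("tests") / "data"


def load_sample(filename: str) -> Path:
    """Return the path to a committed sample file, failing loudly if missing."""
    path = TEST_DATA_DIR / filename
    if not path.exists():
        raise FileNotFoundError(f"Sample test data not found: {path}")
    return path
```

A test would then open the returned path with whatever reader the data source normally uses (e.g. xarray for Zarr/NetCDF files), so the test exercises the same I/O code as production.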

Strictly speaking, we're not allowed to share some of our data sources. But maybe it wouldn't be too much work to obfuscate a small amount of real data: for example, PV locations could be replaced with the LSOA locations that we're allowed to share publicly, and, for the other data sources, we could add a small amount of random noise to all the data?
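The noise-based obfuscation suggested above could be sketched like this (a hypothetical illustration, not the repo's actual code): add zero-mean Gaussian noise scaled to a fraction of the field's standard deviation, so the statistical structure is roughly preserved while the exact values are no longer shareable data.

```python
# Sketch: obfuscate a sample of real data by adding zero-mean Gaussian noise
# whose scale is a fraction of the data's own standard deviation.
# The function name and the 5% default are assumptions for illustration.
import numpy as np


def obfuscate(data: np.ndarray, noise_fraction: float = 0.05, seed: int = 42) -> np.ndarray:
    """Return a copy of `data` with noise of scale noise_fraction * std added."""
    rng = np.random.default_rng(seed)
    scale = noise_fraction * data.std()
    return data + rng.normal(0.0, scale, size=data.shape)
```

A fixed seed keeps the committed sample reproducible; whether 5% noise is enough to satisfy the data licences is a separate question that would need checking per data source.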
