Description
NOTE: I am creating this issue as a discussion ground for the proposal.
Requirements
Given a dataset, we must be able to sample instance sets under certain constraints. For instance, given a dataset of images and their class labels, consider the following two samplings.
Sampling 1 - Sample a pair of images from two distinct classes or a pair of images from the same class.
Sampling 2 - Sample a set of `k` images from the dataset along with another image to test this k-subset against (I'll spare the details of what exactly "testing" means here). The constraint is that the test image must come from a class that is present in the sampled k-subset. An alternative view is to sample `k+1` images from the dataset such that at least 2 images are from the same class, and use one of those images as the test image.
If you are not convinced why the above kinds of samplings might be needed, I can provide references to representative literature.
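To make the two samplings concrete, here is a minimal sketch that works purely on a list of per-instance labels and returns dataset indices. The function names (`build_label_index`, `sample_pair`, `sample_k_with_test`) are hypothetical and not part of any existing API, and the rejection loop in Sampling 2 is just one simple way to satisfy the constraint, not necessarily the most efficient one.

```python
import random
from collections import defaultdict


def build_label_index(labels):
    """Map each label to the list of dataset indices that carry it."""
    index = defaultdict(list)
    for i, label in enumerate(labels):
        index[label].append(i)
    return dict(index)


def sample_pair(labels, same_class, rng=random):
    """Sampling 1: two distinct indices, from one class or from two distinct classes."""
    index = build_label_index(labels)
    if same_class:
        # Only classes with at least two instances can yield a same-class pair.
        candidates = [c for c, idxs in index.items() if len(idxs) >= 2]
        if not candidates:
            raise ValueError("no class has two or more instances")
        i, j = rng.sample(index[rng.choice(candidates)], 2)
    else:
        if len(index) < 2:
            raise ValueError("need at least two classes")
        c1, c2 = rng.sample(list(index), 2)
        i, j = rng.choice(index[c1]), rng.choice(index[c2])
    return i, j


def sample_k_with_test(labels, k, rng=random, max_tries=1000):
    """Sampling 2: k distinct indices plus a test index whose class occurs in the subset."""
    n = len(labels)
    if not 1 <= k < n:
        raise ValueError("need 1 <= k < len(labels)")
    for _ in range(max_tries):
        subset = rng.sample(range(n), k)
        chosen = set(subset)
        subset_labels = {labels[i] for i in subset}
        # Valid test images lie outside the subset but share a class with it.
        candidates = [
            i for i in range(n)
            if i not in chosen and labels[i] in subset_labels
        ]
        if candidates:
            return subset, rng.choice(candidates)
    raise RuntimeError("could not satisfy the class constraint")
```

Both functions only need the labels, which is why access to labels on the wrapped dataset becomes the central question further down.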
Approach
Borrowing the idea from @fmassa's comment at #323, and in a similar spirit to the `ConcatDataset` class, we would need another wrapper, say `MultiDataset`.
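A rough sketch of what such a wrapper could look like is given below. This is only my assumption of the interface: the `index_tuples` argument is hypothetical, and a real version would presumably subclass `torch.utils.data.Dataset`. I keep it as a plain class here so the snippet runs without PyTorch installed.

```python
class MultiDataset:
    """Hypothetical wrapper: each item is a tuple of items from the wrapped dataset."""

    def __init__(self, dataset, index_tuples):
        # dataset: anything supporting integer indexing (e.g. a map-style dataset)
        # index_tuples: iterable of index tuples, one tuple per returned item
        self.dataset = dataset
        self.index_tuples = [tuple(t) for t in index_tuples]

    def __len__(self):
        return len(self.index_tuples)

    def __getitem__(self, idx):
        # Fetch every wrapped instance referenced by the idx-th tuple.
        return tuple(self.dataset[i] for i in self.index_tuples[idx])


# Usage with a toy (image, label) dataset
base = [("img_a", 0), ("img_b", 0), ("img_c", 1)]
pairs = MultiDataset(base, [(0, 1), (0, 2)])
first_pair = pairs[0]
```

The index tuples could be precomputed or drawn lazily by a constrained sampler; this sketch only covers the precomputed case.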
Tricky Parts
The above higher-order abstraction is a good approach, but generalizing such a dataset poses a few challenges. Since we would want to wrap around an existing dataset, we would require standardization of the member fields of the dataset classes, especially for tasks where labels are involved. Alternatively, the dataset classes could also implement a `get_labels()` method, which returns a list of labels, and a `get_label_instances()` method, which allows accessing the instances for a particular label.
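For illustration, a dataset exposing these two accessors might look like the sketch below. The class name `LabelledDataset` and its constructor are hypothetical, and I am assuming that `get_labels()` returns one label per instance in dataset order, which is only one possible reading of the proposal.

```python
from collections import defaultdict


class LabelledDataset:
    """Hypothetical dataset exposing get_labels() and get_label_instances()."""

    def __init__(self, samples):
        # samples: iterable of (instance, label) pairs
        self.samples = list(samples)
        self._by_label = defaultdict(list)
        for i, (_, label) in enumerate(self.samples):
            self._by_label[label].append(i)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

    def get_labels(self):
        # Assumption: one label per instance, in dataset order.
        return [label for _, label in self.samples]

    def get_label_instances(self, label):
        # .get avoids creating empty entries for unknown labels.
        return [self.samples[i][0] for i in self._by_label.get(label, [])]


ds = LabelledDataset([("a", 0), ("b", 1), ("c", 0)])
```

With this in place, a wrapper would only depend on these two methods rather than on dataset-specific member fields.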
This seems like a not-so-clean approach, and I am really looking for cleaner ideas. Perhaps I am missing something that would allow this to be implemented cleanly?