Dataset transforms to sample a set from data #338

Open
activatedgeek opened this issue Nov 21, 2017 · 2 comments

Comments

@activatedgeek
Contributor

NOTE: I am creating this issue as a discussion ground for the proposal.

Requirements

Given a dataset, we must be able to sample instance sets under certain constraints. For instance, given a dataset of images and their class labels, consider the following two samplings.

Sampling 1 - Sample a pair of images from two distinct classes or a pair of images from the same class.

Sampling 2 - Sample a set of k images from the dataset along with another image to test this k-subset against (I'll spare the details of what exactly "testing against" means). The constraint here is that the test image should be from a class that exists in the initially sampled k-subset. An alternative view would be to sample k+1 images from the dataset such that at least 2 images are from the same class, and use one of those images as the test image.
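To make the two samplings concrete, here is a minimal sketch in plain Python that works on a list of labels only and returns dataset indices. All function names (`build_label_index`, `sample_pair`, `sample_support_and_query`) are hypothetical and not part of torchvision. Sampling 2 follows the "alternative view": it first picks two images of one class, then fills the rest of the k-subset from the remaining data. It assumes at least one class has two or more instances, and it makes no attempt to be uniform over all valid sets.

```python
import random
from collections import defaultdict


def build_label_index(labels):
    """Map each label to the list of dataset indices that carry it."""
    index = defaultdict(list)
    for i, label in enumerate(labels):
        index[label].append(i)
    return dict(index)


def sample_pair(label_index, same_class, rng=random):
    """Sampling 1: two distinct indices from the same class or from two distinct classes."""
    if same_class:
        # Only classes with at least two instances can provide a same-class pair.
        candidates = [l for l, idx in label_index.items() if len(idx) >= 2]
        label = rng.choice(candidates)
        i, j = rng.sample(label_index[label], 2)
    else:
        a, b = rng.sample(list(label_index), 2)
        i, j = rng.choice(label_index[a]), rng.choice(label_index[b])
    return i, j


def sample_support_and_query(label_index, k, rng=random):
    """Sampling 2: k support indices plus one query index whose class occurs in the support."""
    all_indices = [i for idx in label_index.values() for i in idx]
    candidates = [l for l, idx in label_index.items() if len(idx) >= 2]
    label = rng.choice(candidates)
    # One image becomes the query, the other guarantees its class is in the support set.
    query, anchor = rng.sample(label_index[label], 2)
    rest = [i for i in all_indices if i not in (query, anchor)]
    support = [anchor] + rng.sample(rest, k - 1)
    rng.shuffle(support)
    return support, query
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible; the returned indices can then be used to fetch the actual images from any indexable dataset.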

If you are not convinced why the above kinds of samplings might be needed, I can provide references to representative literature.

Approach

Borrowing the idea from @fmassa's comment at #323, and in a similar spirit to the ConcatDataset class, we would need another wrapper, say MultiDataset.
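As a rough illustration of what such a wrapper could look like, here is a sketch in plain Python. Only the name `MultiDataset` comes from the proposal; the constructor arguments (`index_sampler`, `length`, `seed`) are assumptions made for this example. In PyTorch the class would presumably subclass `torch.utils.data.Dataset`, which is omitted here to keep the sketch dependency-free.

```python
import random


class MultiDataset:
    """Hypothetical wrapper: each item is a tuple of items drawn from the wrapped dataset.

    `index_sampler` is an assumed callable that takes a `random.Random`
    instance and returns a sequence of indices into `dataset`.
    """

    def __init__(self, dataset, index_sampler, length, seed=None):
        self.dataset = dataset
        self.index_sampler = index_sampler
        self.length = length  # nominal number of sets per epoch
        self.rng = random.Random(seed)

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # The position i is only bounds-checked; a fresh set is drawn on every access.
        indices = self.index_sampler(self.rng)
        return tuple(self.dataset[j] for j in indices)


# Usage: unconstrained pairs from a toy dataset of (sample, label) tuples.
data = [("img%d" % n, n % 3) for n in range(9)]
pairs = MultiDataset(data, lambda rng: rng.sample(range(len(data)), 2), length=5, seed=0)
first, second = pairs[0]
```

Keeping the constraint logic in `index_sampler` leaves the wrapper itself generic; the difficulty discussed next is that a label-aware sampler needs to know the labels of the wrapped dataset.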

Tricky Parts

The above higher-order abstraction is a good approach, but there are a few challenges in generalizing such a dataset. Since we would want to wrap around an existing dataset, we will require standardization of the member fields of the dataset classes, especially for tasks where labels are involved. Or perhaps the dataset classes must also implement a get_labels() method, which returns a list of labels, and a get_label_instances() method, which allows accessing the instances for a particular label.
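One way to picture the two methods without touching existing dataset classes is an adapter that builds the label index itself. The sketch below is an assumption-laden illustration: the class name `LabelIndexed` is hypothetical, it assumes every item is a `(sample, label)` tuple, and it interprets `get_labels()` as returning the distinct labels. It also loads every item once at construction time, which could be expensive for image datasets.

```python
class LabelIndexed:
    """Hypothetical adapter exposing get_labels() and get_label_instances()."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._by_label = {}
        # One full pass over the dataset; assumes dataset[i] returns (sample, label).
        for i in range(len(dataset)):
            _, label = dataset[i]
            self._by_label.setdefault(label, []).append(i)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        return self.dataset[i]

    def get_labels(self):
        """Distinct labels, in order of first appearance."""
        return list(self._by_label)

    def get_label_instances(self, label):
        """All items carrying `label`; empty list for an unknown label."""
        return [self.dataset[i] for i in self._by_label.get(label, [])]


# Usage on a toy dataset.
toy = [("x0", "cat"), ("x1", "dog"), ("x2", "cat")]
view = LabelIndexed(toy)
cats = view.get_label_instances("cat")
```

An adapter like this avoids standardizing member fields, but it still depends on the `(sample, label)` item convention, so it does not fully remove the standardization problem.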

This seems like a not-so-clean approach, and I am really looking for cleaner ideas. Perhaps I am missing something that would allow a clean implementation?

@vfdev-5
Collaborator

vfdev-5 commented Dec 2, 2017

@activatedgeek did you make any progress on this? I was recently looking for something like this while trying to reproduce one-shot learning evaluation on Omniglot (the same task as yours, I suppose) and to extend the approach to another dataset. Here is my code for a same/different pairs dataset if you would like to take a look. There is also a Keras implementation doing the same thing.

@activatedgeek
Contributor Author

activatedgeek commented Dec 4, 2017

Hey @vfdev-5, thank you for this. I was actually hoping that the core maintainers would comment on this, but I guess everybody is busy with NIPS. Building a generic wrapper would require some standardization of how datasets are written, in terms of the methods they expose. Your implementation is quite specific (which is in fact what I had done earlier as well in #323, but then later removed in the interest of composition).

rajveerb pushed a commit to rajveerb/vision that referenced this issue Nov 30, 2023