ResNet strikes back: What about training set? #903
-
Not a follow-up, but related to #901: I wonder what would happen if there were some other subset of ImageNet with different images and perhaps even different classes. Would the architecture and training recipe need to be jointly optimized with that as well? Would we then be searching for

$$\operatorname*{argmax}_{A,\,R,\,D}\ \mathrm{score}(A, R, D),$$

where I introduce $D$ for the dataset? And if we want to argue that no, then to what extent is the problem separable, i.e. is

$$\operatorname*{argmax}_{A,\,R}\ \mathrm{score}(A, R, D)$$

roughly independent of the choice of $D$? In the paper you do validate on different variants of ImageNet, but this may be entirely different from training on different datasets. Ideally, we'd hope that v2 (the separable formulation) is closer to reality, because that might be expected to bode well for the applicability to down-stream tasks with different data domains (and this relates to the discussion I linked at the top).
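To make concrete what I mean, here's a toy sketch of the two searches (all names are hypothetical; `validation_score` stands in for a full train-and-evaluate run):

```python
import random
from itertools import product

def validation_score(arch, recipe, dataset):
    # Placeholder for actually training (arch, recipe) on dataset and
    # measuring validation accuracy
    random.seed(hash((arch, recipe, dataset)))
    return random.random()

archs = ['resnet50', 'vit_base_patch16_224']
recipes = ['A1', 'A2', 'A3']                # e.g. the paper's three recipes
datasets = ['imagenet-1k', 'other_subset']  # hypothetical alternative subset

# v1: the dataset is part of the joint search
best_joint = max(product(archs, recipes, datasets),
                 key=lambda ard: validation_score(*ard))

# v2 (separable): search (arch, recipe) on one dataset and hope it transfers
best_pair = max(product(archs, recipes),
                key=lambda ar: validation_score(*ar, 'imagenet-1k'))
```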
-
@alexander-soare Unfortunately I don't believe that is the case. I believe the general outline of the featured recipes would work well for many combinations of arch + dataset, but there would need to be adjustment for an optimal outcome on a different dataset, just as there would be for a different architecture. We didn't test that explicitly here, but I have observed it in the past, and I've had conversations with others who have.

Thinking of concrete examples I have in my head: for the 'How to train your ViT' paper, the pretraining recipe for ImageNet-21k was lower aug than the optimum for 1k from scratch. The observation there was that increased augreg could roughly make up for an order of magnitude in dataset size. I went nuts at one point in that exploration and trained Pets for what would be the equivalent of 700 ImageNet epochs in terms of images seen, but with VERY high augreg (esp. aug). This was for a ViT model, but it actually kept learning and started getting closer to transfer or from-scratch CNN results (CNNs being much faster to learn from smaller data). That didn't make the paper cut.

Other factors I could see having an impact: the number of classes, noisiness of the labels, statistics of the images themselves (how similar are they? are the features that differentiate the classes coarse or fine-grained?), and whether the classes are balanced.
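Roughly, that kind of high-augreg setup looks something like this with timm's data utilities (the magnitudes below are illustrative guesses, not the exact settings from that experiment):

```python
import timm
from timm.data import create_transform, Mixup

# Heavy RandAugment + color jitter + Random Erasing, well above what a
# small dataset like Oxford-IIIT Pets (37 classes) would normally get
train_transform = create_transform(
    input_size=224,
    is_training=True,
    auto_augment='rand-m9-mstd0.5-inc1',  # high-magnitude RandAugment policy
    color_jitter=0.4,
    re_prob=0.5,       # Random Erasing on half of all crops
    re_mode='pixel',
)

# Mixup/CutMix + label smoothing as the regularization side of "augreg"
mixup_fn = Mixup(
    mixup_alpha=0.8,
    cutmix_alpha=1.0,
    label_smoothing=0.1,
    num_classes=37,
)

model = timm.create_model(
    'vit_small_patch16_224',
    pretrained=False,     # from scratch, as in the experiment described above
    num_classes=37,
    drop_path_rate=0.1,   # stochastic depth, another regularizer to scale up
)
```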