ResNet strikes back: What about training set? #903
-
Not a follow-up, but related to #901: I wonder what would happen if there were some other subset of ImageNet with different images and perhaps even different classes. Would the architecture and training recipe need to be jointly optimized with that as well? Would we then be searching for

$$\operatorname*{argmax}_{A,\,R,\,D}\ \mathrm{score}(A, R, D),$$

where I introduce $D$ for the dataset? And if we want to argue that no, then to what extent is the problem separable, i.e. is

$$\operatorname*{argmax}_{A,\,R}\ \mathrm{score}(A, R, D)$$

roughly independent of the choice of $D$? In the paper you do validate on different variants of ImageNet, but this may be entirely different from training on different datasets. Ideally, we'd hope that v2 (the separable formulation) is closer to reality, because that might be expected to bode well for the applicability to down-stream tasks with different data domains (and this relates to the discussion I linked at the top).
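To make concrete what I mean, here's a toy sketch of the two searches (all names are hypothetical; `validation_score` stands in for a full train-and-evaluate run):

```python
import random
from itertools import product

def validation_score(arch, recipe, dataset):
    # Placeholder for actually training (arch, recipe) on dataset and
    # measuring validation accuracy
    random.seed(hash((arch, recipe, dataset)))
    return random.random()

archs = ['resnet50', 'vit_base_patch16_224']
recipes = ['A1', 'A2', 'A3']                # e.g. the paper's three recipes
datasets = ['imagenet-1k', 'other_subset']  # hypothetical alternative subset

# v1: the dataset is part of the joint search
best_joint = max(product(archs, recipes, datasets),
                 key=lambda ard: validation_score(*ard))

# v2 (separable): search (arch, recipe) on one dataset and hope it transfers
best_pair = max(product(archs, recipes),
                key=lambda ar: validation_score(*ar, 'imagenet-1k'))
```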
-
@alexander-soare Unfortunately I don't believe that is the case. I believe the general outline of the featured recipes would work well for many combinations of arch + dataset, but there would need to be adjustment for an optimal outcome on a different dataset, just as there would be for a different architecture. We didn't test that explicitly here, but I have observed it in the past, and I've had conversations with others who have.

Thinking of concrete examples I have in my head: for the 'How to train your ViT' paper, the pretraining recipe for ImageNet-21k was lower aug than the optimum for 1k from scratch. The observation there was that increased augreg could roughly make up for an order of magnitude in dataset size. I went nuts at one point in that exploration and trained Pets for what would be the equivalent of 700 ImageNet epochs in terms of images seen, but with VERY high augreg (esp. aug). This was for a ViT model, but it actually kept learning and started getting closer to transfer or from-scratch CNN results (CNNs being much faster to learn from smaller data). That didn't make the paper cut.

Other factors I could see having an impact: the number of classes, noisiness of the labels, statistics of the images themselves (how similar are they? are the features that differentiate the classes coarse or fine-grained?), and whether the classes are balanced.
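Roughly, that kind of high-augreg setup looks something like this with timm's data utilities (the magnitudes below are illustrative guesses, not the exact settings from that experiment):

```python
import timm
from timm.data import create_transform, Mixup

# Heavy RandAugment + color jitter + Random Erasing, well above what a
# small dataset like Oxford-IIIT Pets (37 classes) would normally get
train_transform = create_transform(
    input_size=224,
    is_training=True,
    auto_augment='rand-m9-mstd0.5-inc1',  # high-magnitude RandAugment policy
    color_jitter=0.4,
    re_prob=0.5,       # Random Erasing on half of all crops
    re_mode='pixel',
)

# Mixup/CutMix + label smoothing as the regularization side of "augreg"
mixup_fn = Mixup(
    mixup_alpha=0.8,
    cutmix_alpha=1.0,
    label_smoothing=0.1,
    num_classes=37,
)

model = timm.create_model(
    'vit_small_patch16_224',
    pretrained=False,     # from scratch, as in the experiment described above
    num_classes=37,
    drop_path_rate=0.1,   # stochastic depth, another regularizer to scale up
)
```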