Recent changes to transforms v2 #7384
Comments
> The repo description says: […]

We've had many of these discussions already, @vadimkantorov. We're aware you are interested in having support for color-space conversions, and we have explained our reasoning for not doing so many times. Our position hasn't changed since #4029. Every feature addition is a matter of trade-offs, and color spaces don't make the cut at this time. I don't think we'll add a sentence to the torchvision description just to clarify that its goal is not to support every single popular image transformation out there.

I remember your position. I'm only asking to put a somewhat more elaborate view of the torchvision conception / plans in the broader sense in the README (even outside of the discussion about color-space transforms) :) That's all. I'm sure these discussions have already been conducted and you have such plans, so it's more about making these plans / roadmap / vision more visible / discoverable in the README :)
After the initial publication of the blog post for transforms v2, we made some changes to the API:
We renamed `Feature` to `Datapoint` and changed the namespace from `torchvision.features` to `torchvision.datapoints` accordingly.
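For orientation, here is a minimal sketch of the new naming; the tensor shape is just an example:

```python
import torch
from torchvision import datapoints  # formerly torchvision.features

# Wrap a plain tensor so that transforms v2 treats it as an image datapoint.
img = datapoints.Image(torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8))
```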
We added a fallback heuristic for plain tensors: if there is an explicit image or video (`datapoints.Image`, `datapoints.Video`, or `PIL.Image.Image`) in the input sample, all other plain tensors are passed through. If there is no explicit image or video, only the first plain tensor will be treated as an image. The order is defined by traversing depth-first through the input sample, which is compatible with all torchvision datasets and should also work well for the vast majority of datasets out there.
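To illustrate the heuristic, a rough sketch; the sample keys are made up, and `v2.RandomHorizontalFlip` stands in for any v2 transform:

```python
import torch
from torchvision import datapoints
from torchvision.transforms import v2

transform = v2.RandomHorizontalFlip(p=1.0)

# An explicit image is present, so the plain "meta" tensor is passed through.
sample = {"image": datapoints.Image(torch.rand(3, 32, 32)), "meta": torch.arange(4)}
flipped = transform(sample)

# No explicit image: the first plain tensor found depth-first ("pixels") is
# treated as the image, and the remaining plain tensors are passed through.
sample = {"pixels": torch.rand(3, 32, 32), "meta": torch.arange(4)}
flipped = transform(sample)
```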
We removed the `color_space` metadata from `datapoints.Image` and `datapoints.Video`, as well as the general `ConvertColorSpace` conversion transform and the corresponding functionals. This was done for three reasons:

- For grayscale use cases we already have `Grayscale` and `RandomGrayscale`, and so far they seem to be sufficient. Apart from `ConvertColorSpace`, no other transform in v2 relied on the attribute.
- We acknowledge that of course there are use cases for color space conversions in a general CV library, but that doesn't apply to `torchvision`.
- For decoding directly into a given color space, `torchvision.io` offers the `ImageReadMode` enum, which handles this all on the C level with the highly optimized routines of the decoding libraries we build against.
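For example, decoding straight into the desired color space looks roughly like this (`read_image` and `ImageReadMode` are the existing `torchvision.io` APIs; the file path is a placeholder):

```python
from torchvision.io import ImageReadMode, read_image

# Decode directly into the target color space instead of converting afterwards.
rgb = read_image("example.jpg", mode=ImageReadMode.RGB)    # 3 x H x W, uint8
gray = read_image("example.jpg", mode=ImageReadMode.GRAY)  # 1 x H x W, uint8
```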
Some transforms, `Normalize` being the most prominent here, returned plain tensors instead of `datapoints.Image`'s or `datapoints.Video`'s. We dropped that in favor of preserving the original type everywhere (i.e. they now return `datapoints.Image`'s or `datapoints.Video`'s), for two reasons:

- After normalization, an image that was previously in `[0, 1]` can now be in an arbitrary range, e.g. `[-2, 3]`. By returning a tensor instead of an Image, we wanted to convey the sense that it's not clear whether the image is still RGB. However, we realized that this didn't add any safety, since plain tensors fall back to being transformed as image or video anyway.
- On top of that, while a lot of transforms make an assumption about the range of an image (0-1 or 0-255), this assumption is embedded in the dtype of the image, not its type. Returning tensors would only change the type, not the dtype, and so wouldn't prevent the assumption from being applied anyway.
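A small sketch of the new behavior (the ImageNet statistics are used purely as an example):

```python
import torch
from torchvision import datapoints
from torchvision.transforms import v2

img = datapoints.Image(torch.rand(3, 224, 224))  # float image in [0, 1]
out = v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])(img)

print(type(out))         # still a datapoints.Image, not a plain tensor
print(out.min().item())  # values may now lie well outside [0, 1]
```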
The geometric transformation functionals (`F.affine`, `F.rotate`, `F.perspective`, `F.elastic`) as well as cropping (`F.crop`) now clamp bounding boxes before returning them. Note that this does not remove bounding boxes that are fully outside the image. See the next point for that.
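For intuition, the clamping amounts to something like the following sketch in plain PyTorch (this is not the torchvision implementation, and it assumes `XYXY` boxes):

```python
import torch

def clamp_xyxy(boxes: torch.Tensor, height: int, width: int) -> torch.Tensor:
    # Clip box coordinates to the image canvas. Boxes that end up fully outside
    # the image are NOT removed here; that is handled separately (next point).
    boxes = boxes.clone()
    boxes[:, 0::2] = boxes[:, 0::2].clamp(0, width)   # x1, x2
    boxes[:, 1::2] = boxes[:, 1::2].clamp(0, height)  # y1, y2
    return boxes
```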
We added a `SanitizeBoundingBox` transform that removes degenerate bounding boxes, for example bounding boxes that are fully outside the image after cropping, as well as the corresponding labels and optionally masks. It should be sufficient to have a single one of these transforms at the end of the pipeline, but it can also be used multiple times throughout. This sanitization was removed from transformations that previously had it built in, e.g. `RandomIoUCrop`.
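A sketch of the intended usage, with the sanitization step placed once at the end of a detection pipeline (transform names follow this post; later releases may spell them slightly differently, e.g. `SanitizeBoundingBoxes`):

```python
from torchvision.transforms import v2

pipeline = v2.Compose([
    v2.RandomIoUCrop(),              # no longer sanitizes boxes on its own
    v2.RandomHorizontalFlip(p=0.5),
    v2.SanitizeBoundingBox(),        # drops degenerate boxes and matching labels
])
```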
None of the above should affect the UX in a negative way. Unfortunately, there are also a few things that didn't make the initial cut:
`RandomCutMix`, `RandomMixUp`, and `SimpleCopyPaste` all operate on a batch of samples. This doesn't fit the canonical way of passing the transforms to the `torchvision.datasets`, since there they will be applied on a per-sample level. Thus, they will have to be used after batching is done, either in a custom collation function or separately from the data loader. In any case, using the default collation function loses the datapoint subclass, and so the sample needs to be re-wrapped before being passed into the transform. In their current state, these transforms barely improve on the current workflow (i.e. relying on the implementation in our training references). We're trying to come up with a significant workflow improvement before releasing these transforms to a wide range of users.
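A rough sketch of the "apply after batching" pattern described above; the dict keys and the `batch_transform` placeholder are assumptions, and the batch-level transforms themselves currently live in `torchvision.prototype.transforms`:

```python
from torch.utils.data import DataLoader, default_collate
from torchvision import datapoints

def make_collate_fn(batch_transform):
    def collate_fn(samples):
        batch = default_collate(samples)                   # returns plain tensors
        batch["image"] = datapoints.Image(batch["image"])  # re-wrap the datapoint
        return batch_transform(batch)                      # e.g. a CutMix-style op
    return collate_fn

# loader = DataLoader(dataset, batch_size=8, collate_fn=make_collate_fn(batch_transform))
```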
`FixedSizeCrop` is the same as `RandomCrop`, but with a slightly different padding strategy in case the crop size is larger than the input. Although it is a 1-to-1 replica from a research paper, we feel it makes little sense to have both at the same time. Since `RandomCrop` is already present in v1, we kept it. Note that, similar to `RandomIoUCrop`, `FixedSizeCrop` had the bounding box sanitization built in, while `RandomCrop` does not.
`datapoints.Label` and `datapoints.OneHotLabel`: these datapoints were needed for `RandomCutMix` and `RandomMixUp` as well as for the sanitization behavior of `RandomIoUCrop` and `FixedSizeCrop`. Since we are not releasing the former just yet and the new fallback heuristic allows us to pass plain tensors as images, the label datapoints currently don't have a use case. Another reason we removed the Label class is that it was not really clear whether a label referred to a `datapoints.BoundingBox`, a `datapoints.Mask`, or a `datapoints.Image`; there was nothing that structurally enforced that in our API. So each transform would make its own assumption about what the labels correspond to, and that could quickly lead to conflicts. Instead, we have decided to remove the Label class altogether and to always pass through labels (as plain tensors) in all transforms. The assumption about "what does the label correspond to" is now encapsulated in the `SanitizeBoundingBox` transform, which lets users manually specify the mapping. This avoids all other transforms having to make assumptions about whether they should be transforming the labels or not, and simplifies the mental model.
`PermuteDimensions`, `TransposeDimensions`, and the `temporal_dim` parameter on `UniformTemporalSubsample`: these were introduced to improve the UX for video users, since the transformations expect videos in `*TCHW` format, while our models expect `CTHW`. However, this violates the assumptions about the format that we make for all transformations, meaning these transformations could only ever come at the end of a pipeline. Thus, we require users to call `video.transpose(-4, -3)` themselves for now.
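A minimal sketch of that workaround, assuming a downstream model that consumes `CTHW` clips:

```python
import torch
from torchvision import datapoints
from torchvision.transforms import v2

clip = datapoints.Video(torch.rand(16, 3, 112, 112))  # T, C, H, W
clip = v2.UniformTemporalSubsample(8)(clip)           # transforms keep T, C, H, W

# Swap to C, T, H, W at the very end of the pipeline, right before the model.
clip = clip.transpose(-4, -3)
```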
To be clear, we didn't remove this functionality. It is still available under `torchvision.prototype`. We want to add the batch transformations to the API, but haven't figured out a way to do it without making the API inconsistent in general. The others are less clear and need a more general discussion first. Please stay tuned for any updates here.

cc @vfdev-5