
Recent changes to transforms v2 #7384

pmeier opened this issue Mar 3, 2023 · 3 comments

pmeier (Collaborator) commented Mar 3, 2023

After the initial publication of the blog post for transforms v2, we made some changes to the API:

  • We have renamed our tensor subclasses from Feature to Datapoint and changed the namespace from torchvision.features to torchvision.datapoints accordingly.
  • We have changed the fallback heuristic for plain tensors: previously, any plain tensor input was treated as an image and transformed as such. However, this was too limiting, since it prevented passing non-image data as plain tensors, which in theory should simply be passed through the transforms. The new heuristic is as follows: if we find an explicit image or video (datapoints.Image, datapoints.Video, PIL.Image.Image) in the input sample, all other plain tensors are passed through. If there is no explicit image or video, only the first plain tensor is treated as an image. The order is defined by traversing the input sample depth-first, which is compatible with all torchvision datasets and should also work well for the vast majority of datasets out there. A minimal sketch of this heuristic follows this list.
  • We have removed the color_space metadata from datapoints.Image and datapoints.Video as well as the general ConvertColorSpace conversion transform and corresponding functionals. This was done for two reasons:
    1. There is no apparent need for it. v1 comprises Grayscale and RandomGrayscale and so far they seem to be sufficient. Apart from ConvertColorSpace, no other transform in v2 relied on the attribute. We acknowledge that of course there are use cases for color space conversions in a general CV library, but that doesn't apply to torchvision.
    2. It is inefficient. Instead of reading an image in its native color space and converting it afterwards, at the tensor level, to the color space we want, torchvision.io offers the ImageReadMode enum, which handles all of this at the C level with the highly optimized routines of the decoding libraries we build against.
  • Some transforms, with Normalize being the most prominent, returned plain tensors instead of datapoints.Image's or datapoints.Video's. We dropped that in favor of preserving the original type everywhere (i.e. they now return datapoints.Image's or datapoints.Video's), for two reasons:
    1. Returning a tensor was originally chosen in order to add an extra layer of security: after the image is normalized, its range becomes non-standard, so an RGB image in [0, 1] can now be in an arbitrary range, e.g. [-2, 3]. By returning a tensor instead of an Image, we wanted to convey that it is no longer clear whether the image is still RGB. However, we realized that this didn't add any security, since plain tensors fall back to being transformed as an image or video anyway. On top of that, while a lot of transforms make an assumption about the range of an image (0-1, 0-255), this assumption is embedded in the dtype of the image, not its type. Returning tensors would only change the type, not the dtype, and so wouldn't prevent the assumption from being applied anyway.
    2. With the new fallback heuristic, this could even lead to problems when you have plain tensors before the explicit image or video in the sample.
  • Transformations that potentially partially or completely remove objects from the image, i.e. the affine transformations (F.affine, F.rotate, F.perspective, F.elastic) as well as cropping (F.crop), now clamp bounding boxes before returning them. Note that this does not remove bounding boxes that are fully outside the image. See the next point for that.
  • We introduced the SanitizeBoundingBox transform that removes degenerate bounding boxes, for example bounding boxes that are fully outside the image after cropping, as well as the corresponding labels and optionally masks. It should be sufficient to have a single instance of this transform at the end of the pipeline, but it can also be used multiple times throughout. This sanitization was removed from transformations that previously had it built in, e.g. RandomIoUCrop. A sketch of such a pipeline follows this list.
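
Below is a minimal sketch of the new fallback heuristic. It assumes the beta v2 API as it exists at the time of writing (torchvision.datapoints and torchvision.transforms.v2); exact namespaces and names may change in later releases.

```python
# Minimal sketch of the plain-tensor fallback heuristic described above.
# Assumes the beta v2 API (torchvision.datapoints, torchvision.transforms.v2).
import torch
from torchvision import datapoints
from torchvision.transforms import v2 as transforms

transform = transforms.RandomHorizontalFlip(p=1.0)

# Case 1: the sample contains an explicit image, so the extra plain tensor
# is passed through untouched.
sample = {
    "image": datapoints.Image(torch.rand(3, 32, 32)),
    "aux": torch.arange(10),  # plain tensor -> passed through unchanged
}
flipped = transform(sample)

# Case 2: there is no explicit image or video, so the first plain tensor
# (in depth-first order) is treated as the image and gets flipped.
sample = {
    "img": torch.rand(3, 32, 32),  # treated as the image
    "aux": torch.arange(10),       # passed through unchanged
}
flipped = transform(sample)
```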

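And a sketch of a detection pipeline that ends with SanitizeBoundingBox, as recommended above. The constructor arguments used for BoundingBox and SanitizeBoundingBox (format, spatial_size, how the labels are located) are assumptions based on the beta API and may differ between releases.

```python
# Sketch of a detection pipeline with the sanitization step at the end.
# Constructor arguments are assumptions from the beta v2 API.
import torch
from torchvision import datapoints
from torchvision.transforms import v2 as transforms

pipeline = transforms.Compose([
    transforms.RandomIoUCrop(),             # may crop boxes partially or fully out
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.SanitizeBoundingBox(),       # drops degenerate boxes and their labels
])

sample = {
    "image": datapoints.Image(torch.rand(3, 256, 256)),
    "boxes": datapoints.BoundingBox(
        [[10.0, 10.0, 50.0, 50.0], [200.0, 200.0, 250.0, 250.0]],
        format="XYXY",
        spatial_size=(256, 256),
    ),
    "labels": torch.tensor([1, 2]),
}
out = pipeline(sample)
```
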
None of the above should affect the UX in a negative way. Unfortunately, there are also a few things that didn't make the initial cut:

  • Batch transformations: RandomCutMix, RandomMixUp, and SimpleCopyPaste all operate on a batch of samples. This doesn't fit the canonical way of passing transforms to the torchvision.datasets, since there they are applied on a per-sample level. Thus, they have to be used after batching is done, either in a custom collation function or separate from the data loader. In any case, using the default collation function loses the datapoint subclass, so the sample needs to be re-wrapped before being passed into the transform (see the sketch after this list). In their current state, these transforms barely improve on the current workflow, i.e. relying on the implementation in our training references. We're trying to come up with a significant workflow improvement before releasing these transforms to a wide range of users.
  • FixedSizeCrop is the same as RandomCrop, but with a slightly different padding strategy in case the crop size is larger than the input. Although it is a 1-to-1 replica from a research paper, we feel it makes little sense to have both at the same time. Since RandomCrop is already present in v1, we kept it. Note that, similar to RandomIoUCrop, FixedSizeCrop had the bounding box sanitization built in, while RandomCrop does not.
  • datapoints.Label and datapoints.OneHotLabel: These datapoints were needed for RandomCutMix and RandomMixUp as well as for the sanitization behavior of RandomIoUCrop and FixedSizeCrop. Since we are not releasing the former just yet and the new fallback heuristic allows us to pass plain tensors as images, the label datapoints currently don't have a use case. Another reason we removed the Label class is that it was never really clear whether a label referred to a datapoints.BoundingBox, a datapoints.Mask, or a datapoints.Image - there was nothing that structurally enforced that in our API. So each transform would make its own assumption about what the labels correspond to, and that could quickly lead to conflicts. Instead, we have decided to remove the Label class altogether and to always pass labels through (as plain tensors) in all transforms. The assumption about “what does the label correspond to” is now encapsulated in the SanitizeBoundingBox transform, which lets users manually specify the mapping. This avoids all other transforms having to make assumptions about whether they should be transforming the labels or not and simplifies the mental model.
  • PermuteDimensions, TransposeDimensions, and the temporal_dim parameter on UniformTemporalSubsample: These were introduced to improve the UX for video users, since the transformations expect videos in *TCHW format, while our models expect CTHW. However, this violates the format assumptions that we make for all transformations, meaning these transformations can only ever come at the end of a pipeline. Thus, we require users to call video.transpose(-4, -3) themselves for now.
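
Below is a minimal sketch of the re-wrapping step mentioned in the first bullet above. The in-memory sample list is a stand-in for a real dataset, and the batch transform is only indicated in a comment; it is not one of the released v2 transforms.

```python
# Sketch of the re-wrapping needed for batch transforms: default_collate
# returns a plain tensor batch (the Image subclass is lost), so it has to be
# re-wrapped before a batch-level transform can dispatch on it.
import torch
from torch.utils.data import DataLoader, default_collate
from torchvision import datapoints

# stand-in dataset: 16 (image, label) samples
samples = [
    (datapoints.Image(torch.rand(3, 32, 32)), torch.tensor(i % 10))
    for i in range(16)
]

def collate_and_rewrap(batch):
    images, labels = default_collate(batch)    # images is a plain torch.Tensor here
    return datapoints.Image(images), labels    # re-wrap so v2 transforms recognize it

loader = DataLoader(samples, batch_size=8, collate_fn=collate_and_rewrap)
for images, labels in loader:
    # apply a batch transform here, e.g. a CutMix/MixUp-style transform from
    # torchvision.prototype, once it is released
    pass
```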

To be clear, we didn't remove this functionality. It is still available under torchvision.prototype. We want to add the batch transformations to the API, but haven't figured out a way to do so without making the API inconsistent in general. The others are less clear and need a more general discussion first. Please stay tuned for updates here.

cc @vfdev-5

vadimkantorov commented Mar 13, 2023

  1. We acknowledge that of course there are use cases for color space conversions in a general CV library, but that doesn't apply to torchvision.

The repo description says: "The torchvision package consists of popular datasets, model architectures, and common image transformations for computer vision." Color transformations are IMO certainly very "common image transformations for computer vision". It would be nice if the repo description were expanded to reflect the owning team's current conception of what torchvision is/should be and what it should not be :) It would have saved many out-of-scope discussions :) This question has been raised many times, and some brain dump in the repo description on how torchvision positions itself in the PyTorch computer vision frameworks/libraries ecosystem would be useful IMO.

NicolasHug (Member) commented Mar 14, 2023

We've had many of these discussions already, @vadimkantorov. We're aware that you are interested in having support for color-space conversions, and we have explained our reasoning for not adding it many times. Our position hasn't changed since #4029.

Every feature addition is a matter of trade-offs, and color-spaces don't make the cut at this time. I don't think we'll add a sentence to the torchvision description just to clarify that its goal is not to support every single popular image transformation out there.

vadimkantorov commented Mar 14, 2023

I remember your position. I'm only asking for putting a somewhat more elaborate view of the torchvision conception / plans, in the broader sense, in the README (even outside of the discussion about color space transforms) :) That's all. I'm sure these discussions were already conducted and you have such plans, so it's more about making these plans / roadmap / vision more visible / discoverable in the README :)
