[NOMERGE] Review Features prototype #5379
💊 CI failures summary and remediations

As of commit 72d7077 (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages:

| Job | Step | Action |
|---|---|---|
| Lint Python code and config files | | 🔁 rerun |

This comment was automatically generated by Dr. CI (expand for details).
Please report bugs/suggestions to the (internal) Dr. CI Users group.
@@ -4,3 +4,5 @@
from ._image import ColorSpace, Image
from ._label import Label, OneHotLabel
from ._segmentation_mask import SegmentationMask

# We put lots of effort on Video this half. We will need to figure out video tensors as well in this prototype
The question that we need to answer is whether there are video transforms that work with the temporal information. That would require a Video feature that always has at least 4 dimensions, like (time, channels, height, width). If this is not required, we can simply treat a video as a batch of images and thus use the Image feature for it. cc @bjuncek
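To make the "video as a batch of images" option concrete, here is a minimal sketch (the `hflip_image` kernel is hypothetical, not the actual TorchVision implementation): an image kernel written against trailing `[..., C, H, W]` axes accepts a `(T, C, H, W)` video without any video-specific code.

```python
import torch

def hflip_image(img: torch.Tensor) -> torch.Tensor:
    # Hypothetical image kernel: flips the width axis (-1), so any
    # number of leading dimensions is accepted ([..., C, H, W]).
    return img.flip(-1)

video = torch.rand(16, 3, 32, 32)  # (T, C, H, W)
# Treating the video as a batch of images: T is just another
# leading dimension, so the image kernel applies unchanged.
flipped = hflip_image(video)
assert flipped.shape == video.shape
```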
It depends. Some video-specific transforms can potentially use the temporal dimension, but for now those that we support don't. So the API shouldn't limit us from supporting such transforms, but the low-level kernels must support videos (I believe that's the case now for the vast majority of them; worth confirming).
Note that videos will have 5 dimensions if you include the batch. This is why many low-level tensor kernels currently implement the transforms using negative indexes (-1, -2, -3) for C, H, W. Improving video support is on the current TorchVision roadmap, so that's something we need to factor in.
cc @vfdev-5 for visibility on which transforms need to be adjusted to support videos.
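To illustrate the negative-index convention mentioned above, here is a minimal sketch (a hypothetical `normalize` kernel, not the actual TorchVision code): per-channel statistics are broadcast against the trailing C, H, W axes, so the same kernel handles images, videos, and batched videos.

```python
import torch

def normalize(x: torch.Tensor, mean, std) -> torch.Tensor:
    # Hypothetical low-level kernel: per-channel stats are reshaped to
    # (C, 1, 1) and broadcast over the trailing C, H, W axes, so the
    # same code handles (C, H, W), (T, C, H, W), and (B, T, C, H, W).
    mean = torch.as_tensor(mean, dtype=x.dtype).view(-1, 1, 1)
    std = torch.as_tensor(std, dtype=x.dtype).view(-1, 1, 1)
    return (x - mean) / std

image = torch.zeros(3, 4, 4)
batched_video = torch.zeros(2, 8, 3, 4, 4)  # (B, T, C, H, W)
assert normalize(image, [0.5] * 3, [0.5] * 3).shape == (3, 4, 4)
assert normalize(batched_video, [0.5] * 3, [0.5] * 3).shape == (2, 8, 3, 4, 4)
```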
I can link issue #2583, where there was an attempt to make the transforms support video data.
> This would require a Video feature that always has at least 4 dimensions

In principle, it's not outlandish to expect video tensors to have (at least) 4 dimensions, with the fixed expectation that C, H, W are the last three, in order to avoid having to re-implement things with negative indexes.

> It depends. Some video-specific transforms can potentially use the temporal dimension, but for now those that we support don't. So the API shouldn't limit us from supporting such transforms, but the low-level kernels must support videos

I'm not sure I follow. There are video transforms (temporal jittering, subsampling, ...) that we should be able to support. In principle, I wouldn't expect torchvision to support every possible scenario, but it would be helpful if basic transforms for video worked on an arbitrary-dimensional (i.e. [..., C, H, W]) tensor. That way, if a user wants to implement some obscure temporal transform, they can implement something operating on the T dimension (so, -4) in a way that follows the common API for a video feature (or a batched image feature, whichever is easiest). Would the current implementation make that troublesome?
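As a sketch of what such a user-defined temporal transform could look like (the `temporal_subsample` function is hypothetical, named here only for illustration): with a `[..., T, C, H, W]` convention, T is always dim -4, regardless of how many batch dimensions precede it.

```python
import torch

def temporal_subsample(video: torch.Tensor, step: int) -> torch.Tensor:
    # Hypothetical temporal transform: keeps every `step`-th frame along
    # the T axis. With the [..., T, C, H, W] convention, T is always
    # dim -4, so the same code works with or without batch dimensions.
    indices = torch.arange(0, video.size(-4), step)
    return video.index_select(-4, indices)

clip = torch.rand(2, 16, 3, 8, 8)  # (B, T, C, H, W)
assert temporal_subsample(clip, 2).shape == (2, 8, 3, 8, 8)
```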
@bjuncek I think we agree on the approach. Having the videos stored as [..., T, C, H, W] allows us to implement most of the basic transforms with video support. We should investigate which of the current transforms support this and which don't. Also, what about the batch dimension, aka [B, T, C, H, W]; any thoughts on handling that in the new API?

> I'm not sure I follow.

I just wanted to bring up the fact that it's not a given that ALL transforms will operate independently on the temporal dimension. So in the new API, there might eventually be the need to support a transform that operates on (T, C, H, W) jointly. Just mentioning this so that we don't limit ourselves in the API.
Having a Video feature with [..., T, C, H, W] will give us the most flexibility. All transforms that don't need the temporal information can simply use the corresponding image kernel that expects [..., C, H, W]. At the same time, if we later have transforms that need the temporal information, they are automatically supported.
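A minimal sketch of that dispatch, assuming a feature class modeled loosely on the prototype's existing Image/Label subclasses (the `Video` class and both kernels below are hypothetical, not the prototype's actual code):

```python
import torch

class Video(torch.Tensor):
    # Hypothetical feature class wrapping a [..., T, C, H, W] tensor,
    # loosely modeled on the prototype's Image/Label feature subclasses.
    def __new__(cls, data):
        tensor = torch.as_tensor(data)
        assert tensor.ndim >= 4, "expected [..., T, C, H, W]"
        return tensor.as_subclass(cls)

def hflip_image(img: torch.Tensor) -> torch.Tensor:
    # Image kernel written for [..., C, H, W].
    return img.flip(-1)

def hflip_video(video: Video) -> Video:
    # No temporal logic needed: T is just another leading dimension,
    # so the video dispatch simply reuses the image kernel.
    return Video(hflip_image(video))

v = Video(torch.rand(16, 3, 8, 8))
assert isinstance(hflip_video(v), Video)
assert hflip_video(v).shape == (16, 3, 8, 8)
```

A transform that does need the temporal dimension would instead operate on dim -4 directly, which the same layout supports without any change to the feature class.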
Adding comments to the feature classes.