-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[NOMERGE] Review Features prototype #5379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
datumbox
wants to merge
1
commit into
pytorch:revamp-prototype-features-transforms
from
datumbox:review_features
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question that we need to answer is if there are video transforms that work with the temporal information. This would require the video feature that always have at least 4 dimensions like
(time, channels, height, width)
. If this is not required, we can simply treat a video as a batch of images and thus use theImage
feature for it. cc @bjuncekThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends. Some video-specific transforms can potentially use the temporal dimension but for now those that we support don't. So the API should limit us from supporting such transforms but the low-level kernels must support Videos (I believe that's the case now for the vast majority of them; worth confirming).
Note that Videos will have 5 dimensions if you include the batch. This is why many tensor low-level kernels currently implement the transforms using negative indexes (-1, -2, -3) for C, H, W. It is on the current roadmap of TorchVision to improve video support, so that's something we need to factor in.
cc @vfdev-5 for visibility on which transforms need to be adjusted to support videos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can link this issue #2583 where it was an attempt to make transforms to support video data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, it's not outlandish to expect video tensors to have (at least) 4 dimensions, with fixed expectation for C,H,W being the last three in order to avoid having to re-implement things with negative indexes.
I'm not sure I follow. There are video transforms (temporal jittering, subsampling, ...) that we should be able to support. In principle, I wouldn't expect torchvision to support every possible scenario, but it would be helpful if basic transforms for video worked on an arbitrary dimensional (i.e.
[..., C, H, W]
) tensor. In that way, if user wants to implement some obscure temporal transform, they can implement something operating on the T dimension (so, -4) in a way that follows common API for a video feature (or batch image feature, whatever will be easiest). Would the current implementation make that troublesome?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bjuncek I think we agree on the approach. Having the Videos stored as
[..., T, C, H, W]
allows us to implement most of the basic transforms with Video support. We should investigate which of the current transforms support this and which don't. Also how about the batch dimension aka[B, T, C, H, W]
; any thoughts for handling that on the new API?I just wanted to bring up the fact that it's not a given that ALL transforms will operate independently on the temporal dimension. So on the new API, there might be the need to support eventually a transform that operates on (T, C, H, W) jointly. Just mentioning this so that we don't limit in on the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a
Video
feature with[..., T, C, H, W]
will give us the most flexibility. All transforms that don't need the temporal information can simply use the corresponding image kernel that expects[..., C, H, W]
. At the same time, if we later have transforms that need the temporal information, they are automatically supported.