Make transforms work on video tensor inputs or batch of images #2583

Closed
vfdev-5 opened this issue Aug 13, 2020 · 10 comments
vfdev-5 (Collaborator) commented Aug 13, 2020

🚀 Feature

Following #2292 (comment) and discussion with @fmassa and @bjuncek, this feature request is to improve the transforms module to support video inputs as torch.Tensor of shape (C, T, H, W), where C is the number of image channels (e.g. 3 for RGB), T is the number of frames, and H, W are the image dimensions.

Points to discuss:

  • Convention for geometric transforms: the last two dimensions are H, W?
  • Convention for color transforms: one dimension is color, as above in (C, T, H, W)?
    • should we also support (T, C, H, W) ?
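The two candidate layouts above can be sketched as follows (a minimal illustration; the tensor sizes are arbitrary and chosen only for the example):

```python
import torch

# A hypothetical 10-frame RGB clip, 64x64 pixels, in the proposed
# (C, T, H, W) layout.
video_cthw = torch.rand(3, 10, 64, 64)

# Geometric transforms only touch the last two dims (H, W), so the same
# code works on images (C, H, W), videos (C, T, H, W), and batches of
# images (N, C, H, W).
flipped = video_cthw.flip(-1)  # horizontal flip
assert flipped.shape == video_cthw.shape

# Switching to the alternative (T, C, H, W) layout is a single permute.
video_tchw = video_cthw.permute(1, 0, 2, 3)
assert video_tchw.shape == (10, 3, 64, 64)
```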
joakimjohnander commented Aug 20, 2020

The representation (T, C, H, W) is convenient when processing one frame at a time, such as in object tracking. I would therefore advocate its support.
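The convenience of (T, C, H, W) for frame-at-a-time processing can be sketched like this (sizes are arbitrary, for illustration only):

```python
import torch

# With (T, C, H, W), iterating over dim 0 yields per-frame (C, H, W)
# tensors, i.e. exactly what existing image transforms and models expect.
video = torch.rand(8, 3, 32, 32)  # 8 RGB frames

for frame in video:               # frame: (C, H, W)
    assert frame.shape == (3, 32, 32)
    # ...run a per-frame tracker or image transform here...
```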

fmassa (Member) commented Aug 21, 2020

Convention for geometric transforms: 2 last dimensions are H, W ?

Yes, I think we should assume that the last two dimensions are H and W. In the future we will leverage memory_format to account for channels-last, but the shape will stay the same.

Convention for color transforms:

My first reaction would be that we should assume that the channels are at dim=-3. This way, we don't need to special-case for videos. And if the user needs their video in a different format (C, T, H, W for example), they can apply a permutation to it.

Another option (which I would leave out for now) is to allow a dim argument to the color transforms, so that the user can specify which dimension we should use for color. But I'm not necessarily in favor of this approach for now, as it might add more complexity for the user, and we should just assume a good default.

adding @takatosp1 @tullie for thoughts
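To make the dim=-3 convention concrete, here is a hedged sketch (to_grayscale is a hypothetical color op written for this example, not a torchvision API): a transform that assumes channels at dim=-3 works unchanged on images, batches, and (T, C, H, W) videos, while a (C, T, H, W) video needs one permute first.

```python
import torch

# Hypothetical color op written against the dim=-3 convention: it never
# special-cases videos, it just assumes channels sit at dim=-3.
def to_grayscale(x: torch.Tensor) -> torch.Tensor:
    w = torch.tensor([0.299, 0.587, 0.114])   # ITU-R 601 luma weights
    w = w.view(3, 1, 1)                       # (C, 1, 1), aligns with dim=-3
    return (x * w).sum(dim=-3, keepdim=True)  # channels collapse to 1

img   = torch.rand(3, 16, 16)      # image: (C, H, W)
batch = torch.rand(4, 3, 16, 16)   # batch: (N, C, H, W)
video = torch.rand(8, 3, 16, 16)   # video: (T, C, H, W)

assert to_grayscale(img).shape   == (1, 16, 16)
assert to_grayscale(batch).shape == (4, 1, 16, 16)
assert to_grayscale(video).shape == (8, 1, 16, 16)

# A (C, T, H, W) video needs a permutation first, as suggested above.
video_cthw = torch.rand(3, 8, 16, 16)
gray = to_grayscale(video_cthw.permute(1, 0, 2, 3))  # -> (T, 1, H, W)
assert gray.shape == (8, 1, 16, 16)
```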

bjuncek (Contributor) commented Aug 21, 2020

My first reaction would be that we should assume that the channels are the dim=-3. This way, we don't need to special-case for videos. And if the user needs their video in a different format (C, T, H, W for example) they can apply a permutation to it.

The annoying part about this is that ToTensorVideo transforms the ffmpeg output (T, H, W, C) into (T, C, H, W), so we should apply some version of ToTensor in the video reading part, because otherwise we'd end up with a sandwich of permutations in the transforms.
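A rough sketch of what such a ToTensor-style step in the video reader could do, going by the conversion described above (this is an assumption for illustration, not the actual _transforms_video.ToTensorVideo implementation):

```python
import torch

# ffmpeg-style output (T, H, W, C) uint8 -> (T, C, H, W) float in [0, 1].
def to_tensor_video(clip: torch.Tensor) -> torch.Tensor:
    return clip.permute(0, 3, 1, 2).float() / 255.0

raw = torch.randint(0, 256, (5, 24, 24, 3), dtype=torch.uint8)
clip = to_tensor_video(raw)
assert clip.shape == (5, 3, 24, 24)
assert clip.min() >= 0.0 and clip.max() <= 1.0
```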

fmassa (Member) commented Aug 21, 2020

Everything under _transforms_video.py is private, including ToTensorVideo, so I would not worry about breaking BC for this. We don't actually want to keep those functions around for long.

tullie commented Aug 22, 2020

Your logic for (T, C, H, W) makes sense, particularly for transforms. The problem is that any video model with 3D convolutions is likely going to want T, H, W in the last three dims. I guess it depends on whether you're prioritizing the best format for writing transforms or for writing common video models.
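This preference is visible in PyTorch's own API: nn.Conv3d expects input of shape (N, C, D, H, W), i.e. channels first with the temporal and spatial dims last. A minimal sketch (layer and clip sizes are arbitrary):

```python
import torch
import torch.nn as nn

# nn.Conv3d takes (N, C, D, H, W), so 3D-conv video models want the
# clip as (C, T, H, W) per sample, with T playing the role of depth D.
conv = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
clip = torch.rand(2, 3, 8, 16, 16)   # (N, C, T, H, W)
out = conv(clip)
assert out.shape == (2, 8, 8, 16, 16)
```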

haooooooqi commented

I think it depends on which "operations" follow the transforms.
The advantage of (T, C, H, W) is that we can reuse the same "image" operations efficiently.
The advantage of (C, T, H, W) is that we can do operations like 3D convolution more efficiently.
Currently I can see operations like spatio-temporal crop ((T, H, W, C) preferred), color normalization ((C, T, H, W) preferred), color jitter ((C, T, H, W) preferred), and flip (does not matter), in physical memory order.
From the efficiency perspective, we could probably use the order that provides the most efficient indexing.
From the implementation perspective, it would be nice if we didn't need a permute at the end.
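The efficiency point above comes down to physical memory order: permute() only changes strides and copies nothing, but downstream ops that prefer the other physical layout then pay for non-contiguous access or an explicit copy. A small sketch (shapes arbitrary):

```python
import torch

video_cthw = torch.rand(3, 8, 32, 32)        # physical order C, T, H, W
video_tchw = video_cthw.permute(1, 0, 2, 3)  # logical (T, C, H, W) view

# Same storage, new strides: no data was moved yet.
assert not video_tchw.is_contiguous()

# Materializing the other physical order is an actual copy.
assert video_tchw.contiguous().is_contiguous()
```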

@vfdev-5 vfdev-5 changed the title Make transforms work on video tensor inputs Make transforms work on video tensor inputs or batch of images Sep 3, 2020
fmassa (Member) commented Oct 21, 2020

@vfdev-5 I believe this can be closed now?

vfdev-5 (Collaborator, Author) commented Oct 21, 2020

@fmassa I thought we would close it once we had updated the video classification reference example, #2935. But as you wish; otherwise I can create another issue for that update.

fmassa (Member) commented Oct 21, 2020

Ok, let's wait until we update the reference examples before closing this one.

vfdev-5 (Collaborator, Author) commented Nov 16, 2020

Closing the issue, as #2935 has been merged.
