
[NOMERGE] Review Features prototype #5379


Closed
2 changes: 2 additions & 0 deletions torchvision/prototype/features/__init__.py
@@ -4,3 +4,5 @@
from ._image import ColorSpace, Image
from ._label import Label, OneHotLabel
from ._segmentation_mask import SegmentationMask

# We put a lot of effort into Video this half. We will need to figure out video tensors in this prototype as well
Collaborator

The question we need to answer is whether there are video transforms that work with the temporal information. That would require a Video feature that always has at least 4 dimensions, like (time, channels, height, width). If this is not required, we can simply treat a video as a batch of images and thus use the Image feature for it. cc @bjuncek

Contributor Author

It depends. Some video-specific transforms can potentially use the temporal dimension, but for now those that we support don't. So the API shouldn't limit us from supporting such transforms, but the low-level kernels must support videos (I believe that's the case now for the vast majority of them; worth confirming).

Note that videos will have 5 dimensions if you include the batch dimension. This is why many low-level tensor kernels currently implement the transforms using negative indices (-1, -2, -3) for C, H, W. Improving video support is on TorchVision's current roadmap, so that's something we need to factor in.

cc @vfdev-5 for visibility on which transforms need to be adjusted to support videos.
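
As a toy illustration of the negative-indexing convention mentioned above (not taken from the torchvision code base), a kernel that only touches the trailing dimensions works unchanged on single images, batches, and videos:

import torch

def horizontal_flip(t: torch.Tensor) -> torch.Tensor:
    # Flipping the last dimension (W) works no matter how many leading
    # dimensions there are: [C, H, W], [B, C, H, W] or [B, T, C, H, W].
    return t.flip(-1)

image = torch.rand(3, 224, 224)         # [C, H, W]
video = torch.rand(8, 16, 3, 224, 224)  # [B, T, C, H, W]
assert horizontal_flip(image).shape == image.shape
assert horizontal_flip(video).shape == video.shape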

Collaborator

I can link issue #2583, which was an attempt to make the transforms support video data.

Contributor

This would require a Video feature that always has at least 4 dimensions

In principle, it's not outlandish to expect video tensors to have (at least) 4 dimensions, with the fixed expectation that C, H, W are the last three, in order to avoid having to re-implement things with negative indices.

It depends. Some video-specific transforms can potentially use the temporal dimension, but for now those that we support don't. So the API shouldn't limit us from supporting such transforms, but the low-level kernels must support videos

I'm not sure I follow. There are video transforms (temporal jittering, subsampling, ...) that we should be able to support. In principle, I wouldn't expect torchvision to support every possible scenario, but it would be helpful if basic transforms for video worked on a tensor with an arbitrary number of leading dimensions (i.e. [..., C, H, W]). That way, if a user wants to implement some obscure temporal transform, they can implement something operating on the T dimension (so, -4) in a way that follows the common API for a video feature (or batched image feature, whichever is easiest). Would the current implementation make that troublesome?
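
A minimal sketch of one such temporal transform, assuming the [..., T, C, H, W] layout discussed here (the function name and behaviour are illustrative, not part of this PR):

import torch

def temporal_subsample(video: torch.Tensor, num_frames: int) -> torch.Tensor:
    # Assumes a [..., T, C, H, W] layout, i.e. the temporal dimension sits at -4.
    dim = video.ndim - 4
    indices = torch.linspace(0, video.shape[dim] - 1, num_frames).long()
    return video.index_select(dim, indices)

clip = torch.rand(32, 3, 112, 112)      # [T, C, H, W]
batch = torch.rand(4, 32, 3, 112, 112)  # [B, T, C, H, W]
assert temporal_subsample(clip, 8).shape == (8, 3, 112, 112)
assert temporal_subsample(batch, 8).shape == (4, 8, 3, 112, 112)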

Contributor Author

@bjuncek I think we agree on the approach. Storing videos as [..., T, C, H, W] allows us to implement most of the basic transforms with video support. We should investigate which of the current transforms support this and which don't. Also, how about the batch dimension, i.e. [B, T, C, H, W]; any thoughts on handling that in the new API?

I'm not sure I follow.

I just wanted to bring up that it's not a given that ALL transforms will operate independently along the temporal dimension. So in the new API, we might eventually need to support a transform that operates on (T, C, H, W) jointly. Just mentioning this so that we don't limit ourselves on the API side.
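
As a toy example of a transform that does use the temporal neighbourhood jointly rather than frame-by-frame (purely illustrative, assuming the [..., T, C, H, W] layout):

import torch

def temporal_blend(video: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # Blends each frame with its predecessor, so the output depends on the
    # temporal neighbourhood rather than on each frame independently.
    t_dim = video.ndim - 4  # position of T in [..., T, C, H, W]
    prev = torch.roll(video, shifts=1, dims=t_dim)
    blended = alpha * video + (1 - alpha) * prev
    blended[..., 0, :, :, :] = video[..., 0, :, :, :]  # keep the first frame as-is
    return blended

clip = torch.rand(16, 3, 64, 64)  # [T, C, H, W]
assert temporal_blend(clip).shape == clip.shape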

Collaborator

Having a Video feature with [..., T, C, H, W] will give us the most flexibility. All transforms that don't need the temporal information can simply use the corresponding image kernel that expects [..., C, H, W]. At the same time, if we later have transforms that need the temporal information, they are automatically supported.
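
To make this concrete, here is a rough sketch of the idea (the Video class and resize_image kernel below are hypothetical stand-ins, not the actual prototype code) showing how a [..., T, C, H, W] video can reuse a kernel that only assumes [..., C, H, W]:

from typing import Tuple

import torch

class Video(torch.Tensor):
    # Hypothetical feature: a tensor that always has at least the [T, C, H, W] dimensions.
    def __new__(cls, data: torch.Tensor) -> "Video":
        if data.ndim < 4:
            raise ValueError("expected at least [T, C, H, W]")
        return data.as_subclass(cls)

def resize_image(img: torch.Tensor, size: Tuple[int, int]) -> torch.Tensor:
    # An "image" kernel that only assumes [..., C, H, W]: flatten the leading
    # dimensions, resize, then restore them. Videos are supported for free.
    leading = img.shape[:-3]
    out = torch.nn.functional.interpolate(
        img.reshape(-1, *img.shape[-3:]), size=size, mode="bilinear", align_corners=False
    )
    return out.reshape(*leading, *out.shape[-3:])

video = Video(torch.rand(16, 3, 128, 128))  # [T, C, H, W]
assert resize_image(video, (64, 64)).shape == (16, 3, 64, 64)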

5 changes: 4 additions & 1 deletion torchvision/prototype/features/_bounding_box.py
@@ -15,7 +15,7 @@ class BoundingBoxFormat(StrEnum):


class BoundingBox(Feature):
formats = BoundingBoxFormat
formats = BoundingBoxFormat  # Couldn't find a use of this in the code. Is there a reason why we don't just let people access the enum directly?
format: BoundingBoxFormat
image_size: Tuple[int, int]

@@ -40,6 +40,9 @@ def __new__(
def to_format(self, format: Union[str, BoundingBoxFormat]) -> "BoundingBox":
# import at runtime to avoid cyclic imports
from torchvision.prototype.transforms.functional import convert_bounding_box_format
# I think we can avoid this by not having a `to_format` method and instead requiring users to explicitly call the
# conversion method. As far as I can see, this method is used only once in the code, so it is something we
# could avoid altogether.

if isinstance(format, str):
format = BoundingBoxFormat[format]
1 change: 1 addition & 0 deletions torchvision/prototype/features/_encoded.py
@@ -41,6 +41,7 @@ def image_size(self) -> Tuple[int, int]:
def decode(self) -> Image:
# import at runtime to avoid cyclic imports
from torchvision.prototype.transforms.functional import decode_image_with_pil
# Same comments as for BoundingBox.to_format

return Image(decode_image_with_pil(self))

3 changes: 3 additions & 0 deletions torchvision/prototype/features/_image.py
@@ -13,6 +13,7 @@

class ColorSpace(StrEnum):
# this is just for test purposes
# How about the transparency spaces supported by ImageReadMode?
_SENTINEL = -1
OTHER = 0
GRAYSCALE = 1
@@ -77,7 +78,9 @@ def guess_color_space(data: torch.Tensor) -> ColorSpace:
return ColorSpace.OTHER

def show(self) -> None:
# This is a nice-to-have, but not a necessary method, this early in the prototype
to_pil_image(make_grid(self.view(-1, *self.shape[-3:]))).show()

def draw_bounding_box(self, bounding_box: BoundingBox, **kwargs: Any) -> "Image":
# Same as above, and noting that this is the only method that requires to_format().
return Image.new_like(self, draw_bounding_boxes(self, bounding_box.to_format("xyxy").view(-1, 4), **kwargs))
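
On the transparency question above: torchvision.io.ImageReadMode also has GRAY_ALPHA and RGB_ALPHA modes, so the enum could plausibly grow matching members. A hypothetical sketch (names and values are illustrative, and plain Enum is used here instead of the prototype's StrEnum):

from enum import Enum

class ColorSpace(Enum):
    # Hypothetical extension mirroring torchvision.io.ImageReadMode, which also
    # covers images with a transparency channel.
    OTHER = 0
    GRAYSCALE = 1
    GRAYSCALE_ALPHA = 2
    RGB = 3
    RGB_ALPHA = 4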
4 changes: 2 additions & 2 deletions torchvision/prototype/features/_label.py
@@ -15,7 +15,7 @@ def __new__(
*,
dtype: Optional[torch.dtype] = None,
device: Optional[torch.device] = None,
like: Optional["Label"] = None,
like: Optional["Label"] = None,  # Since we are on Py3.7, perhaps we could do `from __future__ import annotations` now.
categories: Optional[Sequence[str]] = None,
):
label = super().__new__(cls, data, dtype=dtype, device=device)
@@ -26,7 +26,7 @@ def __new__(

@classmethod
def from_category(cls, category: str, *, categories: Sequence[str]):
categories = list(categories)
categories = list(categories)  # Why shallow-copy here? If this method is called in a loop, we run the risk of creating many shallow copies.
return cls(categories.index(category), categories=categories)

def to_categories(self):
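
On the `from __future__ import annotations` comment above, a minimal sketch of how the forward reference could then be written without quotes (the class and method here are simplified stand-ins, not the prototype's Label):

from __future__ import annotations  # PEP 563: postponed evaluation of annotations

from typing import Optional, Sequence

class Label:
    def __init__(self, index: int, categories: Optional[Sequence[str]] = None) -> None:
        self.index = index
        self.categories = categories

    # The forward reference to Label no longer needs to be a string:
    def new_like(self, like: Optional[Label] = None) -> Label:
        return like if like is not None else self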