VideoClips: audio clips do not correspond to video clips #2474

v-iashin opened this issue Jul 15, 2020 · 9 comments

Comments

@v-iashin

v-iashin commented Jul 15, 2020

🐛 Bug

The audio stream does not correspond to the visual stream when torchvision.datasets.video_utils.VideoClips is used.

To Reproduce

Steps to reproduce the behavior:

  1. Here are two videos I tested on Archive.zip
  2. The code to reproduce
from torchvision.io import read_video
from torchvision.datasets.video_utils import VideoClips

VIDEO_PATH = './4fpkD4A_t1s_35000_45000.mp4'
VIDEO_PATH = './small.mp4'

if __name__ == "__main__":
    print(f'I am using: {VIDEO_PATH}')
    print(f'Output using torchvision.io.read_video:')
    visual, audio, info = read_video(VIDEO_PATH, pts_unit='sec')
    print('Visual:', visual.shape, 'Audio:', audio.shape, info)
    
    print(f'Output using torchvision.datasets.video_utils.VideoClips:')
    vclips = VideoClips([VIDEO_PATH], clip_length_in_frames=30, frames_between_clips=30)
    for i in range(vclips.num_clips()):
        visual, audio, info, vid_idx = vclips.get_clip(i)
        print(f'Clip #{i}', 'Visual:', visual.shape, 'Audio:', audio.shape, info)
  3. The output I see
I am using: ./small.mp4
Output using torchvision.io.read_video:
Visual: torch.Size([166, 320, 560, 3]) Audio: torch.Size([1, 266240]) {'video_fps': 30.0, 'audio_fps': 48000}
Output using torchvision.datasets.video_utils.VideoClips:
/home/vladimir/miniconda3/envs/bug_report_video_clips/lib/python3.8/site-packages/torchvision/io/video.py:103: UserWarning: The pts_unit 'pts' gives wrong results and will be removed in a follow-up version. Please use pts_unit 'sec'.
  warnings.warn(
100.0%
Clip #0 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 86016]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #1 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 87142]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #2 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 87255]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #3 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #4 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
I am using: ./4fpkD4A_t1s_35000_45000.mp4
Output using torchvision.io.read_video:
Visual: torch.Size([300, 720, 1280, 3]) Audio: torch.Size([2, 440320]) {'video_fps': 30.0, 'audio_fps': 44100}
Output using torchvision.datasets.video_utils.VideoClips:
/home/vladimir/miniconda3/envs/bug_report_video_clips/lib/python3.8/site-packages/torchvision/io/video.py:103: UserWarning: The pts_unit 'pts' gives wrong results and will be removed in a follow-up version. Please use pts_unit 'sec'.
  warnings.warn(
100.0%
Clip #0 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 14336]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #1 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #2 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #3 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #4 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #5 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #6 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #7 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #8 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #9 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}

Expected behavior

  1. The output of torchvision.io.read_video is ok, and it is as expected. I provide it here for reference. Also, the visual streams returned from torchvision.datasets.video_utils.VideoClips are ok.
  2. I expect the output of torchvision.datasets.video_utils.VideoClips().get_clip() to have a comparable number of audio samples, i.e. about 48k or 44.1k for 1 second of 30 fps video. Instead, it outputs either more samples than expected or just a fraction of them. Specifically, for ./small.mp4 it outputs ~87k samples in each of the first three clips and 0 in the last two (expected 48k per clip), while for 4fpkD4A_t1s_35000_45000.mp4 it outputs ~14k for the first clip and ~15k for each of the rest (expected 44.1k per clip). The latter does not even come close to the expected 440k samples for the whole 10 s video. Similarly, the former totals (86016 + 87142 + 87255) = 260413 samples, which does not match the 266240 loaded by torchvision.io.read_video.
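
The expectation above is plain arithmetic: a clip of N frames lasts N / video_fps seconds, so it should carry roughly that duration times audio_fps audio samples. A minimal sketch, assuming a constant frame rate; `expected_audio_samples` is a hypothetical helper (not part of torchvision), and the numbers are the ones reported above:

```python
def expected_audio_samples(clip_length_in_frames, video_fps, audio_fps):
    """Approximate number of audio samples spanning a clip of video frames."""
    clip_duration_sec = clip_length_in_frames / video_fps
    return round(clip_duration_sec * audio_fps)


# Values taken from the report: 30-frame clips of 30 fps video.
small_expected = expected_audio_samples(30, 30.0, 48000)
long_expected = expected_audio_samples(30, 30.0, 44100)

# Audio sizes returned by VideoClips for the first three clips of small.mp4.
small_observed = [86016, 87142, 87255]
small_total = sum(small_observed)

print(small_expected, long_expected)  # 48000 44100
print(small_total)                    # 260413, versus 266240 from read_video
```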

Environment

Collecting environment information...
PyTorch version: 1.5.1
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect

Nvidia driver version: 440.44
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.5.1
[pip3] torchvision==0.6.0a0+35d732a
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] mkl                       2020.1                      217  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.1.0            py38h23d657b_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.18.5           py38ha1c710e_0  
[conda] numpy-base                1.18.5           py38hde5b4d6_0  
[conda] pytorch                   1.5.1           py3.8_cuda10.2.89_cudnn7.6.5_0    pytorch
[conda] torchvision               0.6.1                py38_cu102    pytorch

Additional context

Currently, VideoClips is not documented on the website, so my misunderstanding might stem from the missing documentation.

@v-iashin
Author

Ok, after some debugging I found out that the warning "The pts_unit 'pts' gives wrong results and will be removed in a follow-up version. Please use pts_unit 'sec'." actually relates to this bug, and that the bug is known (e.g. #1221, #1672, #1931).

@bjuncek
Contributor

bjuncek commented Jul 23, 2020

Hi @v-iashin - have you been able to resolve this issue when using the 'sec' as a default pts_unit?
I have done some testing of that on torchvision 0.5/0.6 but not on the latest master, so if needed let me know and I'll take a look at this :)

@v-iashin
Author

Hi @bjuncek there is no pts_unit argument in VideoClips. read_video works as I expect it to work -- reads an entire video.

@bjuncek
Contributor

bjuncek commented Jul 24, 2020

Ah, I see that it's still set up as "pts" by default.
If you simply replace the default to "sec" on lines here and here, does that solve your issue?

If so, I'll send a PR later today or over the weekend. I can also test it out for you on Monday if that works.

@v-iashin
Author

Nope, it does not. It fails:

Traceback (most recent call last):
  File "/home/user/project4/main.py", line 14, in <module>
    vclips = VideoClips([VIDEO_PATH], clip_length_in_frames=30, frames_between_clips=30, )
  File "/home/user/miniconda3/envs/bug_report_video_clips_nightly/lib/python3.8/site-packages/torchvision/datasets/video_utils.py", line 122, in __init__
    self._compute_frame_pts()
  File "/home/user/miniconda3/envs/bug_report_video_clips_nightly/lib/python3.8/site-packages/torchvision/datasets/video_utils.py", line 150, in _compute_frame_pts
    clips = [torch.as_tensor(c) for c in clips]
  File "/home/user/miniconda3/envs/bug_report_video_clips_nightly/lib/python3.8/site-packages/torchvision/datasets/video_utils.py", line 150, in <listcomp>
    clips = [torch.as_tensor(c) for c in clips]
RuntimeError: Could not infer dtype of Fraction

because now read_video_timestamps returns Fraction coming from video_time_base. To this end, I replaced

pts = [x * video_time_base for x in pts]

with

pts = [x * float(video_time_base) for x in pts]
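
A minimal sketch of why the float() conversion changes the element type, using only the standard library. The time base and pts values here are hypothetical (the real ones depend on the video container); the point is that an int multiplied by a Fraction stays a Fraction, which is the type named in the traceback above:

```python
from fractions import Fraction

# Hypothetical values: the real time base and pts depend on the video container.
video_time_base = Fraction(1, 15360)
pts = [0, 512, 1024]

# Multiplying an int by a Fraction yields a Fraction, the type reported in the traceback.
as_fractions = [x * video_time_base for x in pts]

# Converting the time base to float first yields plain floats.
as_floats = [x * float(video_time_base) for x in pts]

print([type(v).__name__ for v in as_fractions])
print([type(v).__name__ for v in as_floats])
```

The float version gives up exact rational arithmetic, so timestamps computed this way may carry small rounding errors.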

Still, this returns:

I am using: ./small.mp4
Output using torchvision.io.read_video:
Visual: torch.Size([166, 320, 560, 3]) Audio: torch.Size([1, 266240]) {'video_fps': 30.0, 'audio_fps': 48000}
Output using torchvision.datasets.video_utils.VideoClips:
100.0%
Clip #0 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #1 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #2 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #3 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #4 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}

The audio shape is torch.Size([1, 0]), while I expect it to be roughly torch.Size([1, <audio_fps>]).

Then, I inspected it a bit and found that aframes are reasonable until

aframes = _align_audio_frames(aframes, audio_frames, start_pts, end_pts)

where pts are used again.

@MJoodaki

Hi, I have almost the same problem with VideoClips.
I used VideoClips to create clips (15 frames each) from a reference video (without audio) and saved the clips to a directory using torchvision.io.write_video. I then realized that each clip contained 12 frames instead of 15, with every two consecutive frames being duplicates.
Is that also caused by the pts_unit="pts" problem you mentioned before, or could it be something else? Do you have any suggestions for solving this issue? Apparently, "pts" is still the default in the new update!
I was wondering if anybody could help me.
Many thanks

@v-iashin
Author

Hi, I haven't figured out how to solve it yet. Let me know if you find out anything.

@v-iashin
Author

Currently on Google Colab I get:

I am using: ./small.mp4
Output using torchvision.io.read_video:
Visual: torch.Size([166, 320, 560, 3]) Audio: torch.Size([1, 266240]) {'video_fps': 30.0, 'audio_fps': 48000}
Output using torchvision.datasets.video_utils.VideoClips:
100%
1/1 [00:00<00:00, 6.29it/s]
Clip #0 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 46080]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #1 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 47213]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #2 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 47344]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #3 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 46450]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #4 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 46581]) {'video_fps': 30.0, 'audio_fps': 48000}

and

I am using: ./4fpkD4A_t1s_35000_45000.mp4
Output using torchvision.io.read_video:
Visual: torch.Size([300, 720, 1280, 3]) Audio: torch.Size([2, 440320]) {'video_fps': 30.0, 'audio_fps': 44100}
Output using torchvision.datasets.video_utils.VideoClips:
100%
1/1 [00:01<00:00, 1.33s/it]
Clip #0 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 41984]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #1 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 42939]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #2 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 42869]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #3 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 42800]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #4 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 42730]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #5 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 42660]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #6 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 43615]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #7 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 43545]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #8 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 43476]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #9 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 43406]) {'video_fps': 30.0, 'audio_fps': 44100}

Here is the environment

Colab packages
absl-py==0.12.0
alabaster==0.7.12
albumentations==0.1.12
altair==4.1.0
appdirs==1.4.4
argcomplete==1.12.3
argon2-cffi==21.1.0
arviz==0.11.4
astor==0.8.1
astropy==4.3.1
astunparse==1.6.3
atari-py==0.2.9
atomicwrites==1.4.0
attrs==21.2.0
audioread==2.1.9
autograd==1.3
av==8.0.2
Babel==2.9.1
backcall==0.2.0
beautifulsoup4==4.6.3
bleach==4.1.0
blis==0.4.1
bokeh==2.3.3
Bottleneck==1.3.2
branca==0.4.2
bs4==0.0.1
CacheControl==0.12.10
cached-property==1.5.2
cachetools==4.2.4
catalogue==1.0.0
certifi==2021.10.8
cffi==1.15.0
cftime==1.5.1.1
chardet==3.0.4
charset-normalizer==2.0.8
click==7.1.2
cloudpickle==1.3.0
cmake==3.12.0
cmdstanpy==0.9.5
colorcet==2.0.6
colorlover==0.3.0
community==1.0.0b1
contextlib2==0.5.5
convertdate==2.3.2
coverage==3.7.1
coveralls==0.5
crcmod==1.7
cufflinks==0.17.3
cupy-cuda111==9.4.0
cvxopt==1.2.7
cvxpy==1.0.31
cycler==0.11.0
cymem==2.0.6
Cython==0.29.24
daft==0.0.4
dask==2.12.0
datascience==0.10.6
debugpy==1.0.0
decorator==4.4.2
defusedxml==0.7.1
descartes==1.1.0
dill==0.3.4
distributed==1.25.3
dlib @ file:///dlib-19.18.0-cp37-cp37m-linux_x86_64.whl
dm-tree==0.1.6
docopt==0.6.2
docutils==0.17.1
dopamine-rl==1.0.5
earthengine-api==0.1.290
easydict==1.9
ecos==2.0.7.post1
editdistance==0.5.3
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
entrypoints==0.3
ephem==4.1
et-xmlfile==1.1.0
fa2==0.3.5
fastai==1.0.61
fastdtw==0.3.4
fastprogress==1.0.0
fastrlock==0.8
fbprophet==0.7.1
feather-format==0.4.1
filelock==3.4.0
firebase-admin==4.4.0
fix-yahoo-finance==0.0.22
Flask==1.1.4
flatbuffers==2.0
folium==0.8.3
future==0.16.0
gast==0.4.0
GDAL==2.2.2
gdown==3.6.4
gensim==3.6.0
geographiclib==1.52
geopy==1.17.0
gin-config==0.5.0
glob2==0.7
google==2.0.3
google-api-core==1.26.3
google-api-python-client==1.12.8
google-auth==1.35.0
google-auth-httplib2==0.0.4
google-auth-oauthlib==0.4.6
google-cloud-bigquery==1.21.0
google-cloud-bigquery-storage==1.1.0
google-cloud-core==1.0.3
google-cloud-datastore==1.8.0
google-cloud-firestore==1.7.0
google-cloud-language==1.2.0
google-cloud-storage==1.18.1
google-cloud-translate==1.5.0
google-colab @ file:///colabtools/dist/google-colab-1.0.0.tar.gz
google-pasta==0.2.0
google-resumable-media==0.4.1
googleapis-common-protos==1.53.0
googledrivedownloader==0.4
graphviz==0.10.1
greenlet==1.1.2
grpcio==1.42.0
gspread==3.0.1
gspread-dataframe==3.0.8
gym==0.17.3
h5py==3.1.0
HeapDict==1.0.1
hijri-converter==2.2.2
holidays==0.10.5.2
holoviews==1.14.6
html5lib==1.0.1
httpimport==0.5.18
httplib2==0.17.4
httplib2shim==0.0.3
humanize==0.5.1
hyperopt==0.1.2
ideep4py==2.0.0.post3
idna==2.10
imageio==2.4.1
imagesize==1.3.0
imbalanced-learn==0.8.1
imblearn==0.0
imgaug==0.2.9
importlib-metadata==4.8.2
importlib-resources==5.4.0
imutils==0.5.4
inflect==2.1.0
iniconfig==1.1.1
intel-openmp==2021.4.0
intervaltree==2.1.0
ipykernel==4.10.1
ipython==5.5.0
ipython-genutils==0.2.0
ipython-sql==0.3.9
ipywidgets==7.6.5
itsdangerous==1.1.0
jax==0.2.25
jaxlib @ https://storage.googleapis.com/jax-releases/cuda111/jaxlib-0.1.71+cuda111-cp37-none-manylinux2010_x86_64.whl
jdcal==1.4.1
jedi==0.18.1
jieba==0.42.1
Jinja2==2.11.3
joblib==1.1.0
jpeg4py==0.1.4
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.3.5
jupyter-console==5.2.0
jupyter-core==4.9.1
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.2
kaggle==1.5.12
kapre==0.3.6
keras==2.7.0
Keras-Preprocessing==1.1.2
keras-vis==0.4.1
kiwisolver==1.3.2
korean-lunar-calendar==0.2.1
libclang==12.0.0
librosa==0.8.1
lightgbm==2.2.3
llvmlite==0.34.0
lmdb==0.99
LunarCalendar==0.0.9
lxml==4.2.6
Markdown==3.3.6
MarkupSafe==2.0.1
matplotlib==3.2.2
matplotlib-inline==0.1.3
matplotlib-venn==0.11.6
missingno==0.5.0
mistune==0.8.4
mizani==0.6.0
mkl==2019.0
mlxtend==0.14.0
more-itertools==8.12.0
moviepy==0.2.3.5
mpmath==1.2.1
msgpack==1.0.3
multiprocess==0.70.12.2
multitasking==0.0.10
murmurhash==1.0.6
music21==5.5.0
natsort==5.5.0
nbclient==0.5.9
nbconvert==5.6.1
nbformat==5.1.3
nest-asyncio==1.5.4
netCDF4==1.5.8
networkx==2.6.3
nibabel==3.0.2
nltk==3.2.5
notebook==5.3.1
numba==0.51.2
numexpr==2.7.3
numpy==1.19.5
nvidia-ml-py3==7.352.0
oauth2client==4.1.3
oauthlib==3.1.1
okgrade==0.4.3
omegaconf==2.0.6
opencv-contrib-python==4.1.2.30
opencv-python==4.1.2.30
openpyxl==2.5.9
opt-einsum==3.3.0
osqp==0.6.2.post0
packaging==21.3
palettable==3.3.0
pandas==1.1.5
pandas-datareader==0.9.0
pandas-gbq==0.13.3
pandas-profiling==1.4.1
pandocfilters==1.5.0
panel==0.12.1
param==1.12.0
parso==0.8.3
pathlib==1.0.1
patsy==0.5.2
pep517==0.12.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.1.2
pip-tools==6.2.0
plac==1.1.3
plotly==4.4.1
plotnine==0.6.0
pluggy==0.7.1
pooch==1.5.2
portpicker==1.3.9
prefetch-generator==1.0.1
preshed==3.0.6
prettytable==2.4.0
progressbar2==3.38.0
prometheus-client==0.12.0
promise==2.3
prompt-toolkit==1.0.18
protobuf==3.17.3
psutil==5.4.8
psycopg2==2.7.6.1
ptyprocess==0.7.0
py==1.11.0
pyarrow==3.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycocotools==2.0.3
pycparser==2.21
pyct==0.4.8
pydata-google-auth==1.2.0
pydot==1.3.0
pydot-ng==2.0.0
pydotplus==2.0.2
PyDrive==1.3.1
pyemd==0.5.1
pyerfa==2.0.0.1
pyglet==1.5.0
Pygments==2.6.1
pygobject==3.26.1
pymc3==3.11.4
PyMeeus==0.5.11
pymongo==3.12.1
pymystem3==0.2.0
PyOpenGL==3.1.5
pyparsing==3.0.6
pyrsistent==0.18.0
pysndfile==1.3.8
PySocks==1.7.1
pystan==2.19.1.1
pytest==3.6.4
python-apt==0.0.0
python-chess==0.23.11
python-dateutil==2.8.2
python-louvain==0.15
python-slugify==5.0.2
python-utils==2.5.6
pytz==2018.9
pyviz-comms==2.1.0
PyWavelets==1.2.0
PyYAML==6.0
pyzmq==22.3.0
qdldl==0.1.5.post0
qtconsole==5.2.1
QtPy==1.11.2
regex==2019.12.20
requests==2.23.0
requests-oauthlib==1.3.0
resampy==0.2.2
retrying==1.3.3
rpy2==3.4.5
rsa==4.8
scikit-image==0.18.3
scikit-learn==1.0.1
scipy==1.4.1
screen-resolution-extra==0.0.0
scs==2.1.4
seaborn==0.11.2
semver==2.13.0
Send2Trash==1.8.0
setuptools-git==1.2
Shapely==1.8.0
simplegeneric==0.8.1
six==1.15.0
sklearn==0.0
sklearn-pandas==1.8.0
smart-open==5.2.1
snowballstemmer==2.2.0
sortedcontainers==2.4.0
SoundFile==0.10.3.post1
spacy==2.2.4
Sphinx==1.8.6
sphinxcontrib-serializinghtml==1.1.5
sphinxcontrib-websupport==1.2.4
SQLAlchemy==1.4.27
sqlparse==0.4.2
srsly==1.0.5
statsmodels==0.10.2
sympy==1.7.1
tables==3.4.4
tabulate==0.8.9
tblib==1.7.0
tensorboard==2.7.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow @ file:///tensorflow-2.7.0-cp37-cp37m-linux_x86_64.whl
tensorflow-datasets==4.0.1
tensorflow-estimator==2.7.0
tensorflow-gcs-config==2.7.0
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.22.0
tensorflow-metadata==1.4.0
tensorflow-probability==0.15.0
termcolor==1.1.0
terminado==0.12.1
testpath==0.5.0
text-unidecode==1.3
textblob==0.15.3
Theano-PyMC==1.1.2
thinc==7.4.0
threadpoolctl==3.0.0
tifffile==2021.11.2
toml==0.10.2
tomli==1.2.2
toolz==0.11.2
torch @ https://download.pytorch.org/whl/cu111/torch-1.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/cu111/torchaudio-0.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl
torchsummary==1.5.1
torchtext==0.11.0
torchvision @ https://download.pytorch.org/whl/cu111/torchvision-0.11.1%2Bcu111-cp37-cp37m-linux_x86_64.whl
tornado==5.1.1
tqdm==4.62.3
traitlets==5.1.1
tweepy==3.10.0
typeguard==2.7.1
typing-extensions==3.10.0.2
tzlocal==1.5.1
uritemplate==3.0.1
urllib3==1.24.3
vega-datasets==0.9.0
wasabi==0.8.2
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
widgetsnbextension==3.5.2
wordcloud==1.5.0
wrapt==1.13.3
xarray==0.18.2
xgboost==0.90
xkit==0.0.0
xlrd==1.1.0
xlwt==1.3.0
yellowbrick==1.3.post1
zict==2.0.0
zipp==3.6.0

Strangely, the number of audio samples in each clip is not equal to the audio frame rate, even though each clip has 30 frames, which equals the video fps (i.e. one second of video). Moreover, the audio clips are inconsistent in size. I would expect 48000 and 44100 samples respectively or, at least, some explainable consistency.
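
One possible workaround, sketched under assumptions rather than taken from torchvision: read the whole file once with read_video and slice the audio by time. The hypothetical helper `audio_sample_range` below assumes a constant frame rate and that the audio and video streams start at the same timestamp; it only computes the sample indices that correspond to a given clip.

```python
def audio_sample_range(clip_idx, clip_length_in_frames, frames_between_clips,
                       video_fps, audio_fps):
    """Return (start, end) audio sample indices covering the given video clip."""
    first_frame = clip_idx * frames_between_clips
    start_sec = first_frame / video_fps
    end_sec = (first_frame + clip_length_in_frames) / video_fps
    return round(start_sec * audio_fps), round(end_sec * audio_fps)


# Second clip (index 1) of 30-frame, non-overlapping clips at 30 fps / 48 kHz.
start, end = audio_sample_range(1, 30, 30, 30.0, 48000)
print(start, end, end - start)  # 48000 96000 48000
```

Since read_video returns audio shaped [channels, samples], the clip's audio would then be audio[:, start:end]. The end index can exceed the number of samples actually decoded (e.g. 441000 versus 440320 for the 10 s video), in which case the slice is simply shorter.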

Another, maybe related, observation: if you make clips with clip_length_in_frames=1, frames_between_clips=1, the video clips are as expected (1 frame per clip), but the audio is read in batches, similar to the torchvision.io.VideoReader API (see Clip #16 and compare its audio shape to that of the other clips):

Clip #0 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #1 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #2 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #3 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #4 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #5 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #6 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #7 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #8 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #9 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #10 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #11 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #12 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #13 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #14 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #15 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #16 Visual: torch.Size([1, 320, 560, 3]) Audio: torch.Size([1, 1024]) {'video_fps': 30.0, 'audio_fps': 48000}
...

@v-iashin
Author

Also, for some videos the number of audio samples differs by 1024 (one batch) between torchvision.io.read_video and torchvision.io.VideoReader:

import torch
import numpy as np
import torchvision
from torchvision.datasets.video_utils import read_video, VideoClips

VIDEO_PATH = '4fpkD4A_t1s_35000_45000.mp4'
visual, audio, info = read_video(VIDEO_PATH, pts_unit='sec')
print('Audio:', audio.shape, 'Visual:', visual.shape, info)
Audio: torch.Size([2, 440320]) Visual: torch.Size([300, 720, 1280, 3])  {'video_fps': 30.0, 'audio_fps': 44100}

vs

audio_reader = torchvision.io.VideoReader(VIDEO_PATH, stream='audio')
audio = []
for frame in audio_reader:
    audio.append(frame['data'])
# concatenate once, after all audio frames have been collected
audio = torch.cat(audio)
print(audio.shape)
torch.Size([441344, 2])
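
To put the gap in perspective, here is a small arithmetic sketch using only the sample counts reported above. Interpreting a batch as 1024 samples follows the wording of this comment and is not confirmed against the decoder.

```python
audio_fps = 44100
samples_read_video = 440320    # from torchvision.io.read_video
samples_video_reader = 441344  # from torchvision.io.VideoReader

diff = samples_video_reader - samples_read_video
print(diff)                     # 1024
print(diff / audio_fps * 1000)  # duration of the gap in milliseconds

# Both totals are whole multiples of 1024 samples.
print(samples_read_video // 1024, samples_video_reader // 1024)  # 430 431
```

Under that reading, read_video returns one batch fewer than VideoReader for this file, a gap of roughly 23 ms of audio.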
