Conversation

@jxiong22
Contributor

This PR adds a Hugging Face version of the model, clip-vit-base-patch32, with separate text and image encoders. It also fixes precision errors in the image encoder.
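Roughly, "separate encoders" means loading the two towers as independent modules so each can be traced and converted on its own. A minimal sketch (the model id and class choices here are my shorthand, not necessarily the notebook's exact code):

```python
from transformers import CLIPTextModelWithProjection, CLIPVisionModelWithProjection

MODEL_ID = "openai/clip-vit-base-patch32"

# Load the two towers as separate modules so each can be traced and
# converted to Core ML independently.
text_encoder = CLIPTextModelWithProjection.from_pretrained(MODEL_ID).eval()
image_encoder = CLIPVisionModelWithProjection.from_pretrained(MODEL_ID).eval()
```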

@mazzzystar
Owner

@jxiong22
Thanks for your great work! I'll check out the model & precision ASAP :)

@mazzzystar
Owner

I've run your scripts, and this is what I've found:

Text Encoder

max_seq_length=76
PyTorch TextEncoder ckpt out for "a photo of a cat":
>>> tensor([ 0.1555,  0.0733, -0.2448, -0.2212, -0.1934,  0.2052, -0.3175, -0.7824,
        -0.1816,  0.1943], grad_fn=<SliceBackward0>)

CoreML TextEncoder ckpt out for "a photo of a cat":
>>> [ 0.15597123  0.07306742 -0.24520203 -0.22147842 -0.1928216   0.20461065
 -0.31734398 -0.7811562  -0.1815224   0.19413628]

max_seq_length=77
PyTorch TextEncoder ckpt out for "a photo of a cat":
>>> tensor([ 0.1555,  0.0733, -0.2448, -0.2212, -0.1934,  0.2052, -0.3175, -0.7824,
        -0.1816,  0.1943], grad_fn=<SliceBackward0>)

CoreML TextEncoder ckpt out for "a photo of a cat":
>>> [ 0.15597123  0.07306742 -0.24520203 -0.22147842 -0.1928216   0.20461065
 -0.31734398 -0.7811562  -0.1815224   0.19413628]

The output appears to be the same whether max_seq_length is set to 76 or 77.
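For reference, here is roughly how I convert and compare (a sketch rather than the exact script; the wrapper and the output-key lookup are assumptions, and the predict step needs macOS):

```python
import numpy as np
import torch
import coremltools as ct
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

MODEL_ID = "openai/clip-vit-base-patch32"
SEQ_LEN = 77  # the value under test: 76 or 77

tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID)
text_model = CLIPTextModelWithProjection.from_pretrained(MODEL_ID).eval()

class TextWrapper(torch.nn.Module):
    """Return a plain tensor so torch.jit.trace has a single output."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids):
        return self.model(input_ids=input_ids).text_embeds

ids = tokenizer("a photo of a cat", padding="max_length",
                max_length=SEQ_LEN, return_tensors="pt").input_ids

wrapper = TextWrapper(text_model)
traced = torch.jit.trace(wrapper, ids)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=(1, SEQ_LEN), dtype=np.int32)],
)

pt_out = wrapper(ids).detach().numpy()
# Output name is assigned by the converter, so grab the first value.
ml_out = np.asarray(list(mlmodel.predict(
    {"input_ids": ids.numpy().astype(np.int32)}).values())[0])
print(pt_out[0, :10])  # first 10 dims, as quoted above
print(ml_out[0, :10])
print("avg L1:", np.abs(pt_out - ml_out).mean())
```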

Image Encoder

I've compared my original hard-coded norm against transforms.Normalize(), and the latter performs far better on L1 distance :)

Using hard-coded norm:

PyTorch ImageEncoder ckpt out for IMG_2115.jpg:
>>> tensor([ 0.2361, -0.0980, -0.0022,  0.2364,  0.1279, -0.1041,  0.3530,  0.0853,
        -0.0293,  0.0784], grad_fn=<SliceBackward0>)

CoreML ImageEncoder ckpt out for IMG_2115.jpg:
>>> [ 0.350561    0.025289   -0.13452446  0.1267291  -0.1897895  -0.14739564
  0.05819088  0.30193368 -0.2142085   0.27992135]

The average L1 distance is 0.173860315.

Using transforms.Normalize():

PyTorch ImageEncoder ckpt out for IMG_2115.jpg:
>>> tensor([ 0.3709,  0.0213, -0.0549,  0.1092, -0.2229, -0.2131,  0.0909,  0.2031,
        -0.2159,  0.2603], grad_fn=<SliceBackward0>)

CoreML ImageEncoder ckpt out for IMG_2115.jpg:
>>> [ 0.34969723  0.02547197 -0.134884    0.12836841 -0.19035722 -0.14880568
  0.05857127  0.30046463 -0.21491489  0.2801294 ]

The average L1 distance is 0.037187212.
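For the record, the transforms.Normalize() variant is essentially the standard CLIP preprocessing (the mean/std constants come from the original CLIP repo); a sketch of it, plus the L1 metric quoted above, is below. The exact resize/crop settings are assumptions:

```python
import numpy as np
from PIL import Image
from torchvision import transforms

# Standard CLIP normalization constants.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),                       # uint8 HWC -> float CHW in [0, 1]
    transforms.Normalize(CLIP_MEAN, CLIP_STD),   # the variant that wins above
])

pixels = preprocess(Image.open("IMG_2115.jpg").convert("RGB")).unsqueeze(0)

def avg_l1(a, b):
    """The precision metric quoted above."""
    return np.abs(np.asarray(a) - np.asarray(b)).mean()
```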

Thanks a lot for the PR! This will help improve the current image encoder's performance. As for the text encoder, the precision is already okay, so I don't think it's a big issue.

Could you verify how max_seq_length affects precision? Once we get the same result, I'll merge this PR :)

@jxiong22
Contributor Author

That's weird. On my end, I can't figure out why I get bad results when I set max_seq_length to 77. If 77 works well for you, that's even better. What versions of Python, coremltools, and torch are you using?

max_seq_length = 76:
PyTorch TextEncoder ckpt out for "a photo of a cat":
>>> tensor([ 0.1555,  0.0733, -0.2448, -0.2212, -0.1934,  0.2052, -0.3175, -0.7824,
        -0.1816,  0.1943], grad_fn=<SliceBackward0>)

CoreML TextEncoder ckpt out for "a photo of a cat":
>>> [ 0.15560171  0.0732335  -0.24512495 -0.22117633 -0.19336982  0.20523793
 -0.3182205  -0.78206545 -0.18144566  0.19457956]

max_seq_length = 77:
PyTorch TextEncoder ckpt out for "a photo of a cat":
>>> tensor([ 0.1555,  0.0733, -0.2448, -0.2212, -0.1934,  0.2052, -0.3175, -0.7824,
        -0.1816,  0.1943], grad_fn=<SliceBackward0>)

CoreML TextEncoder ckpt out for "a photo of a cat":
>>> [-0.066312    0.17878246  0.40718645 -0.08806399  0.26841    -0.22685118
  0.2679821  -1.7103907  -0.33836532  0.28941655]
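
In case it helps reproduce, this is the kind of sanity check I run first (model id and padding settings are assumptions): both backends must see identical input_ids, padded to exactly the length the model was converted with.

```python
from transformers import CLIPTokenizer

# Confirm the tokenizer pads to the same length the CoreML model was
# traced with (76 vs 77), so both backends receive identical input_ids.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
batch = tokenizer("a photo of a cat", padding="max_length",
                  max_length=77, return_tensors="np")
print(batch["input_ids"].shape)  # expect (1, 77)
```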

@mazzzystar
Owner

My environment is:

Python 3.9.15
accelerate 0.23.0
transformers 4.33.2
coremltools 7.0 (latest)
torch 1.13.1

@jxiong22
Contributor Author

I recreated your environment, but there was no improvement; the results are still unsatisfactory with max_seq_length = 77 on my side.

@mazzzystar
Owner

Okay, I'll try to verify this issue and report back here if there's any progress.

@jxiong22
Contributor Author

Thank you. If 77 works for you, how about setting it as the default and adding a comment for future reference, in case others run into the same issue I did? Something like the sketch below.
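(The variable name here is hypothetical, not necessarily what the notebook uses:)

```python
# Default to CLIP's native context length; if you see wrong CoreML
# outputs at 77 (as in this thread), 76 worked as a fallback for me.
MAX_SEQ_LENGTH = 77
```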

@mazzzystar
Owner

Is it okay if I merge this PR now and add a note in the README, so others can refer to your notebook for export?

@jxiong22
Contributor Author

jxiong22 commented Sep 22, 2023 via email

@mazzzystar mazzzystar merged commit 0cc1953 into mazzzystar:main Sep 22, 2023