Replies: 7 comments 2 replies
-
Just to make sure about handwriting abilities, I ran that separately:
-
Another update: they now support 4-bit quantization, and it appears to run well on 11 GB VRAM. So CLIP doesn't seem to suffer too badly from quantization.
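For anyone who wants to try the 4-bit path, here is a minimal sketch using Hugging Face transformers with bitsandbytes; the checkpoint name THUDM/cogvlm-chat-hf and the Vicuna tokenizer source are assumptions on my part, not necessarily the exact setup referenced above:

```python
# Hedged sketch: load CogVLM with 4-bit weights via bitsandbytes.
# Checkpoint and tokenizer names are assumptions, not the thread's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,                    # NF4 weights; reportedly ~11 GB VRAM
    bnb_4bit_compute_dtype=torch.float16, # matmuls still run in fp16
)

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    quantization_config=quant_cfg,
    torch_dtype=torch.float16,
    trust_remote_code=True,               # CogVLM ships custom modeling code
    device_map="auto",
)
```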
-
11 GB of RAM for GPT4V performance??? 👀 👀 This is insane!
-
But this is not down to its method; it's the model itself - the data and the compute. And Qwen-VL can do almost the same on OCR.
-
And also see https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX if you want the fancy big boys.
-
This type of research makes me hopeful that at some point down the line we'll be able to swap out the language models with ease -- just like restarting a server with a different model, with the vision tower dynamically configured and attached to it. Do you guys think this type of framework will ever be feasible?
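Not an existing framework, just a conceptual sketch of why it seems plausible: in the llava-style recipe the only glue between the vision tower and the LLM is a small projector sized to the LLM's hidden dimension, so "attaching the tower to a different model" mostly means rebuilding that projector. All class names and dimensions below are illustrative.

```python
# Conceptual sketch only (no such framework exists yet); dimensions are illustrative.
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Maps vision-tower patch features into a given LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_hidden_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)

# "Restarting with a different model" would mean rebuilding the adapter for the
# new LLM's hidden size and re-attaching the same tower:
clip_features = torch.randn(1, 256, 1664)   # e.g. a ViT-bigG-14 tower
adapter_7b = VisionAdapter(1664, 4096)      # Llama-7B hidden size
adapter_13b = VisionAdapter(1664, 5120)     # Llama-13B hidden size
image_tokens = adapter_7b(clip_features)    # ready to prepend to the text embeddings
```

The catch is that the projector (and usually some of the LLM) has to be retrained for every new language model, which is why this isn't plug-and-play today.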
-
I've been analyzing different alternatives and this model is impressive. Much better than llava, but it could be hard to adapt.
Edit: source is https://twitter.com/skalskip92/status/1727670857676271911
-
I've just seen CogVLM, which is a Vicuna 7B language model behind a 9B vision tower (laion/CLIP-ViT-bigG-14-laion2B-39B-b160k) under an open-source license.
I've compared it with llava-1.5 (not even comparable) and Qwen-VL, and it beats Qwen-VL by a margin in OCR abilities and detection of details, with no or almost no hallucinations.
It understands handwritten as well as typed letters, context, fine details, and background graphics.
It can also locate tiny visual targets with pixel coordinates.
I'm quite blown away that I didn't know about it before.
I believe this is what we need. It has similarities to llava but adds an additional expert model (sketched below the links), so it's not super quick to implement.
In addition, the ViT needs K-type quantization support.
Definitely worth a close look.
URL: https://github.com/THUDM/CogVLM
Webdemo: http://36.103.203.44:7861/
Paper: https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf
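For context on the "additional expert" bit: as I read the paper, each transformer layer routes image positions through their own trainable QKV/FFN weights, while text positions keep the frozen LLM weights. A rough, simplified sketch (names and shapes are illustrative, not from the released code):

```python
# Rough sketch of CogVLM's "visual expert" idea (simplified: one projection shown;
# the paper duplicates the QKV projections and the FFN in every layer).
import torch
import torch.nn as nn

class ExpertLinear(nn.Module):
    """Routes image positions through expert weights, text through the LLM's own."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(hidden_dim, hidden_dim)   # frozen LLM weights
        self.image_proj = nn.Linear(hidden_dim, hidden_dim)  # trainable visual expert

    def forward(self, hidden: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim); image_mask: (batch, seq) bool, True at image tokens
        return torch.where(
            image_mask.unsqueeze(-1),
            self.image_proj(hidden),
            self.text_proj(hidden),
        )
```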
Look at this example: I asked for a JSON representation - not cherry-picked; it can actually extract all of the content with minimal errors:

Here is what QWEN-VL does:
Here is llava1.5-13B:
Here is GPT4-Vision:
I've not yet looked into architectural challenges, but this is literally a game changer.
That's seriously good OCR, and its image detection abilities are far beyond anything I've seen from llava 1.5/ShareGPT4V.
@monatis @FSSRepo
Update:
Added OpenAI GPT4-Vision output for comparison. It is about the same, also with a mistake. I'd put CogVLM on the same level as GPT4-Vision based on this result. It's mostly a matter of prompting at this point (and GPT4 is still leading on the language side).