Replies: 7 comments 2 replies
-
Just to make sure about handwriting abilities, I ran that separately:
-
Another update: they now support 4-bit quantization, and it appears to run well on 11 GB VRAM. So CLIP doesn't seem to suffer too badly from quantization.
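For anyone who wants to try the 4-bit path, here is a minimal sketch using Hugging Face transformers with bitsandbytes; the checkpoint name THUDM/cogvlm-chat-hf and the Vicuna tokenizer source are assumptions on my part, not necessarily the exact setup referenced above:

```python
# Hedged sketch: load CogVLM with 4-bit weights via bitsandbytes.
# Checkpoint and tokenizer names are assumptions, not the thread's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,                    # NF4 weights; reportedly ~11 GB VRAM
    bnb_4bit_compute_dtype=torch.float16, # matmuls still run in fp16
)

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    quantization_config=quant_cfg,
    torch_dtype=torch.float16,
    trust_remote_code=True,               # CogVLM ships custom modeling code
    device_map="auto",
)
```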
-
11 GB of RAM for GPT4V performance??? 👀 👀 This is insane!
-
But this is not down to its method; it's the model itself - the data and the compute. And Qwen-VL can do almost the same on OCR.
-
And also see https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX if you want the fancy big boys.
-
This type of research makes me hopeful that at some point down the line we'll be able to swap out the language models with ease -- just like restarting a server with a different model, with the vision tower dynamically configured and attached to it. Do you guys think this type of framework will ever be feasible?
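Not an existing framework, just a conceptual sketch of why it seems plausible: in the llava-style recipe the only glue between the vision tower and the LLM is a small projector sized to the LLM's hidden dimension, so "attaching the tower to a different model" mostly means rebuilding that projector. All class names and dimensions below are illustrative.

```python
# Conceptual sketch only (no such framework exists yet); dimensions are illustrative.
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Maps vision-tower patch features into a given LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_hidden_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)

# "Restarting with a different model" would mean rebuilding the adapter for the
# new LLM's hidden size and re-attaching the same tower:
clip_features = torch.randn(1, 256, 1664)   # e.g. a ViT-bigG-14 tower
adapter_7b = VisionAdapter(1664, 4096)      # Llama-7B hidden size
adapter_13b = VisionAdapter(1664, 5120)     # Llama-13B hidden size
image_tokens = adapter_7b(clip_features)    # ready to prepend to the text embeddings
```

The catch is that the projector (and usually some of the LLM) has to be retrained for every new language model, which is why this isn't plug-and-play today.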
-
I've been analyzing different alternatives and this model is impressive. Much better than llava, but it could be hard to adapt.
Edit: source is https://twitter.com/skalskip92/status/1727670857676271911
-
I've just seen CogVLM, which is a Vicuna 7B language model behind a 9B vision tower (laion/CLIP-ViT-bigG-14-laion2B-39B-b160k) under an open-source license.
I've compared it with llava-1.5 (not even comparable) and Qwen-VL, and it beats Qwen-VL by a margin in OCR abilities and detection of details, with no or almost no hallucinations.
It understands handwritten as well as typed letters, context, fine details, and background graphics.
It can also locate tiny visual targets with pixel coordinates.
I'm quite blown away that I didn't know about it before.
I believe this is what we need. It has similarities to llava but adds an additional expert model (sketched below the links), so it's not super quick to implement.
In addition, the ViT needs K-type quantization support.
Definitely worth a close look.
URL: https://github.com/THUDM/CogVLM
Webdemo: http://36.103.203.44:7861/
Paper: https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf
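For context on the "additional expert" bit: as I read the paper, each transformer layer routes image positions through their own trainable QKV/FFN weights, while text positions keep the frozen LLM weights. A rough, simplified sketch (names and shapes are illustrative, not from the released code):

```python
# Rough sketch of CogVLM's "visual expert" idea (simplified: one projection shown;
# the paper duplicates the QKV projections and the FFN in every layer).
import torch
import torch.nn as nn

class ExpertLinear(nn.Module):
    """Routes image positions through expert weights, text through the LLM's own."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(hidden_dim, hidden_dim)   # frozen LLM weights
        self.image_proj = nn.Linear(hidden_dim, hidden_dim)  # trainable visual expert

    def forward(self, hidden: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim); image_mask: (batch, seq) bool, True at image tokens
        return torch.where(
            image_mask.unsqueeze(-1),
            self.image_proj(hidden),
            self.text_proj(hidden),
        )
```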
Look at this example: I asked for a JSON representation - not cherry-picked; it can actually extract all of the content with minimal errors:

Here is what QWEN-VL does:
Here is llava1.5-13B:
Here is GPT4-Vision:
I've not yet looked into architectural challenges, but this is literally a game changer.
That's seriously good OCR, and its image detection abilities are far beyond anything I've seen from llava 1.5/ShareGPT4V.
@monatis @FSSRepo
Update:
Added OpenAI GPT4-Vision output for comparison. It is about the same, also with a mistake. I'd put CogVLM on the same level as GPT4-Vision based on this result. It's mostly a matter of prompting at this point (and GPT4 is still leading on the language side).