Character confusion fix suggestion #3144

@EucliTs0

Environment

Hello,
We use Tesseract heavily in our platform, and we frequently run into the following issue.
For example, if the image contains the sequence "2032BA065", the output is "2032BA0O65": a spurious "O" is inserted after the "0".
This happens with other characters too, for example B -> B8 and 5 -> 5S. After some investigation and debugging, we came up with a fix that corrects all such cases (at least in our dataset).

The duplication occurs at two adjacent time steps (t, t+1). The confidence probabilities of the two confused characters are very close to each other at time steps t and t+1, whereas for non-confusable characters the confidence is close to 1.0 at each time step. Tesseract does not filter out this kind of duplication between confused characters. To fix this, let P(t) and P(t+1) denote the probabilities of the recognized characters at consecutive time steps t and t+1, respectively.

D(t+1) = P(t+1) / (P(t) + P(t+1)),
where D(t+1) is the confusion metric. If D(t+1) < threshold, we stop and ignore the confused character.
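To make the metric concrete, here is a minimal standalone sketch in C++. The names `ConfusionRatio` and `IsConfused` are hypothetical and not part of Tesseract, and the zero-sum guard is an added assumption. Note that in the suggested snippet both probabilities are read from the same `outputs` array, so the two arguments are best thought of as the scores of the previous and current codes taken from one output vector.

```cpp
// Hypothetical helper, not part of Tesseract: computes
// D = p_curr / (p_prev + p_curr) for two probabilities.
inline float ConfusionRatio(float p_prev, float p_curr) {
  const float sum = p_prev + p_curr;
  // Assumption: if both probabilities are zero, treat the pair as
  // "not confused" instead of dividing by zero.
  if (sum <= 0.0f) return 1.0f;
  return p_curr / sum;
}

// Returns true when the current character looks like a duplicate of the
// previous one, i.e. the ratio falls below the threshold.
inline bool IsConfused(float p_prev, float p_curr, float threshold = 0.88f) {
  return ConfusionRatio(p_prev, p_curr) < threshold;
}
```

With a threshold of 0.88, the current character is kept only if its probability is at least 0.88 / 0.12, roughly 7.3 times, that of the previous one. For instance, `IsConfused(0.50f, 0.45f)` gives a ratio of about 0.47 and flags the character, while `IsConfused(0.01f, 0.99f)` gives 0.99 and keeps it.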

In src/lstm/recodebeam.cpp, between lines 907 and 908, we add:

Suggested Fix:

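      // Only compare two positive codes; the hard-coded code 139 is skipped.
      if (prev != nullptr && code > 0 && code != 139 && prev->code != 139 && prev->code > 0) {
        // Combined probability of the current and the previous code.
        const float sum_proba_prev_current = outputs[code] + outputs[prev->code];

        // Confusion metric: D = P(current) / (P(previous) + P(current)).
        const float ratio_scores = outputs[code] / sum_proba_prev_current;
        // Stop here if the current code does not clearly dominate the previous one.
        if (ratio_scores < 0.88f) break;
      }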

The threshold of 0.88 was set experimentally, but I hope this helps address the issue in future versions and generalizes well.
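To illustrate the effect of the check outside of Tesseract, here is a toy greedy decoder. It is a simplified sketch and not the beam search in recodebeam.cpp. The function name `DecodeWithConfusionFilter`, the layout `outputs[t][c]` (probability of code c at time step t), and the convention that code 0 means "no character" are all assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Toy greedy decoder, NOT Tesseract's beam search (hypothetical example).
// outputs[t][c] is the probability of code c at time step t.
// Assumption: code 0 means "no character".
std::vector<int> DecodeWithConfusionFilter(
    const std::vector<std::vector<float>>& outputs, float threshold = 0.88f) {
  std::vector<int> result;
  int prev = 0;  // code emitted at the previous step, 0 if none
  for (const std::vector<float>& step : outputs) {
    if (step.empty()) continue;
    // Pick the most probable code at this time step.
    int code = 0;
    for (std::size_t c = 1; c < step.size(); ++c) {
      if (step[c] > step[code]) code = static_cast<int>(c);
    }
    if (code == 0) {  // null step: forget the previous code
      prev = 0;
      continue;
    }
    if (code == prev) continue;  // same code repeated on adjacent steps
    if (prev > 0) {
      // Same ratio as in the suggested fix, read from one output vector.
      const float sum = step[code] + step[prev];
      if (sum > 0.0f && step[code] / sum < threshold) continue;
    }
    result.push_back(code);
    prev = code;
  }
  return result;
}
```

In this toy setting, a step where the best code scores 0.50 and the previously emitted code still scores 0.45 yields a ratio of about 0.53, so the second code is dropped. Passing a threshold of 0 disables the filter, which makes it easy to compare outputs with and without the check.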

Unfortunately, I cannot provide any sample documents because we work with sensitive data.

Thank you.
