Character confusion fix suggestion #3144

### Environment

* **Tesseract Version**: 4.1.1
* **Platform**: 4.15.0-122-generic #124-Ubuntu SMP


Hello,
We use Tesseract heavily in our platform, and the issue we hit most often is a spurious extra character in the output.
For example, if the image contains the sequence "2032BA065", the output is "2032BA0O65".
The same happens with other characters, for example B -> B8 and 5 -> 5S. After some investigation and debugging, we came up with a fix that corrects all such cases, at least in our dataset.

The confusion occurs at two adjacent time steps (t, t+1) on the same character. The confidence probabilities of the competing characters are very close to each other at time steps t and t+1, whereas for unambiguous characters the confidence is close to 1.0 at each time step. Unfortunately, Tesseract doesn't filter out this kind of duplication between confused characters. To fix this, let P(t) and P(t+1) be the probabilities of the recognized characters at consecutive time steps t and t+1, respectively.

D(t+1) = P(t+1) / (P(t) + P(t+1)),
where D(t+1) is the confusion metric: if D(t+1) < threshold, we stop and ignore the confused character.
 
In **src/lstm/recodebeam.cpp**, between lines 907 and 908, we add:

### Suggested Fix:
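```
// Only compare two valid, positive codes; code 139 is excluded.
if (prev != nullptr && code > 0 && code != 139 && prev->code > 0 && prev->code != 139) {
  // Combined score of the current code and the previous code.
  const float sum_proba_prev_current = outputs[code] + outputs[prev->code];

  // Share of the current code in that combined score (the metric D).
  const float ratio_scores = outputs[code] / sum_proba_prev_current;
  // Stop when the current code is not clearly dominant.
  if (ratio_scores < 0.88f) break;
}
```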
The threshold of 0.88 was set experimentally on our data. I hope this helps address the issue in future versions and that it generalizes well.
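To make the effect of the threshold and of the `break` easier to follow, here is a toy sketch in standalone C++. It mirrors the guard above but is not Tesseract's actual beam search code: the function `KeepUntilConfused`, its parameters, and the candidate list are all hypothetical. It also assumes that every code is a valid index into `outputs`, and it adds a zero-sum guard that the snippet above does not have.

```cpp
#include <vector>

// Hypothetical toy loop, not Tesseract code.
// Walks the candidate codes in order and stops at the first candidate
// whose share of the combined score falls below the threshold.
std::vector<int> KeepUntilConfused(const std::vector<float>& outputs,
                                   int prev_code,
                                   const std::vector<int>& candidates,
                                   int excluded_code, float threshold) {
  std::vector<int> kept;
  for (int code : candidates) {
    // Same conditions as the guard above: both codes positive and
    // neither equal to the excluded code.
    const bool comparable = prev_code > 0 && prev_code != excluded_code &&
                            code > 0 && code != excluded_code;
    if (comparable) {
      const float sum = outputs[code] + outputs[prev_code];
      // Assumption of this sketch: skip the check when the sum is zero.
      if (sum > 0.0f && outputs[code] / sum < threshold) break;
    }
    kept.push_back(code);
  }
  return kept;
}
```

For example, with `outputs = {0.0, 0.97, 0.02, 0.01}`, `prev_code = 2`, and candidates `{1, 3}`, code 1 is kept because its ratio is about 0.98, and the loop stops at code 3. A lower threshold keeps more candidates, while a higher one prunes more aggressively.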

Unfortunately, I cannot provide any sample documents because we work with sensitive data.

Thank you.


