Description
Environment
- Tesseract Version: 4.1.1
- Platform: 4.15.0-122-generic #124-Ubuntu SMP
Hello,
We use Tesseract heavily in our platform, and the issue we run into most often is the following:
For example, if we had a sequence "2032BA065" in the image, then we would get as output: "2032BA0O65".
This also happens with other characters, for example B -> B8 and 5 -> 5S. After some investigation and debugging, we came up with a fix that corrects all such cases (at least in our dataset).
The duplication happens at two consecutive time steps (t, t+1) on the affected characters. Their confidence probabilities are very close to each other at time steps t and t+1, whereas for unambiguous characters the confidence is close to 1.0 at each time step. Unfortunately, Tesseract doesn't filter out this kind of duplication between confused characters. To fix this, let P(t) and P(t+1) be the probabilities of the recognized characters at consecutive time steps t and t+1, respectively. We define
D(t+1) = P(t+1) / (P(t) + P(t+1)),
where D(t+1) is the confusion metric. If D(t+1) < threshold, we stop and ignore the confused character.
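As a small illustration of the metric, here is a minimal standalone C++ sketch that computes D(t+1) as the share of P(t+1) in the sum P(t) + P(t+1). The names `ConfusionRatio` and `IsConfused` are hypothetical (they are not part of Tesseract), and returning 1.0 when both probabilities are zero is an assumption made only to avoid dividing by zero.

```cpp
#include <cassert>

// Confusion metric D(t+1) = P(t+1) / (P(t) + P(t+1)).
// Hypothetical helper for illustration, not a Tesseract function.
float ConfusionRatio(float p_prev, float p_curr) {
  assert(p_prev >= 0.0f && p_curr >= 0.0f);
  const float sum = p_prev + p_curr;
  // Assumption: with no probability mass at all, treat the pair as unconfused.
  if (sum <= 0.0f) return 1.0f;
  return p_curr / sum;
}

// True when the character at t+1 should be ignored as a confused duplicate.
bool IsConfused(float p_prev, float p_curr, float threshold = 0.88f) {
  return ConfusionRatio(p_prev, p_curr) < threshold;
}
```

For instance, two tied probabilities of 0.5 and 0.5 give a ratio of 0.5, which is below 0.88, so the second character would be ignored. Probabilities of 0.01 and 0.99 give a ratio of about 0.99, so the character would be kept.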
In src/lstm/recodebeam.cpp, between lines 907 and 908, we add:
Suggested Fix:
if (prev != nullptr && code > 0 && code != 139 && prev->code != 139 && prev->code > 0) {
  // Combined score of the current code and the previous code.
  const float sum_proba_prev_current = outputs[code] + outputs[prev->code];
  // Share of the current code's score in that combined score.
  const float ratio_scores = outputs[code] / sum_proba_prev_current;
  // Below the threshold, the two codes are treated as confused, so stop here.
  if (ratio_scores < 0.88f) break;
}
The threshold 0.88 was set experimentally, but I hope this can help address the issue in upcoming versions and that it generalizes well.
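To illustrate the effect of the threshold, here is a minimal standalone sketch of the same test applied to a toy greedy decode. It is not Tesseract's beam search: the alphabet, the blank code at index 0, the function names, and all scores are hypothetical values chosen for illustration. As in the snippet above, both scores are read from the output vector of the current step.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy alphabet; index 0 plays the role of the blank/null code.
const std::string kAlphabet = "_0O6";
constexpr std::size_t kBlank = 0;

// Greedy decode over per-step score vectors with the confusion filter applied.
// Repeated codes are collapsed, as in plain CTC greedy decoding.
std::string GreedyDecodeFiltered(const std::vector<std::vector<float>>& steps,
                                 float threshold) {
  std::string text;
  std::size_t prev = kBlank;  // code chosen at the previous step
  for (const std::vector<float>& outputs : steps) {
    // Pick the best-scoring code at this step.
    std::size_t code = 0;
    for (std::size_t i = 1; i < outputs.size(); ++i) {
      if (outputs[i] > outputs[code]) code = i;
    }
    bool keep = code != kBlank && code != prev;
    if (keep && prev != kBlank) {
      // Same ratio as the suggested fix, using the current step's scores.
      const float ratio = outputs[code] / (outputs[code] + outputs[prev]);
      if (ratio < threshold) keep = false;
    }
    if (keep) text += kAlphabet[code];
    prev = code;
  }
  return text;
}

// Illustrative scores only: a clear '0', a near tie between '0' and 'O',
// a blank, then a clear '6'.
std::vector<std::vector<float>> ExampleSteps() {
  return {
      {0.01f, 0.97f, 0.01f, 0.01f},
      {0.03f, 0.45f, 0.51f, 0.01f},
      {0.97f, 0.01f, 0.01f, 0.01f},
      {0.01f, 0.01f, 0.01f, 0.97f},
  };
}
```

With these made-up scores, `GreedyDecodeFiltered(ExampleSteps(), 0.88f)` returns "06", while a threshold of 0.0f (filter effectively disabled) returns "0O6", which mirrors the 0 -> 0O duplication described above.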
Unfortunately, I cannot provide any documents because we work on sensitive data.
Thank you.