Python Forum
OCR question - Printable Version


OCR question - DPaul - Mar-29-2024

OCR with tesseract does a very good job, we know that.
I use it to process various types of documents; some of them are just lists of people.
About 100 years ago, people started to use typewriters, and did
not always replace the ribbon in time, or used carbon copies ("cc"), resulting in very faint text.
So tesseract, if it can't decipher what's there, comes up with random sequences of letters, like:
"... GGZ|OSEPH|SSSSSSSSF|MFIAFIFIAFIFDE|ADRUARN|IFIIFIA|FFLF|WFFI|ZFFIJFIA ..."
The pipes are things I put between detected words.
Can anybody think of a clever way to reject these words?
We're talking hundreds of thousands of lines, and some of them contain these "random" sequences.
One partial solution I thought of was to detect e.g. groups of 3 identical letters in a row ... that happens sometimes.
Any Python module that I've never heard of, maybe? An "anti-gibberish" module?
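To make the triple-letter idea concrete, here is a rough sketch of what I mean. The function names (has_triple_letter, split_suspects) are just made up for illustration, and it assumes the words are already separated by pipes as in the sample above. It only catches some of the junk, so it is a partial filter at best:

```python
import re

# Three or more of the same letter in a row, e.g. "SSS".
# Case-sensitive on purpose: the OCR output here is all upper case.
TRIPLE = re.compile(r"([A-Za-z])\1{2,}")


def has_triple_letter(word: str) -> bool:
    """Return True if the word contains a run of 3+ identical letters."""
    return TRIPLE.search(word) is not None


def split_suspects(line: str, sep: str = "|"):
    """Split a pipe-separated line into (kept, rejected) word lists."""
    words = [w.strip() for w in line.split(sep) if w.strip()]
    kept = [w for w in words if not has_triple_letter(w)]
    rejected = [w for w in words if has_triple_letter(w)]
    return kept, rejected


sample = "GGZ|OSEPH|SSSSSSSSF|MFIAFIFIAFIFDE|ADRUARN|IFIIFIA|FFLF|WFFI|ZFFIJFIA"
kept, rejected = split_suspects(sample)
print("rejected:", rejected)
print("kept:", kept)
```

On the sample line this only rejects one word, and things like MFIAFIFIAFIFDE slip through, which is why I call it partial.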
thx,
Paul