Python Forum
Unidecode issue


Unidecode issue - DPaul - Sep-02-2023

Hi,
In some PDFs I encounter references to the original parish register, like so: ref = ' RP 477; p. 148 r° '
I run unidecode on all strings in the document: fieldUni = unidecode.unidecode(field).upper()

This has never caused any problems, except in the above case, where I get this: ' RP 477; P. 148 RDEG '

The " ° " has been "translated" into DEG. That is not what is meant here.

How do I avoid this translation in Python (other than a manual Ctrl-H replace of '°' with ... etc. in the text document)?
thx,
Paul


RE: Unidecode issue - Gribouillis - Sep-02-2023

(Sep-02-2023, 06:42 AM)DPaul Wrote: How do I avoid this translation in Python (other than a manual Ctrl-H replace of '°' with ... etc. in the text document)?
Which translation do you want instead of replacing '°' with 'deg'?


RE: Unidecode issue - DPaul - Sep-03-2023

(Sep-02-2023, 08:45 AM)Gribouillis Wrote: Which translation do you want instead
Fair question.
Let me do some research, because I have to find out whether the 'degrees' symbol
was meant to be there and has some genealogical meaning,
or whether it is a faulty conversion of something earlier, if the original text was e.g. in Access or Lotus 1-2-3.
Paul


RE: Unidecode issue - DPaul - Sep-03-2023

(Sep-02-2023, 08:45 AM)Gribouillis Wrote: Which translation do you want instead
OK, there is a hidden meaning, known only to genealogists I suppose.
148 is the folio number,
r° is recto, and
v° means verso.
So recto and verso would be the right translations.
I have checked the document, and indeed, some records are r°, others v°.
Paul


RE: Unidecode issue - Gribouillis - Sep-03-2023

Use re.sub(), for example:
>>> import re
>>> dic = {'r°': 'recto', 'v°': 'verso'}
>>> def repl(match):
...     return dic[match.group(0)]
...
>>> s = ' RP 477; p. 148 r° '
>>> re.sub('[rv]°', repl, s)
' RP 477; p. 148 recto '
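
For use in the pipeline from the first post, the substitution has to run before unidecode, because afterwards the '°' is already 'deg'. Below is a minimal sketch of a slightly stricter variant. The names FOLIO_SIDES, FOLIO_MARK and expand_folio_marks are made up for illustration. It also assumes that some PDFs may use the masculine ordinal indicator 'º' instead of the degree sign, and that an r or v directly after another letter is not a folio mark.

```python
import re

# Hypothetical mapping; "recto" and "verso" are the expansions named in the thread.
FOLIO_SIDES = {"r": "recto", "v": "verso"}

# r or v followed by a degree sign (U+00B0) or, as an assumption, a masculine
# ordinal indicator (U+00BA). The lookbehind rejects an r/v that directly
# follows another letter, so only a standalone r/v counts as a folio mark.
FOLIO_MARK = re.compile(r"(?<![^\W\d_])([rv])[°º]", re.IGNORECASE)


def expand_folio_marks(text):
    """Replace r°/v° with recto/verso and leave other degree signs alone."""
    return FOLIO_MARK.sub(lambda m: FOLIO_SIDES[m.group(1).lower()], text)


ref = " RP 477; p. 148 r° "
expanded = expand_folio_marks(ref)
print(repr(expanded))  # ' RP 477; p. 148 recto '
```

The expanded string can then go through the existing step unchanged, i.e. unidecode.unidecode(expanded).upper(). A plain temperature such as '20°' is not touched by this function and would still come out of unidecode as 'deg'.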



RE: Unidecode issue - DPaul - Sep-04-2023

(Sep-03-2023, 06:20 PM)Gribouillis Wrote: Use re.sub() for example
I thought I had to fiddle around with unidecode parameters,
but this is nice and concise.
Thanks again,
Paul