Right way to open files with different encodings?

Winfried · Apr-23-2024, 08:49 AM

Hello,

Some of the files could be Windows (latin1, iso8859-1, cp1252), others could be utf-8.

Is try/except the right way to do it?

from bs4 import BeautifulSoup

#with open(file, 'r') as f:
#with open(file, 'r', encoding='utf-8') as f:
# latin1, iso8859-1, cp1252
with open(file, 'r', encoding='latin-1') as f:
    content_text = f.read()

soup = BeautifulSoup(content_text, 'html.parser')

Thank you.

**Gribouillis** · Apr-23-2024, 09:15 AM

(Apr-23-2024, 08:49 AM)Winfried Wrote: Is try/except the right way to do it?

In general, there is no reliable way to decode a file whose encoding is unknown. Specialized modules such as chardet contain tools to guess the encoding of a file. That is probably the best solution, but read the FAQ of the chardet module first.

Python is not equipped with tools to guess encodings. Attempting to decode and catching exceptions will tell you that some encodings are not the actual encoding of the file, but a successful decode does not mean that the encoding is correct, and the result can be mojibake.
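
To illustrate the point above, here is a minimal sketch using only the standard library. The helper name `decode_with_fallback` and the order of the candidate encodings are assumptions made for this example, not something from the thread. The idea is that UTF-8 is strict enough to fail on most non-UTF-8 input, so it is tried first, while latin-1 accepts every byte sequence and can only be a last resort.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes raw without error.

    Hypothetical helper: a successful decode only means the bytes are *valid*
    in that encoding, not that the encoding is the one the author used.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode the data")


# Why order matters: UTF-8 bytes decode "successfully" as latin-1,
# but the result is mojibake rather than an exception.
utf8_bytes = "Jalapeño".encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

# Strict UTF-8 first: real UTF-8 is recognised, Windows bytes fall through.
print(decode_with_fallback("pèle".encode("utf-8")))
print(decode_with_fallback("pèle".encode("latin-1")))
```

With files, the same helper could be fed the result of `open(path, 'rb').read()`. It still cannot tell cp1252 from latin-1 or any other single-byte encoding, which is where a detector such as chardet comes in.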

***snippsat*** · (This post was last modified: Apr-23-2024, 05:50 PM by snippsat.)

(Apr-23-2024, 08:49 AM)Winfried Wrote: Some of the files could be Windows (latin1, iso8859-1, cp1252), others could be utf-8.

Since these are .html files, some advice: if you are creating or saving these files yourself, there is a way with Requests and BS to always save them as utf-8.
If the files already exist then, as Gribouillis said, there is chardet.
So, for example, I have one .html file that I made latin-1 and one that is utf-8.

λ chardetect page_latin.html
page_latin.html: ISO-8859-1 with confidence 0.73

G:\div_code\html_utf
λ chardetect page_utf8.html
page_utf8.html: utf-8 with confidence 0.7525
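
The same guess is available from Python through `chardet.detect()`, which takes bytes and returns a dict with `encoding` and `confidence` keys. Below is a small sketch; the wrapper `guess_encoding` and the `min_confidence` threshold of 0.5 are assumptions for illustration, and the import is guarded because chardet is a third-party package that may not be installed.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(raw, min_confidence=0.5):
    """Return chardet's guess for raw bytes, or None if unavailable or unsure.

    Hypothetical wrapper; the threshold is an arbitrary illustrative value.
    """
    if chardet is None:
        return None
    result = chardet.detect(raw)  # e.g. {'encoding': ..., 'confidence': ...}
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return None
    return encoding


sample = "<h1>Jalapeñod je pèle</h1>".encode("utf-8")
print(guess_encoding(sample))
```

As with the command-line tool, the result is a statistical guess, so short inputs like this one may come back with low confidence or a different encoding name than expected.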

from bs4 import BeautifulSoup

with open('page_latin.html', encoding='latin-1') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)

# UTF-8 file; pass the encoding explicitly, because the default depends on the platform
with open('html_new.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)

Output:
<h1>Jalapeñod je pèle</h1>
<h1>Jalapeñod je pèle</h1>

So everything works as it should. If I take away encoding='latin-1', it breaks with a UnicodeDecodeError (when Python's default encoding is utf-8).

You can also convert to utf-8, since a conversion to Unicode happens whenever a document is loaded into Beautiful Soup:

Bs4 Doc Wrote:Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode.
Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode.

So, to convert from latin-1 to utf-8:

from bs4 import BeautifulSoup

with open('page_latin.html', 'rb') as fp, open('html_new.html', 'w', encoding='utf-8') as fp_out:
    raw_html = fp.read()
    # Beautiful Soup converts the raw bytes to Unicode when it parses them
    soup = BeautifulSoup(raw_html, 'lxml')
    fp_out.write(soup.prettify())

λ chardetect html_new.html
html_new.html: utf-8 with confidence 0.7525
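
If the source encoding is already known and the HTML does not need to be reparsed, the conversion can also be sketched with the standard library alone. The helper name `transcode` is hypothetical, and the demo writes to a temporary directory so that no real file is touched.

```python
import tempfile
from pathlib import Path


def transcode(src, dst, src_encoding, dst_encoding="utf-8"):
    """Rewrite the text file src as dst in dst_encoding (hypothetical helper)."""
    # Work on bytes so that line endings are copied through unchanged.
    # decode() raises UnicodeDecodeError if the bytes are invalid for src_encoding.
    text = Path(src).read_bytes().decode(src_encoding)
    Path(dst).write_bytes(text.encode(dst_encoding))


# Demo on a throwaway copy of the test heading.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "page_latin.html"
    dst = Path(tmp) / "html_new.html"
    src.write_bytes("<h1>Jalapeñod je pèle</h1>".encode("latin-1"))
    transcode(src, dst, "latin-1")
    converted = dst.read_bytes()

print(converted.decode("utf-8"))
```

Unlike the Beautiful Soup route, this does not guess the encoding or reformat the markup, and it leaves any charset declaration inside the HTML as it was.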

The file used in the test; both copies are the same, just saved with different encodings.

<html lang="en">
  <head>
    <title>Here is site title</title>
  </head>
  <body>
    <h1>Jalapeñod je pèle</h1>
  </body>
</html>
