Python Forum
Right way to open files with different encodings?
#3
(Apr-23-2024, 08:49 AM)Winfried Wrote: Some of the files could be Windows (latin1, iso8859-1, cp1252), others could be utf-8.
Since these are .html files, some advice: if you are creating or saving these files yourself, there is a way to always save them as utf-8 when using Requests and BS.
If the files already exist then, as mentioned by Gribouillis, there is chardet.
So, for example, say I have one .html file that I saved as latin-1 and one in utf-8.
λ chardetect page_latin.html
page_latin.html: ISO-8859-1 with confidence 0.73

G:\div_code\html_utf
λ chardetect page_utf8.html
page_utf8.html: utf-8 with confidence 0.7525
from bs4 import BeautifulSoup

with open('page_latin.html', encoding='latin-1') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)

# Pass encoding='utf-8' explicitly; the default depends on platform and locale
with open('html_new.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)
Output:
<h1>Jalapeñod je pèle</h1>
<h1>Jalapeñod je pèle</h1>
So everything works as it should. If I take away encoding='latin-1', it breaks with a UnicodeDecodeError.
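That UnicodeDecodeError can also be put to use. As a rough sketch (not something from this thread's files, and `read_html_text` is a made-up helper name), you can try strict utf-8 first and only fall back to a legacy encoding when decoding fails. The cp1252 fallback is an assumption based on the files being from Windows; text with accented latin-1 characters is very unlikely to be valid utf-8 by accident.

```python
from pathlib import Path


def read_html_text(path, fallback="cp1252"):
    """Return (text, encoding_used) for a file that is utf-8 or a legacy 8-bit encoding."""
    raw = Path(path).read_bytes()
    try:
        # Strict utf-8 decoding fails on typical latin-1/cp1252 accented bytes
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so replace those instead of raising
        return raw.decode(fallback, errors="replace"), fallback
```

Calling `read_html_text('page_latin.html')` should then give back the text together with the fallback name, and the returned text can be passed to BeautifulSoup as usual.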

You can also convert to utf-8, because Beautiful Soup converts a document to Unicode when it loads it:
Bs4 Doc Wrote:Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode.
Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode.
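Unicode, Dammit can also be used on its own, without building a soup. A small sketch, assuming bs4 is installed; the byte string is built inline here instead of being read from a file, and passing `["latin-1"]` as the list of encodings to try first is my assumption about what we already suspect.

```python
from bs4 import UnicodeDammit

# Same heading as in the test file, encoded as latin-1 bytes
raw = "<h1>Jalapeñod je pèle</h1>".encode("latin-1")

# The second argument is a list of encodings to try before any guessing
dammit = UnicodeDammit(raw, ["latin-1"])

print(dammit.original_encoding)  # the encoding that was used to decode
print(dammit.unicode_markup)     # the markup as a normal Python str
```

Without the list of suggested encodings, the guess depends on which detection library (if any) is installed, so results may vary for short documents like this one.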

So, to convert from latin-1 to utf-8:
from bs4 import BeautifulSoup

# Open the source in binary mode so Beautiful Soup detects the encoding itself
with open('page_latin.html', 'rb') as fp, open('html_new.html', 'w', encoding='utf-8') as fp_out:
    raw_html = fp.read()
    # Once parsed by BS, the document is Unicode
    soup = BeautifulSoup(raw_html, 'lxml')
    fp_out.write(soup.prettify())
λ chardetect html_new.html
html_new.html: utf-8 with confidence 0.7525
File used in the test; the content is the same, just saved with different encodings.
<html lang="en">
  <head>
    <title>Here is site title</title>
  </head>
  <body>
    <h1>Jalapeñod je pèle</h1>
  </body>
</html>
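If you want to reproduce the test, here is one possible way to create both files from the same markup with only the standard library. `write_test_pages` is a hypothetical helper name; the file names are the ones used with chardetect above.

```python
from pathlib import Path

HTML = """<html lang="en">
  <head>
    <title>Here is site title</title>
  </head>
  <body>
    <h1>Jalapeñod je pèle</h1>
  </body>
</html>
"""


def write_test_pages(folder="."):
    """Write the same markup twice: once as latin-1 bytes, once as utf-8 bytes."""
    folder = Path(folder)
    latin_path = folder / "page_latin.html"
    utf8_path = folder / "page_utf8.html"
    # Encode explicitly and write bytes, so no platform default is involved
    latin_path.write_bytes(HTML.encode("latin-1"))
    utf8_path.write_bytes(HTML.encode("utf-8"))
    return latin_path, utf8_path
```

The two files differ only in the bytes for ñ and è: one byte each in latin-1, two bytes each in utf-8.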


Messages In This Thread
RE: Right way to open files with different encodings? - by snippsat - Apr-23-2024, 05:50 PM
