(Apr-23-2024, 08:49 AM)Winfried Wrote: Some of the files could be Windows (latin1, iso8859-1, cp1252), others could be utf-8.
As these are .html files, some advice: if you are making or saving these files yourself, then there is a way to always save them as utf-8 when using Requests and BS.
If the files are already made, then as mentioned by Gribouillis there is chardet.
So e.g. if I have one .html file (which I made latin-1) and one in utf-8:
λ chardetect page_latin.html
page_latin.html: ISO-8859-1 with confidence 0.73
G:\div_code\html_utf
λ chardetect page_utf8.html
page_utf8.html: utf-8 with confidence 0.7525
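If installing chardet is not an option, a rough stdlib-only fallback is to try a strict utf-8 decode first and only then fall back to a single-byte Windows encoding. This is a minimal sketch, not what chardet does internally; the function name guess_decode and the choice of cp1252 as the fallback are my own assumptions.

```python
def guess_decode(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding_used) for the given raw bytes."""
    try:
        # Strict utf-8 decoding fails on most non-ASCII latin-1/cp1252 text
        return raw.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        # Assumption: anything that is not valid utf-8 is treated as cp1252;
        # errors='replace' covers the few byte values cp1252 leaves undefined
        return raw.decode('cp1252', errors='replace'), 'cp1252'


if __name__ == '__main__':
    sample = 'Jalapeñod je pèle'
    for enc in ('utf-8', 'latin-1'):
        text, used = guess_decode(sample.encode(enc))
        print(f'{enc} bytes decoded as {used}: {text}')
```

This only separates utf-8 from "something else", so for files in other encodings chardet is still the better tool.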
from bs4 import BeautifulSoup

# The latin-1 file needs its encoding passed explicitly
with open('page_latin.html', encoding='latin-1') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)

# utf-8, passed explicitly as the platform default can differ
with open('html_new.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)
Output:
<h1>Jalapeñod je pèle</h1>
<h1>Jalapeñod je pèle</h1>
So all works as it should. If I take away encoding='latin-1' (so the file is read as utf-8), it breaks with a UnicodeDecodeError.
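To see where that error comes from, it can help to look at the raw bytes of the test text in both encodings. A small sketch using only the stdlib; the helper name decodes_as_utf8 is made up for this example.

```python
def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if raw is valid utf-8, False otherwise."""
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


text = 'Jalapeñod je pèle'
latin_bytes = text.encode('latin-1')
utf8_bytes = text.encode('utf-8')

print(latin_bytes)  # b'Jalape\xf1od je p\xe8le'
print(utf8_bytes)   # b'Jalape\xc3\xb1od je p\xc3\xa8le'

# In latin-1 each accented character is one byte; in utf-8 it is two bytes.
# The single bytes \xf1 and \xe8 are not valid utf-8 sequences on their own.
print(decodes_as_utf8(latin_bytes))  # False
print(decodes_as_utf8(utf8_bytes))   # True
```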
You can also convert to utf-8, since the document is converted to Unicode when it is loaded into Beautiful Soup:
Bs4 Doc Wrote:Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode.
Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode.
So, converting from latin-1 to utf-8:
from bs4 import BeautifulSoup

with open('page_latin.html', 'rb') as fp, open('html_new.html', 'w', encoding='utf-8') as fp_out:
    file_out = fp.read()
    # When BS loads the raw bytes, the document is converted to Unicode
    soup = BeautifulSoup(file_out, 'lxml')
    # Writing the Unicode text to a file opened with encoding='utf-8' saves it as utf-8
    fp_out.write(soup.prettify())
λ chardetect html_new.html
html_new.html: utf-8 with confidence 0.7525
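If the source encoding is already known and no parsing is needed, the same conversion can be sketched without Beautiful Soup. The function name convert_to_utf8 and its parameters are hypothetical, and it assumes the whole file fits in memory.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding='latin-1'):
    """Read src using source_encoding and write the same text to dst as utf-8."""
    # Decoding gives a Unicode str; encoding it again as utf-8 does the conversion
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding='utf-8')


if __name__ == '__main__':
    source = Path('page_latin.html')
    if source.exists():
        convert_to_utf8(source, 'html_new.html')
```

Unlike the Beautiful Soup version, this does not guess the encoding and does not touch the markup, so a charset declared inside the HTML (for example in a meta tag) would have to be updated separately.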
File used in the test; the content is the same, just saved with different encodings.
<html lang="en">
<head>
<title>Here is site title</title>
</head>
<body>
<h1>Jalapeñod je pèle</h1>
</body>
</html>