Weird characters scraped - Printable Version

Weird characters scraped - samuelbachorik - Oct-01-2023

Hi, I am trying to scrape a web page, but the product names come back with weird characters. Below is what the page source shows and what Beautiful Soup scrapes for me.

This is how it looks in the web page when viewed through Inspect:

<strong>Clamps P&S Black 100</strong>

And this is what beautifulsoup scrapes:

<strong>Clamps P&amp;S Black 100</strong>

How do I get it right?

Thank you

RE: Weird characters scraped - SpongeB0B - Oct-28-2023

Hi @samuelbachorik,

&amp; is not a "weird" character; it is an "HTML character entity reference". Damn, that's a long name...
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

If you plan to use the extracted data outside HTML, you can convert those character entity references back into the plain characters they stand for (here, &amp; becomes &).
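
As a small sketch of that conversion, the standard library's html module can decode entity references and encode them again. The sample strings below are made up for illustration; they only show that the named, decimal, and hexadecimal spellings of the same entity decode to the same character.

```python
import html

# Three spellings of the same character: named, decimal, and hexadecimal.
variants = ["P&amp;S", "P&#38;S", "P&#x26;S"]
decoded = [html.unescape(v) for v in variants]
print(decoded)  # ['P&S', 'P&S', 'P&S']

# Going the other way: escape plain text before putting it back into HTML.
print(html.escape("P&S"))  # P&amp;S
```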

Cheers,

RE: Weird characters scraped - snippsat - Oct-28-2023

When you use .text to get the content out of a tag, it will be OK.

from bs4 import BeautifulSoup

html = '<strong>Clamps P&S Black 100</strong>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.select_one('strong')

>>> tag
<strong>Clamps P&amp;S Black 100</strong>
>>> tag.text
'Clamps P&S Black 100'
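
The same idea applied to several tags at once might look like the sketch below. The sample markup and the second product name are invented for illustration, and 'html.parser' is used here only so the example does not need lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; a real page would come from a request.
page = """
<ul>
  <li><strong>Clamps P&amp;S Black 100</strong></li>
  <li><strong>Clamps P&amp;S White 50</strong></li>
</ul>
"""

soup = BeautifulSoup(page, "html.parser")
tags = soup.find_all("strong")

# str(tag) serializes the tag back to HTML, so "&" is escaped again.
first = tags[0]
print(str(first))  # <strong>Clamps P&amp;S Black 100</strong>

# get_text() (like .text) returns the decoded text content.
names = [tag.get_text(strip=True) for tag in tags]
print(names)  # ['Clamps P&S Black 100', 'Clamps P&S White 50']
```

So printing the tag itself shows HTML markup, while the text you would store in a file or database comes from .text or get_text().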

RE: Weird characters scraped - DeaD_EyE - Oct-29-2023

With Python's standard library:

import html


s = "<strong>Clamps P&amp;S Black 100</strong>"
text = html.unescape(s)

print(text)

Output:
<strong>Clamps P&S Black 100</strong>

But BeautifulSoup already does this for you, so you should use that third-party library.
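
If installing a third-party library is not an option, the standard library can also pull decoded text out of a tag. This is only a minimal sketch: the class name StrongTextParser is made up for illustration, and it handles nothing beyond simple, non-nested <strong> elements.

```python
from html.parser import HTMLParser


class StrongTextParser(HTMLParser):
    """Collect the decoded text of every <strong> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True (the default) makes handle_data receive
        # already-unescaped text, so "&amp;" arrives as "&".
        super().__init__(convert_charrefs=True)
        self.inside_strong = False
        self.chunks = []
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.inside_strong = True
            self.chunks = []

    def handle_data(self, data):
        if self.inside_strong:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "strong" and self.inside_strong:
            self.names.append("".join(self.chunks))
            self.inside_strong = False


parser = StrongTextParser()
parser.feed("<strong>Clamps P&amp;S Black 100</strong>")
parser.close()
print(parser.names)  # ['Clamps P&S Black 100']
```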