Weird characters scraped
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development, Thread: /thread-40825.html

Weird characters scraped - samuelbachorik - Oct-01-2023

Hi, I am trying to scrape a web page, but the product names come out with weird characters. Here is what is in the page source and what Beautiful Soup scrapes for me.

This is how it looks in the web page through Inspect:

    <strong>Clamps P&S Black 100</strong>

And this is what BeautifulSoup scrapes:

    <strong>Clamps P&amp;S Black 100</strong>

How do I get it right? Thank you.

RE: Weird characters scraped - SpongeB0B - Oct-28-2023

Hi @samuelbachorik,

&amp; is not a "weird" character, it's an "HTML character entity reference". Damn, that's a long name...
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

If you plan to use the extracted data outside HTML, then you could convert those character entity references into UTF-8 text.

Cheers,

RE: Weird characters scraped - snippsat - Oct-28-2023

When you use .text to get the content out of a tag, it will be OK.

    from bs4 import BeautifulSoup

    html = '<strong>Clamps P&amp;S Black 100</strong>'
    soup = BeautifulSoup(html, 'lxml')
    tag = soup.select_one('strong')

    >>> tag
    <strong>Clamps P&amp;S Black 100</strong>
    >>> tag.text
    'Clamps P&S Black 100'

RE: Weird characters scraped - DeaD_EyE - Oct-29-2023

With the stdlib from Python:

    import html

    s = "<strong>Clamps P&amp;S Black 100</strong>"
    text = html.unescape(s)
    print(text)

But BeautifulSoup does it already. You should use this 3rd party library.
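
To see why the same name shows up both with and without the entity, here is a minimal stdlib-only sketch. It is not code from the thread and does not use BeautifulSoup: `TextCollector` and `extract_text` are hypothetical helper names, and the sample markup assumes the page source contains `&amp;` as discussed above. The idea it illustrates is that an HTML parser decodes `&amp;` to `&` when it hands you the text, while turning that text back into markup escapes it again.

```python
import html
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text content and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(markup):
    """Return the decoded text content of an HTML fragment."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Assumed sample: the raw markup as it would appear in the page source.
raw = "<strong>Clamps P&amp;S Black 100</strong>"

text = extract_text(raw)
print(text)               # decoded text: Clamps P&S Black 100
print(html.escape(text))  # escaped again for markup: Clamps P&amp;S Black 100
```

So if the goal is the product name as data (for a CSV, a database, and so on), take the text of the tag rather than the tag's markup.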