Python Forum
Trying to scrape data from HTML with no identifiers - Printable Version


Trying to scrape data from HTML with no identifiers - pythonpaul32 - Nov-29-2023

I am using Selenium and BeautifulSoup and am trying to scrape data from an HTML structure like this:

Output:
<h2>Education</h2> Entry1 <br> Entry2 <h2>Employment
I cannot figure out how to scrape everything under the Education section. The HTML is causing problems for me. I've tried a ton of different things, but nothing seems to get the data consistently. Does anyone have any idea how I can work around this HTML?
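
For reference, this is roughly how I am getting the HTML into BeautifulSoup (the URL here is just a placeholder, not the real site):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()                        # any WebDriver works here
driver.get('https://example.com/profile')          # placeholder URL
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()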


RE: Trying to scrape data from HTML with no identifiers - snippsat - Nov-29-2023

The HTML you posted does not show the surrounding tag, which is what has to be parsed to get Entry1 and Entry2.
The code below iterates over the contents of the div element and checks whether each item is a NavigableString, i.e. a text node,
so we don't pick up the content of the h2 tags.
from bs4 import BeautifulSoup, NavigableString

html = '''\
<html>
<head>
  <title>Page Title</title>
</head>
<body>
  <div class="something">
    <h2>Education</h2>
    Entry1
    <br>
    Entry2
    <h2>Employment
  </div>
</body>
</html>'''

soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='something')
entries = []
for content in div:
    # keep only non-empty text nodes, skipping tags such as <h2> and <br>
    if isinstance(content, NavigableString) and content.strip():
        entries.append(content.strip())

print(entries)
Output:
['Entry1', 'Entry2']
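If you only want everything under a specific heading (e.g. the Education section), a variation of the same idea is to start from that h2 and walk its next_siblings until the next h2. A sketch, reusing the html string above:
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup(html, 'lxml')   # html is the same string as above
education = []
heading = soup.find('h2', string='Education')
for sibling in heading.next_siblings:
    # stop when the next section heading is reached
    if isinstance(sibling, Tag) and sibling.name == 'h2':
        break
    # keep only non-empty text nodes
    if isinstance(sibling, NavigableString) and sibling.strip():
        education.append(sibling.strip())

print(education)
Output:
['Entry1', 'Entry2']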
The HTML is not well written, which makes it harder to parse; it would be better if it looked something like this:
<div class="something">
    <h2>Education</h2>
    <ul>
      <li>Entry1</li>
      <li>Entry2</li>
    </ul>
    <h2>Employment</h2>
</div>
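With markup like that, the entries for a section can be read straight out of the list; a minimal sketch:
from bs4 import BeautifulSoup

html = '''\
<div class="something">
    <h2>Education</h2>
    <ul>
      <li>Entry1</li>
      <li>Entry2</li>
    </ul>
    <h2>Employment</h2>
</div>'''

soup = BeautifulSoup(html, 'lxml')
heading = soup.find('h2', string='Education')
# the ul that follows the Education heading holds that section's entries
entries = [li.get_text(strip=True) for li in heading.find_next('ul').find_all('li')]
print(entries)
Output:
['Entry1', 'Entry2']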



RE: Trying to scrape data from HTML with no identifiers - pythonpaul32 - Dec-02-2023

Thank you. Yes, the HTML is not good. It is a pain, so I had to come up with other solutions.