Extracting html data using attributes

WiPi · Apr-28-2020, 05:44 PM

Hi guys,

I am trying to extract specific data from a block of HTML using BeautifulSoup and attributes, but can only get so far.
The first 3 lines of the HTML block are:

<div class="pagearrange__layout-column pagearrange__layout-column--full">
<a class="anchor" name="explorers"></a>
<div id="explorer_shell_159760" data-tag="41" data-hash-tag="41" data-id="159760" data-aggregating="" data-owner="0" data-userid="0" data-validate="" class="trade_explorer flexBox noflex explorer explorer--demo explorer--loaded">

My code is:

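import requests
from bs4 import BeautifulSoup as bs

# url holds the address of the page being scraped (not shown here)
response = requests.get(url)
content = response.content
soup = bs(content,'lxml')
ids = soup.find('div', class_ = 'pagearrange__layout-column pagearrange__layout-column--full')
print(ids)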

This finds the correct block of code but I now need to extract the value of the attribute 'data-id'.

Any ideas please?

thanks

anbu23 · Apr-28-2020, 07:14 PM

ids = soup.find('div', id = 'explorer_shell_159760')['data-id']

WiPi · Apr-28-2020, 07:21 PM

There are a couple of problems with this. If I run it as-is the output is:

Output:
TypeError: 'NoneType' object is not subscriptable

Also, the id cannot be hard-coded: the number at the end is the same number as data-id, which changes depending on the URL.
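
The TypeError above is what happens when find() matches nothing: it returns None, and None cannot be indexed. A minimal sketch of guarding against that, assuming markup modelled on the first post but with a different number; the helper name data_id_for is made up for illustration:

```python
from bs4 import BeautifulSoup

# Markup modelled on the first post, with a different id number (assumption).
html = '''<div class="pagearrange__layout-column pagearrange__layout-column--full">
<a class="anchor" name="explorers"></a>
<div id="explorer_shell_187462" data-id="187462" class="trade_explorer"></div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

def data_id_for(soup, element_id):
    """Hypothetical helper: data-id of the div with this id, or None if absent."""
    tag = soup.find('div', id=element_id)
    if tag is None:  # find() returns None when nothing matches
        return None
    return tag.get('data-id')

missing = data_id_for(soup, 'explorer_shell_159760')  # id not on this page
present = data_id_for(soup, 'explorer_shell_187462')  # id on this page
print(missing, present)
```

Checking for None first turns the crash into a value the calling code can test.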

snippsat · (This post was last modified: Apr-28-2020, 08:34 PM by snippsat.)

from bs4 import BeautifulSoup

html = '''\
<div class="pagearrange__layout-column pagearrange__layout-column--full">
<a class="anchor" name="explorers"></a>
<div id="explorer_shell_159760" data-tag="41" data-hash-tag="41" data-id="159760" data-aggregating="" data-owner="0" data-userid="0" data-validate="" class="trade_explorer flexBox noflex explorer explorer--demo explorer--loaded">'''

soup = BeautifulSoup(html, 'lxml')

Now we can test this out. First find the div tag; here I just use the id to find it.
Then attrs gives all attributes of that tag as a dictionary, and as shown you can take out the one you want.

>>> tag = soup.find(id="explorer_shell_159760")
>>> tag
<div class="trade_explorer flexBox noflex explorer explorer--demo explorer--loaded" data-aggregating="" data-hash-tag="41" data-id="159760" data-owner="0" data-tag="41" data-userid="0" data-validate="" id="explorer_shell_159760"></div>
>>> 
>>> tag.attrs
{'class': ['trade_explorer',
           'flexBox',
           'noflex',
           'explorer',
           'explorer--demo',
           'explorer--loaded'],
 'data-aggregating': '',
 'data-hash-tag': '41',
 'data-id': '159760',
 'data-owner': '0',
 'data-tag': '41',
 'data-userid': '0',
 'data-validate': '',
 'id': 'explorer_shell_159760'}
>>> 
>>> tag.attrs.get('data-id')
'159760'

Testing @anbu23's code: it does work with the test code.
This is more the way it should be done; using attrs was more of a demo of what it does.

>>> soup.find('div', id = 'explorer_shell_159760')['data-id']
'159760'
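
The two access styles shown above differ when the attribute is absent. A small sketch of the difference, using a cut-down copy of the div; the attribute name data-missing is invented purely to trigger the missing case:

```python
from bs4 import BeautifulSoup

html = '<div id="explorer_shell_159760" data-id="159760" data-validate=""></div>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('div', id='explorer_shell_159760')

by_index = tag['data-id']    # raises KeyError if the attribute is missing
by_get = tag.get('data-id')  # returns None if the attribute is missing
# 'data-missing' is a made-up attribute name, used only for the demo
fallback = tag.get('data-missing', 'n/a')

try:
    tag['data-missing']
    raised = False
except KeyError:
    raised = True

print(by_index, by_get, fallback, raised)
```

So tag['data-id'] is fine when the attribute is known to exist, and tag.get('data-id') is the safer choice when it might not.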

WiPi · Apr-28-2020, 10:30 PM

guys,

Thanks for your replies, I think I have the solution. As I mentioned, the numbers change depending on the URL I am looking at, i.e. <div id="explorer_shell_159760" might be <div id="explorer_shell_187462" in another URL, so we have to match the id by pattern rather than by exact value.
With your help and for completeness I believe this code works:

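from bs4 import BeautifulSoup as bs
import re
import requests

# url is the address of the page being scraped (not shown in the post)
response = requests.get(url)
html = response.content
soup = bs(html, 'lxml')
# match every tag whose id contains 'explorer_shell_'
tags = soup.find_all(id=re.compile('explorer_shell_.*'))
for data in tags:
    d = data.get('data-id')
    print(d)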

and for the particular URL I was looking at, the output is:

Output:
169816
170535
170727
171268
181972
181973
185996
186092
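
As an alternative to the regex, the same prefix match can probably be written as a CSS selector with select(). This is only a sketch on invented sample markup (the third div is there to show what gets filtered out), not a claim about the real page:

```python
from bs4 import BeautifulSoup

# Invented sample markup: two explorer divs and one unrelated div.
html = '''
<div id="explorer_shell_169816" data-id="169816"></div>
<div id="explorer_shell_170535" data-id="170535"></div>
<div id="unrelated" data-id="999"></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# [id^="..."] means "id starts with"; [data-id] requires the attribute to exist
selector = 'div[id^="explorer_shell_"][data-id]'
ids = [div['data-id'] for div in soup.select(selector)]
print(ids)
```

Because the selector already requires data-id to be present, indexing with div['data-id'] cannot raise KeyError here.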

WiPi · May-04-2020, 08:54 AM

Hi guys,

I'm sorry to re-ignite this thread but I am really struggling to extract data from another set of HTML!
Here's the block I am interested in:

<tbody class="explorer_tradeslist__tbody"> 
 <tr id="trade_349236564" data-ticket="349236564" class="explorer_tradeslist__row "> 
    <td class="slidetable__cell slidetable__cell--fixed" style="width: 63px; min-width: 63px;"> 
        <a id="snap_180400_trade_349236564" class="explorer__anchor explorer__anchor--trade"></a>

NZD/CAD

</td> 
<td style="width: 20px; min-width: 20px;"></td> <td style="width: 103px; min-width: 103px;">

I am trying to extract the text 'NZDCAD'.

These are all the variants I have tried so far...all unsuccessful!

tag=soup.find('a',class_ = 'explorer__anchor explorer__anchor--trade')
tag=soup.find_all('td',class_ = 'slidetable__cell slidetable__cell--fixed')
table=soup.find('table',class_='explorer_tradeslist__table alternating slidetable__table')
table=soup.find('tbody',class_='explorer_tradeslist__tbody')
tag=soup.find(class_='explorer_tradeslist__tbody',attrs={'id':'snap_180400_trade_349236564'})
tag=soup.find(class_='slidetable__cell slidetable__cell--fixed',attrs={'id':'snap_180400_trade_349236564'})
tag=soup.find_all('a',attrs={'id':'snap_180400_trade_349236564'})

Are you able to put me out of my misery please?

anbu23 · May-04-2020, 09:04 AM

import bs4
html_string='''<tbody class="explorer_tradeslist__tbody"> 
 <tr id="trade_349236564" data-ticket="349236564" class="explorer_tradeslist__row "> 
    <td class="slidetable__cell slidetable__cell--fixed" style="width: 63px; min-width: 63px;"> 
        <a id="snap_180400_trade_349236564" class="explorer__anchor explorer__anchor--trade"></a>
 
NZD/CAD
 
</td> 
<td style="width: 20px; min-width: 20px;"></td> <td style="width: 103px; min-width: 103px;">
'''
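# name the parser explicitly to avoid the "no parser specified" warning
soup = bs4.BeautifulSoup(html_string, 'html.parser')
# strip() drops the newlines and spaces around the cell text
print(soup.find("td",class_="slidetable__cell slidetable__cell--fixed").text.strip())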
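
A variation on the same idea, sketched under the assumption that the rows sit inside a normal table (the table wrapper and closing tags are added here so the sample is well-formed). It matches on a single class name through a CSS selector, so the full class string does not have to match exactly, and uses get_text(strip=True) to remove the surrounding whitespace:

```python
from bs4 import BeautifulSoup

# Markup from the post, wrapped in <table> and closed off (assumption).
html_string = '''<table><tbody class="explorer_tradeslist__tbody">
 <tr id="trade_349236564" data-ticket="349236564" class="explorer_tradeslist__row ">
    <td class="slidetable__cell slidetable__cell--fixed" style="width: 63px; min-width: 63px;">
        <a id="snap_180400_trade_349236564" class="explorer__anchor explorer__anchor--trade"></a>

NZD/CAD

</td>
</tr></tbody></table>'''

soup = BeautifulSoup(html_string, 'html.parser')

# Match the first fixed cell inside a trade row by one class name each.
cell = soup.select_one('tr.explorer_tradeslist__row td.slidetable__cell--fixed')
pair = cell.get_text(strip=True)  # text without surrounding whitespace
symbol = pair.replace('/', '')    # drop the slash if 'NZDCAD' is wanted
print(pair, symbol)
```

The text lives in the td, not in the empty a tag, which is why searching for the anchor alone returns nothing useful.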

WiPi · (This post was last modified: May-04-2020, 10:16 AM by WiPi.)

I tried this and the output was 'Develop' which is weird as this is in the full html but under different tag names:

<li class="left noborder nolink explorer__headerli explorer__headerli--title"> 
 <strong class="explorer__titlesegment explorer__titlesegment--title">
  <span class="icon icon--explorer-demo"></span>
   Develop
 </strong> 
 <span class="explorer__titlesegment explorer__titlesegment--pipe">|</span>

anbu23 · (This post was last modified: May-04-2020, 10:34 AM by anbu23.)

It's weird. Can you post the code you tried?

WiPi · (This post was last modified: May-04-2020, 11:32 AM by WiPi.)

from bs4 import BeautifulSoup as bs
import requests

url = 'url'  # placeholder for the real page address
response = requests.get(url)
html = response.content
soup = bs(html,'lxml')
tag=soup.find("td",class_="slidetable__cell slidetable__cell--fixed").text

Actually I don't know what I did first time around but this returns 'None' now so I guess it didn't find anything.
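
One possible explanation, and it is only a guess, is that the table is not in the HTML that requests downloads (for example if the page fills it in with JavaScript afterwards). A rough way to check is to look for the class name in the raw text before parsing. The two helper names and both sample strings below are invented for illustration:

```python
from bs4 import BeautifulSoup

def first_fixed_cell_text(html):
    """Hypothetical helper: stripped text of the first matching td, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    cell = soup.find('td', class_='slidetable__cell--fixed')
    return cell.get_text(strip=True) if cell is not None else None

def markup_present(html, marker='slidetable__cell--fixed'):
    """Hypothetical helper: is the class name anywhere in the downloaded HTML?"""
    return marker in html

# Invented samples standing in for two possible downloads of the page.
with_table = ('<table><tr><td class="slidetable__cell slidetable__cell--fixed">'
              ' NZD/CAD </td></tr></table>')
without_table = '<div class="explorer">Loading...</div>'

print(markup_present(with_table), first_fixed_cell_text(with_table))
print(markup_present(without_table), first_fixed_cell_text(without_table))
```

With a real page these helpers would be given response.text (a str) rather than response.content (bytes). If markup_present comes back False, the problem is with what was downloaded rather than with the find() call.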
