Problem with searching over Beautiful Soap object

RE: Problem with searching over Beautiful Soap object - Pavel_47 - May-28-2022

Well ... if I stay with the current version of Selenium, how do I find the Publisher?
I've just tried

publisher = browser.find_element_by_name('Publisher')

The search failed and threw an exception.

RE: Problem with searching over Beautiful Soap object - Pavel_47 - May-28-2022

Well, this instruction does the job:

publisher = browser.find_elements_by_xpath("//*[contains(text(), 'Publisher')]")

But the actual value of Publisher (i.e. Springer) is in the next field.
How do I advance to the next field?

RE: Problem with searching over Beautiful Soap object - snippsat - May-28-2022

browser.find_elements_by_xpath is the old way (deprecated). With Selenium 4 it is done like this.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

#--| Setup
options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
ser = Service(r"C:\cmder\bin\chromedriver.exe")
browser = webdriver.Chrome(service=ser, options=options)
#--| Parse or automation
url = "https://www.amazon.com/Advanced-Artificial-Intelligence-Robo-Justice-Georgios-ebook/dp/B0B1H2MZKX/ref=sr_1_1?keywords=9783030982058&qid=1653563461&sr=8-1"
browser.get(url)
title = browser.find_element(By.CSS_SELECTOR, '#productTitle')
print(title.text)
# with CSS selector
publisher = browser.find_element(By.CSS_SELECTOR, '#detailBullets_feature_div > ul > li:nth-child(2) > span > span:nth-child(2)')
print(publisher.text)
# With XPath
publisher1 = browser.find_element(By.XPATH, '//*[@id="detailBullets_feature_div"]/ul/li[2]/span/span[2]')
print(publisher1.text)

Output:
Advanced Artificial Intelligence and Robo-Justice
Springer (May 16, 2022)
Springer (May 16, 2022)

RE: Problem with searching over Beautiful Soap object - Pavel_47 - May-28-2022

(May-28-2022, 02:15 PM)snippsat Wrote: browser.find_elements_by_xpath is the old way (deprecated). With Selenium 4 it is done like this.
Output:
Advanced Artificial Intelligence and Robo-Justice
Springer (May 16, 2022)
Springer (May 16, 2022)

Ok, it works.
But this method relies on the layout of this particular book's page.
With another book, the layout may be slightly different.
I think a safer method is to find the tag containing "Publisher", then move to the next tag at the same level of the hierarchy, and finally extract the text from that tag.
Does Selenium provide such methods?
If I remember correctly, BeautifulSoup provides navigation functions for moving between neighboring tags.
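
The "find the label, then step to the next tag at the same level" idea can be expressed with XPath's following-sibling axis. A minimal sketch, assuming the label and its value sit in two sibling span elements and that the label contains no single quote; value_xpath_for_label is a hypothetical helper name, not part of Selenium:

```python
def value_xpath_for_label(label: str) -> str:
    """Build an XPath for the first <span> that follows the <span> containing `label`."""
    # contains(text(), ...) matches the span holding the label text;
    # following-sibling::span[1] then steps to the next span at the same level.
    return f"//span[contains(text(), '{label}')]/following-sibling::span[1]"


print(value_xpath_for_label("Publisher"))
```

The resulting string could then be passed to the usual XPath lookup, e.g. browser.find_element(By.XPATH, value_xpath_for_label("Publisher")).text in the Selenium 4 style. Whether it matches depends on the page really having that sibling structure.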

RE: Problem with searching over Beautiful Soap object - snippsat - May-28-2022

(May-28-2022, 02:30 PM)Pavel_47 Wrote: I think a safer method is to find the tag containing "Publisher", then move to the next tag at the same level of hierarchy, and finally extract the text from that tag.

Find the tag that holds the whole Product details list.

publisher = browser.find_element(By.CSS_SELECTOR, '#detailBulletsWrapper_feature_div')
print(publisher.text)

Output:
Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming
Product details
Publisher : No Starch Press; 2nd edition (May 3, 2019)
Language : English
Paperback : 544 pages
ISBN-10 : 1593279280
ISBN-13 : 978-1593279288
Reading age : 12 years and up
Lexile measure : 1050L
Item Weight : 2.3 pounds
Dimensions : 7 x 1.25 x 9.25 inches
Best Sellers Rank: #780 in Books (See Top 100 in Books)
#1 in Object-Oriented Design
#1 in Python Programming
#2 in Software Development (Books)
Customer Reviews:
6,496 ratings
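
Since the text above is already in "Label : value" form, one layout-independent option is to parse it in plain Python rather than rely on li:nth-child positions. A rough sketch, assuming each field sits on one line and that the colon may be surrounded by invisible direction marks; parse_product_details is a hypothetical helper name:

```python
def parse_product_details(text: str) -> dict:
    """Turn 'Label : value' lines from the Product details text into a dict."""
    lines = [line.strip() for line in text.splitlines()]
    # Skip everything up to the "Product details" heading, if present,
    # so that a book title containing a colon is not mistaken for a field.
    if "Product details" in lines:
        lines = lines[lines.index("Product details") + 1:]
    details = {}
    for line in lines:
        label, sep, value = line.partition(":")
        # Strip spaces and the direction marks U+200E/U+200F that may
        # surround the colon (an assumption about the page markup).
        label = label.strip(" \u200e\u200f")
        value = value.strip(" \u200e\u200f")
        if sep and label and value:
            details[label] = value
    return details


sample = """Product details
Publisher : No Starch Press; 2nd edition (May 3, 2019)
Language : English
ISBN-13 : 978-1593279288"""
details = parse_product_details(sample)
print(details["Publisher"])
```

With a live page, the input would be publisher.text from the snippet above. Lines without a value after the colon (such as "Customer Reviews:") are skipped.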

Getting a single element would look like this.

>>> p = publisher.find_elements_by_css_selector('li:nth-child(1) > span > span:nth-child(2)')
>>> p
[<selenium.webdriver.remote.webelement.WebElement (session="26ba57aa713155834023884ce6f18ab7", element="43c80e5e-eee0-49d8-94d4-fd69305b17ec")>]
>>> p[0].text
'No Starch Press; 2nd edition (May 3, 2019)'
>>> p = publisher.find_elements_by_css_selector('li:nth-child(5) > span > span:nth-child(2)')
>>> p[0].text
'978-1593279288'

Quote: If I remember correctly, BeautifulSoup provides a navigation functions over neighboring tags.

You can use BS together with Selenium, e.g. as in this post.

RE: Problem with searching over Beautiful Soap object - Pavel_47 - May-29-2022

Product details - Ok.
Works fine. By exploring this fragment we can extract the Publisher and the date.
I also tried CSS_SELECTOR for finding the Author (please see the screenshot below)

[Image: amazon-search-author-book-location.png]

This method doesn't work for the Author.
I also tried find_element_by_class_name. That doesn't work either.

RE: Problem with searching over Beautiful Soap object - snippsat - May-29-2022

Do you know that you can copy the CSS selector or XPath of a tag when you hover over it in the browser's Inspect tool?
This is a copy of the CSS selector: '#bylineInfo > span'

title = browser.find_element(By.CSS_SELECTOR, '#productTitle')
print(title.text)
# The byline span holds the author, so name the variable accordingly
author = browser.find_element(By.CSS_SELECTOR, '#bylineInfo > span')
print(author.text)

Output:
Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming
Eric Matthes (Author)

RE: Problem with searching over Beautiful Soap object - Pavel_47 - May-29-2022

I'm not sure I understood how it works ... I mean the use of the '>' symbol.
I was searching for the Reviews section of this book:
https://www.amazon.com/Discovering-Modern-Depth-Peter-Gottschling/dp/0136677649/
I tried a more classic approach: first find the section by its unique ID, then search within that section by class name for the information to extract (the element with that class holds what I want: the string "3.6 by 5")

Here is the snippet I used for that:

reviews_section = browser.find_element_by_id('acrPopover')
score = reviews_section.find_elements_by_class_name('a-icon-alt')
print(score[0].text)

Unfortunately, the printed output is empty.
Here is a screenshot of the relevant fragment of the book page, with the "centers of interest" outlined:
[Image: amazon-reviews-section.png]

RE: Problem with searching over Beautiful Soap object - snippsat - May-30-2022

I mean like this: click on ... (in some cases right-click works too).
Then it is easier, as you get the correct selector or XPath for the chosen tag.

RE: Problem with searching over Beautiful Soap object - Pavel_47 - Jun-30-2022

I tried with both css_selector and class name: nothing appears in the printed output.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
 
options = Options()
options.add_argument("--headless")
browser = webdriver.Chrome('/usr/bin/chromedriver', options=options)
url = 'https://www.amazon.com/Discovering-Modern-Depth-Peter-Gottschling/dp/0136677649/'
browser.get(url)
reviews1 = browser.find_element_by_css_selector('span.a-icon-alt')
reviews2 = browser.find_element_by_class_name('a-icon-alt')
print(reviews1.text)
print(reviews2.text)
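
One possible explanation, offered as an assumption rather than something confirmed in the thread: Selenium's .text returns only text that is rendered as visible, so if the a-icon-alt span is hidden by CSS, .text comes back empty even though the element was found. A minimal sketch that falls back to the DOM textContent property in that case; visible_or_raw_text is a hypothetical helper name:

```python
def visible_or_raw_text(element) -> str:
    """Return element.text, or the raw textContent if no visible text exists."""
    text = element.text
    if not text:
        # get_attribute("textContent") reads the DOM text regardless of visibility.
        text = element.get_attribute("textContent") or ""
    return text.strip()
```

It would be used as print(visible_or_raw_text(reviews1)) in place of print(reviews1.text). If the result is still empty, the element found is probably not the one holding the score.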