Problem with searching over Beautiful Soap object
+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Problem with searching over Beautiful Soap object (/thread-37230.html)

Problem with searching over Beautiful Soap object - Pavel_47 - May-15-2022

Hello,
Here is a BS object from which I want to extract the "Publisher" value. The value I want to extract is "Springer; 1st ed. 2020 edition (April 27, 2020)". How should I proceed? Thanks in advance.

    <ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">
      <li> <span class="a-list-item"> <span class="a-text-bold"> ASIN : </span> <span> B087R8CYZB </span> </span> </li>
      <li> <span class="a-list-item"> <span class="a-text-bold"> Publisher : </span> <span> Springer; 1st ed. 2020 edition (April 27, 2020) </span> </span> </li>
      <li> <span class="a-list-item"> <span class="a-text-bold"> Publication date : </span> <span> April 27, 2020 </span> </span> </li>
      <li> <span class="a-list-item"> <span class="a-text-bold"> Language : </span> <span> English </span> </span> </li>
      <li> <span class="a-list-item"> <span class="a-text-bold"> File size : </span> <span> 98586 KB </span> </span> </li>
      <li> <span class="a-list-item"> <span class="a-text-bold"> Text-to-Speech : </span> <span> Enabled </span> </span> </li>
      <li> <span class="a-list-item"> <span class="a-text-bold"> Screen Reader : </span> <span> Supported </span> </span> </li>
      <li> <span class="a-list-item"> <span class="a-text-bold"> Enhanced typesetting : </span> <span> Enabled </span> </span> </li>
      <li> <span class="a-list-item"> <span class="a-text-bold"> X-Ray : </span> <span> Not Enabled </span> </span> </li>
      <li> <span class="a-list-item"> <span class="a-text-bold"> Word Wise : </span> <span> Not Enabled </span> </span> </li>
      <li> <span class="a-list-item"> <span class="a-text-bold"> Print length : </span> <span> 832 pages </span> </span> </li>
      <li> <span class="a-list-item"> <span class="a-text-bold"> Lending : </span> <span> Not Enabled </span> </span> </li>
    </ul>

RE: Problem with searching over Beautiful Soap object - Larz60+ - May-15-2022

Please start here:
Web-Scraping part-1
Web-scraping part-2

RE: Problem with searching over
Beautiful Soap object - Pavel_47 - May-15-2022

(May-15-2022, 10:55 AM)Larz60+ Wrote: please start here:

Thanks. Well, it seems I've found a solution. It is probably not very elegant, but it works:

    for item in book_details.find_all('li'):
        # Keep only the letters of the label, e.g. " Publisher : " -> "Publisher"
        item_name = ''.join(filter(str.isalpha, str(item.span.span.contents[0])))
        if item_name == 'Publisher':
            # contents[3] is the second inner <span>, which holds the value
            item_value = item.span.contents[3].contents[0]
            print(item_value)

where book_details is the BS object from my previous post. Any suggestions are welcome.

RE: Problem with searching over Beautiful Soap object - snippsat - May-15-2022

    from bs4 import BeautifulSoup

    html = '''\
    <li>
      <span class="a-list-item">
        <span class="a-text-bold">
          Publisher
        </span>
        <span>
          Springer; 1st ed. 2020 edition (April 27, 2020)
        </span>
      </span>
    </li>
    <li>
      <span class="a-list-item">
        <span class="a-text-bold">
          Publication date
        </span>
        <span>
          April 27, 2020
        </span>
      </span>'''
    soup = BeautifulSoup(html, 'lxml')

    >>> tag = soup.find('span', class_="a-list-item")
    >>> tag.find_all('span')[0].text.strip()
    'Publisher'
    >>> tag.find_all('span')[1].text.strip()
    'Springer; 1st ed. 2020 edition (April 27, 2020)'

Also remember that BS supports CSS selectors.

    >>> tag = soup.select_one('span > span:nth-child(2)')
    >>> tag.text.strip()
    'Springer; 1st ed. 2020 edition (April 27, 2020)'

This is what I use most: when you inspect the page in the browser, you can copy the selector and get the path automatically, as shown here.

RE: Problem with searching over Beautiful Soap object - Pavel_47 - May-15-2022

(May-15-2022, 02:27 PM)snippsat Wrote: Also remember that BS support CSS Selectors.

Thanks. Well, the task is a little bit more complicated. First, in the given BS object I have to find the section that contains Publisher (in the initial text its section isn't 0). Then, once the Publisher section is found, I have to find the "value" section associated with Publisher, i.e. the section that contains "Springer ...".

RE: Problem with searching over Beautiful Soap object - snippsat - May-15-2022

You can do a text search; if the text is found, the next tag will be the Springer tag.

    >>> import re
    >>> tag = soup.find(string=re.compile('Publisher'))
    >>> tag
    '\n Publisher\n '
    >>> tag.find_next()
    <span> Springer; 1st ed. 2020 edition (April 27, 2020) </span>

RE: Problem with searching over Beautiful Soap object - Pavel_47 - May-26-2022

Hello,
One more problem related to this topic. Here is the URL:
https://www.amazon.com/Advanced-Artificial-Intelligence-Robo-Justice-Georgios-ebook/dp/B0B1H2MZKX/ref=sr_1_1?keywords=9783030982058&qid=1653563461&sr=8-1
On this page I am trying to find the book title. Here is the fragment of the page where the book title is located:
I've tried multiple ways to find it (e.g. the snippet below), but all of them failed.

    for tag in soup.find_all('span', class_='a-size-extra-large'):
        print(tag)

Any suggestions? Thanks.

RE: Problem with searching over Beautiful Soap object - snippsat - May-27-2022

When I set a User-Agent with Requests (and used it with BS), it worked 1-2 times, then Amazon locked it out.
Quote: To discuss automated access to Amazon data please contact api-services-support@amazon.com.

So with a site like this you usually have to use other methods, such as Selenium, or look at what their API can give back.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    # pip install webdriver-manager
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.chrome.service import Service
    import time
    import logging
    #logging.getLogger('WDM').setLevel(logging.NOTSET)

    #--| Setup
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--window-size=1920,1080")
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    #--| Parse or automation
    browser.get("https://www.amazon.com/Advanced-Artificial-Intelligence-Robo-Justice-Georgios-ebook/dp/B0B1H2MZKX/ref=sr_1_1?keywords=9783030982058&qid=1653563461&sr=8-1")
    time.sleep(3)  # give the page time to render
    title = browser.find_element(By.CSS_SELECTOR, '#productTitle')
    print(title.text)

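Selenium and BeautifulSoup can also be combined: Selenium's `browser.page_source` returns the rendered HTML as a string, which BeautifulSoup can then parse as usual. Below is a minimal sketch under stated assumptions: `parse_product_page` is a hypothetical helper name, the page is assumed to contain the `#productTitle` element from the Selenium example above and the `detail-bullet-list` markup from the first post, and `sample_html` is made-up test input rather than a real Amazon response.

```python
from bs4 import BeautifulSoup


def parse_product_page(html):
    """Return (title, details) parsed from a product page's HTML string.

    Hypothetical helper: assumes a '#productTitle' element and the
    'detail-bullet-list' markup shown in the first post of the thread.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Title element id taken from the Selenium example above.
    title_tag = soup.select_one("#productTitle")
    title = title_tag.get_text(strip=True) if title_tag else None

    # Each list item holds a bold label span followed by a value span.
    details = {}
    for item in soup.select("ul.detail-bullet-list span.a-list-item"):
        label = item.find("span", class_="a-text-bold")
        if label is None:
            continue
        value = label.find_next_sibling("span")
        if value is None:
            continue
        # Labels look like " Publisher : ", so drop the trailing colon.
        key = label.get_text(strip=True).rstrip(":").strip()
        details[key] = value.get_text(strip=True)
    return title, details


# Made-up sample input that mimics the markup from the first post.
sample_html = """
<span id="productTitle"> Example Title </span>
<ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">
  <li> <span class="a-list-item"> <span class="a-text-bold"> Publisher : </span> <span> Springer; 1st ed. 2020 edition (April 27, 2020) </span> </span> </li>
  <li> <span class="a-list-item"> <span class="a-text-bold"> Language : </span> <span> English </span> </span> </li>
</ul>
"""

title, details = parse_product_page(sample_html)
print(title)
print(details["Publisher"])
```

With a live browser session, the same helper could be called as `parse_product_page(browser.page_source)`. Building a label-to-value dictionary avoids relying on the position of the Publisher item in the list, which was the concern raised earlier in the thread.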
RE: Problem with searching over Beautiful Soap object - Pavel_47 - May-27-2022

(May-27-2022, 10:42 AM)snippsat Wrote: If set User Agent with Requests(and use with Bs) i did work 1-2 times then Amazon lock it out.

Thanks. I'll try. By suggesting Selenium, do you mean that BeautifulSoup is not capable of handling such tasks?
Please note that I actually have no problem with the Amazon lock. Every time I make a request to the Amazon site, I check the returned status code. When it works, this code is 200. In my experience, to get locked out you have to make about 60...100 requests from the same IP. The lock is held for about 2 hours, then it is released.

RE: Problem with searching over Beautiful Soap object - snippsat - May-27-2022

By using Requests and BS you get 200 back, but you have to look at the content. There you can see that the request was detected and access was denied.

    import requests
    from bs4 import BeautifulSoup

    user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'}
    url = 'https://www.amazon.com/Advanced-Artificial-Intelligence-Robo-Justice-Georgios-ebook/dp/B0B1H2MZKX/ref=sr_1_1?keywords=9783030982058&qid=1653563461&sr=8-1'
    response = requests.get(url, headers=user_agent)
    soup = BeautifulSoup(response.content, 'lxml')
    title = soup.select_one('#productTitle')
    print(response.status_code)
    print(title)
    print('-' * 20)
    # Inspect the body to see what was actually returned
    print(soup.find('body'))
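
The point about a 200 status with a blocked body can be checked in code by looking at the content instead of the status code. This is only a rough sketch: `classify_page` and `BLOCK_MARKER` are hypothetical names, the marker is the contact address from the Amazon message quoted earlier in the thread, and the sample pages are made-up input. The real block page may look different or change over time.

```python
from bs4 import BeautifulSoup

# Assumption: the block page contains the contact address from the
# Amazon message quoted earlier in the thread.
BLOCK_MARKER = "api-services-support@amazon.com"


def classify_page(html):
    """Heuristically classify an HTML string as 'product', 'blocked' or 'unknown'."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.select_one("#productTitle") is not None:
        return "product"
    # Search the raw HTML so the marker is found even inside an HTML comment.
    if BLOCK_MARKER in html:
        return "blocked"
    return "unknown"


# Made-up sample pages for illustration only.
product_html = '<html><body><span id="productTitle"> Example Title </span></body></html>'
blocked_html = (
    "<html><body><!-- To discuss automated access to Amazon data "
    "please contact api-services-support@amazon.com. --></body></html>"
)

print(classify_page(product_html))
print(classify_page(blocked_html))
```

With Requests, the helper could be called as `classify_page(response.text)` right after checking `response.status_code`, so that a blocked response is not mistaken for a successful one.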