SoupStrainer: example (Python Forum, Web Scraping & Web Development, https://python-forum.io)
SoupStrainer: example - Truman - Sep-24-2018

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup
    from bs4 import SoupStrainer

    def is_short_string(string):
        return len(string) < 10

    only_short_strings = SoupStrainer(string=is_short_string)
    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())

I assume the problem here is that the computer doesn't know what the argument string is, but I'm not sure how to solve it.
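One likely cause, assuming the predicate is sometimes called with None rather than a string (the later replies in this thread report exactly that): len(None) raises a TypeError. A minimal sketch of a None-safe predicate follows. The name is_short_text and the limit parameter are illustrative and are not taken from the BeautifulSoup docs.

```python
def is_short_text(text, limit=10):
    """Return True only for actual strings shorter than `limit`.

    Hypothetical helper: the name and the `limit` parameter are
    illustrative and do not come from the BeautifulSoup docs.
    """
    # None (no string at all) is treated as "not short" instead of crashing.
    if text is None:
        return False
    return len(text) < limit


# Calling len() on None is what fails in an unguarded predicate.
try:
    len(None)
except TypeError as exc:
    error_message = str(exc)

print(is_short_text("Elsie"))                 # a 5-character string
print(is_short_text("The Dormouse's story"))  # a 20-character string
print(is_short_text(None))                    # no string at all
print(error_message)
```

Returning an explicit False for None keeps the predicate's result a boolean in every branch, which is easier to reason about than falling through and returning None implicitly.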
RE: SoupStrainer: example - Larz60+ - Sep-24-2018

No, it's telling you that you can't calculate the length of an empty value (None). You can modify:

    def is_short_string(string):
        if string:
            return len(string) < 10

RE: SoupStrainer: example - Truman - Sep-24-2018

The output is literally nothing. By the way, it is interesting that the code from my message is taken from the BeautifulSoup docs. It surprises me that this mistake has gone unnoticed.

And with this print call,

    print(soup.find_all(only_short_strings))

it gives
RE: SoupStrainer: example - Larz60+ - Sep-24-2018

Well, nothing in, nothing out! Try this:

    def is_short_string(string):
        # Show what the function actually receives on each call
        print('string{}'.format(string))
        if string:
            return len(string) < 10
        else:
            return 0

By the way, you should use another name. At some point you're going to run into an error, since string is also the name of a standard library module (e.g. 'import string').

RE: SoupStrainer: example - Truman - Sep-24-2018

I see your point - string None (although I prefer to use f-strings lol). Now I'll have to think about how to add html_doc to this function. Without using a function it's simple:

    only_a_tags = SoupStrainer("a")
    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())

RE: SoupStrainer: example - snippsat - Sep-25-2018

The SoupStrainer example with is_short_string is wrong on their website. I have only tested SoupStrainer a couple of times, so whether it is useful is questionable. Here is a solution that doesn't use SoupStrainer. It can take both sentences (what SoupStrainer gives back) and the length of all words.

    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    def by_size(words, size):
        return [word for word in words if len(word) < size]

    soup = BeautifulSoup(html_doc, 'html.parser')
    #words = soup.text.split()
    # Split the page text into lines and keep the short ones
    sentence = soup.text.split('\n')
    print(by_size(sentence, 10))

    words = soup.text.split()
    print(by_size(words, 4))
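To illustrate the difference between the two calls above: split('\n') yields whole lines while split() yields individual words, so the same by_size filter answers two different questions. A minimal sketch, assuming a plain string (sample_text, a hypothetical stand-in for soup.text) so that it runs without BeautifulSoup:

```python
def by_size(items, size):
    """Keep only the items whose length is below `size`."""
    return [item for item in items if len(item) < size]


# Hypothetical stand-in for soup.text; the real value depends on the
# parsed document and will contain more lines than this.
sample_text = "The Dormouse's story\nElsie,\nLacie and\nTillie;\n..."

lines = sample_text.split("\n")  # one entry per line of text
words = sample_text.split()      # one entry per whitespace-separated token

short_lines = by_size(lines, 10)  # lines shorter than 10 characters
short_words = by_size(words, 4)   # words shorter than 4 characters

print(short_lines)
print(short_words)
```

Note that punctuation stays attached to the tokens here, so "Elsie," counts as six characters. Stripping punctuation first would be a separate design choice.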
RE: SoupStrainer: example - Truman - Sep-25-2018

A very BeautifulSoup. By the way, line 18 looks very pythonic. Is there any topic/page you know of that explains this "trick" more thoroughly?

RE: SoupStrainer: example - metulburr - Sep-26-2018

(Sep-25-2018, 11:29 PM)Truman Wrote: Is there any topic/page you know of that explains this "trick" more thoroughly?

There is a detailed web scraping tutorial on our forum by snippsat: https://python-forum.io/Thread-Web-Scraping-part-1

RE: SoupStrainer: example - ichabod801 - Sep-26-2018

Or were you asking about the list comprehension?

RE: SoupStrainer: example - Larz60+ - Sep-26-2018

Line 18 is a list comprehension; there are many videos and tutorials on that. I recommend one of the best tutorials, by David Beazley, here (I can't swear to it, but I think iterators (list comprehension is one) are covered): http://www.dabeaz.com/generators/index.html
This video, I think, covers it as well: https://www.youtube.com/watch?v=D1twn9kLmYg (if not, you will get a ton of other goodies).
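For readers with the same question: the list comprehension in by_size is shorthand for a loop that appends every matching item to a new list. A small sketch of the equivalence (the word list and variable names are made up for illustration):

```python
words = ["Once", "upon", "a", "time", "there", "were", "three"]
size = 5

# Explicit loop: build the result one item at a time.
short_loop = []
for word in words:
    if len(word) < size:
        short_loop.append(word)

# List comprehension: the same filter written as a single expression.
short_comp = [word for word in words if len(word) < size]

# Generator expression: the same filter, but evaluated lazily,
# which is handy when only an aggregate such as a count is needed.
short_count = sum(1 for word in words if len(word) < size)

print(short_loop == short_comp)
print(short_comp)
print(short_count)
```

The comprehension reads left to right as "give me word, for each word in words, if it is shorter than size", which is why it is often described as the more pythonic form for simple filters.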