Help Scraping links and table from link
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development, Thread: /thread-40869.html
Help Scraping links and table from link - cartonics - Oct-06-2023

    from bs4 import BeautifulSoup
    from bs4.dammit import EncodingDetector
    import requests

    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
    resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
    #print (soup)

    ##for link in soup.select('a[href^="/soccer/stats?"]'):
    ##    #print ('https://www.sbostats.com/soccer/stats?country=italy&league=serie-a&quote=1.44&direction=away&id=Mzk5OTk5MQ==')
    ##    href1 = ['href']
    ##    # a"e
    ##    c = ('https://www.sbostats.com'+link['href'])
    ##    x = c.replace('"e', "&quote")
    ##    print (x)

    data = []
    table = soup.find_all('table',attrs={'class':'updated_next_results_table'})
    #, print (table)
    rows = soup.find_all('tr')
    for row in rows:
        cols = row.find_all('td') #, attrs={'class':'widget-results__team-name match-name'}
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele]) # Get rid of empty values
    print (data)

I am able to get the links and the data, but what I expect from this link, https://www.sbostats.com/soccer/league/italy/serie-a, is, for each match, the team names from the table together with the corresponding link.

RE: Help Scraping links and table from link - cartonics - Oct-09-2023

Can no one help me? If I use attrs={'class':'widget-results__team-name match-name'} the result is empty: []

RE: Help Scraping links and table from link - snippsat - Oct-09-2023

Here is an example of how to print out the whole table.
You now also get the href back; if you need the full address, just concatenate it with https://www.sbostats.com.

    from bs4 import BeautifulSoup
    from bs4.dammit import EncodingDetector
    import requests

    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
    resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

    table = soup.find_all('table',attrs={'class':'updated_next_results_table'})
    table = table[0]
    tr = table.find_all('tr')
    for row in tr:
        if row.text is None:
            pass
        if row.find('a') is None:
            # Rows without a link (e.g. header rows) are skipped
            pass
        else:
            print(row.text)
            print(f"{row.find('a')['href']}\n")
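A side note on building the full address: plain concatenation works for the hrefs shown in this thread because they start with `/`. A minimal sketch using the standard library's `urllib.parse.urljoin`, which also copes with an href that is already absolute. `BASE_URL` and `full_url` are illustrative names, not part of the posted code.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.sbostats.com"  # site root taken from the thread


def full_url(href: str) -> str:
    """Resolve an href scraped from the page against the site root."""
    return urljoin(BASE_URL, href)


# A relative href like the ones in the table gets the site root prepended.
print(full_url("/soccer/stats?id=NDAxMTg3OA=="))
# An href that is already absolute is returned unchanged.
print(full_url("https://example.com/page"))
```

In the loop above this would replace the manual concatenation with `full_url(row.find('a')['href'])`.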
RE: Help Scraping links and table from link - cartonics - Oct-09-2023

Thanks so much, now I'll try to understand the code better ... If I want to remove some data, for example to go from

    Monza STATS Salernitana 1.73 3.80 4.50

to

    Monza - Salernitana

do I have to save the rows in a txt file and then edit it, or can it be done on the fly by removing that "td" of the table with BeautifulSoup? I also have to replace some text in the URL,

    x = c.replace('"e', "&quote")

but I can solve this later.

RE: Help Scraping links and table from link - snippsat - Oct-09-2023

(Oct-09-2023, 02:50 PM)cartonics Wrote: if i want to remove some data for example

Once you call row.text the result is just a string and BS has done its job. So if you want to change the output, you now have to use Python string methods or, e.g., regex.

    >>> tr[2].text
    ' Monza STATS Salernitana 1.73 3.80 4.50 '
    >>> tr[2].text.replace('STATS', '-').split()
    ['Monza', '-', 'Salernitana', '1.73', '3.80', '4.50']
    >>> ' '.join(row.text.replace('STATS', '-').split())
    'Fiorentina - Empoli 1.44 4.33 7.50'

Then the code will be:

    from bs4 import BeautifulSoup
    from bs4.dammit import EncodingDetector
    import requests

    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
    resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

    table = soup.find_all('table',attrs={'class':'updated_next_results_table'})
    table = table[0]
    tr = table.find_all('tr')
    for row in tr:
        if row.text is None:
            pass
        if row.find('a') is None:
            pass
        else:
            #print(row.text)
            # Replace the STATS button text with a dash and collapse whitespace
            print(' '.join(row.text.replace('STATS', '-').split()))
            print(f"{row.find('a')['href']}\n")
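The string-method and regex options mentioned above can be sketched as two small helpers. This is only an illustration: `clean_row_text` and `strip_odds` are hypothetical names, and the regex assumes the odds are always the trailing number tokens of the row. A team name that itself ends in a number would be cut as well.

```python
import re


def clean_row_text(text: str) -> str:
    """Replace the STATS button text with a dash and collapse all whitespace."""
    return ' '.join(text.replace('STATS', '-').split())


def strip_odds(cleaned: str) -> str:
    """Drop the trailing run of numbers (the odds) from a cleaned row."""
    return re.sub(r'(?:\s+\d+(?:\.\d+)?)+\s*$', '', cleaned)


row_text = ' Monza STATS Salernitana 1.73 3.80 4.50 '  # row text from the thread
cleaned = clean_row_text(row_text)
print(cleaned)
print(strip_odds(cleaned))
```

Unlike slicing the split list, the regex does not depend on each team name being a single word.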
RE: Help Scraping links and table from link - cartonics - Oct-10-2023

    from bs4 import BeautifulSoup
    from bs4.dammit import EncodingDetector
    import requests

    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
    resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

    table = soup.find_all('table',attrs={'class':'updated_next_results_table'})
    table = table[0]
    tr = table.find_all('tr')
    for row in tr:
        if row.text is None:
            pass
        if row.find('a') is None:
            pass
        else:
            #print(row.text)
            #print(' '.join(row.text.replace('STATS', '-').split()))
            #print(f"{row.find('a')['href']}\n")
            y = f"{row.find('a')['href']}\n"
            x = ' '.join(row.text.replace('STATS', '-').split())
            q = ''.join([i for i in x if not i.isdigit()])
            c = ('*https://www.sbostats.com' + y)
            z = c.replace('"e', "&quote")
            #print(x + z)
            f = open("matches.txt", "a")
            #f.write([x] +[y])
            f.write(str(q) + ' ' + str(z))
            f.close()

I edited the link for my needs. I only have to understand how to remove all the odds numbers. I tried this:

    q = ''.join([i for i in x if not i.isdigit()])

but in the output I still find the "." characters left over from the decimals, so I added

    q1 = ' '.join(q.replace('.', '').split())

It does the job, but I think it is a very dirty solution.. I think it is the worst solution :)

RE: Help Scraping links and table from link - snippsat - Oct-10-2023

Some tips: you should not open the file inside the loop, and the same goes for *https://www.sbostats.com, which is a value that doesn't change. So in an example like this I use with open, which closes the file object automatically.
    from bs4 import BeautifulSoup
    from bs4.dammit import EncodingDetector
    import requests

    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
    resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

    table = soup.find_all('table',attrs={'class':'updated_next_results_table'})
    table = table[0]
    tr = table.find_all('tr')
    base_url = '*https://www.sbostats.com'
    with open('matches.txt', 'a') as fp:
        for row in tr:
            if row.text is None:
                pass
            if row.find('a') is None:
                pass
            else:
                #print(' '.join(row.text.replace('STATS', '-').split()[:3]))
                #print(f"{base_url}{row.find('a')['href']}\n")
                # Keep only the first three tokens: home team, dash, away team
                fp.write(f"{' '.join(row.text.replace('STATS', '-').split()[:3])}\n")
                fp.write(f"{base_url}{row.find('a')['href']}\n\n")

I like the output better like this, but you can easily change it to have everything on one line as in your example.

Quote: i edited the link for my needs i have only to understand how to remove all numbers of odds

Also, I guess you have tested all of this inside the loop. This is how I tested a single value in the interactive interpreter before adding it to the loop.

    >>> tr[2]
    <tr>
      <td class="widget-results__team-details ovf updated_m130">
        <span class="widget-results__team-name match-name" data-original-title="Verona" data-placement="bottom" data-toggle="tooltip">Verona</span>
      </td>
      <td class="widget-results__score text-center limitstats">
        <a class="btn btn-primary btn-xs" href='/soccer/stats?country=italy&league=serie-a"e=1.50&direction=away&id=NDAxMTg3OA=='>STATS</a>
      </td>
      <td class="widget-results__team-details ovf updated_m130 text-right">
        <div class="row">
          <div class="col-sm-3"> </div>
          <div class="col-sm-9">
            <span class="widget-results__team-name match-name" data-original-title="Napoli" data-placement="bottom" data-toggle="tooltip"> Napoli </span>
          </div>
        </div>
      </td>
      <td class="widget-results__quote"> <span class="" style="">6.50</span> </td>
      <td class="widget-results__quote"> <span class="">4.00</span> </td>
      <td class="widget-results__quote"> <span class="match_fav" style="">1.50</span> </td>
    </tr>
    >>>
    >>> ' '.join(tr[2].text.replace('STATS', '-').split())
    'Verona - Napoli 6.50 4.00 1.50'
    >>> # Remove odds
    >>> ' '.join(tr[2].text.replace('STATS', '-').split()[:3])
    'Verona - Napoli'

RE: Help Scraping links and table from link - cartonics - Oct-10-2023

A stupid question... why, if the link in the page source contains serie-a&quote, does it become serie-a"e after scraping? Is it an encoding problem??

RE: Help Scraping links and table from link - snippsat - Oct-10-2023

(Oct-10-2023, 01:32 PM)cartonics Wrote: A stupid question... why if in the source code in the link there is

Yes, and the reason is your code 😉 Remove the encoding stuff you start with and use lxml as the parser, then the links will work.
    from bs4 import BeautifulSoup
    import requests

    parser = 'lxml'  # or 'html5lib', if installed
    resp = requests.get("https://www.sbostats.com/soccer/league/italy/serie-a")
    soup = BeautifulSoup(resp.content, parser)

    table = soup.find_all('table', attrs={'class':'updated_next_results_table'})
    table = table[0]
    tr = table.find_all('tr')
    base_url = '*https://www.sbostats.com'
    with open('matches.txt', 'a') as fp:
        for row in tr:
            if row.text is None:
                pass
            if row.find('a') is None:
                pass
            else:
                #print(' '.join(row.text.replace('STATS', '-').split()[:3]))
                #print(f"{base_url}{row.find('a')['href']}\n")
                fp.write(f"{' '.join(row.text.replace('STATS', '-').split()[:3])}\n")
                fp.write(f"{base_url}{row.find('a')['href']}\n\n")
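As a side note on where `"e` may come from: the query string contains `&quote=`, and `&quot` is also the name of the HTML entity for a double quote. A lenient entity decoder reads `&quot` followed by `e` as `"e`. The sketch below shows that effect with the standard library's `html.unescape`. It only illustrates the decoding behaviour and is not a trace of what BeautifulSoup or any particular parser does internally.

```python
from html import unescape

# Query-string fragment as it appears in the page source (unescaped ampersand).
raw_query = "league=serie-a&quote=1.44"

# Lenient decoding treats "&quot" as the double-quote entity, even without ";".
decoded = unescape(raw_query)
print(decoded)

# With the ampersand written as "&amp;" the parameter name survives decoding.
print(unescape("league=serie-a&amp;quote=1.44"))
```

This would explain why the `replace('"e', "&quote")` workaround was needed with one parser and not with another: parsers differ in how strictly they require the trailing semicolon.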
RE: Help Scraping links and table from link - cartonics - Oct-11-2023

Thank you so much for your help, and for explaining it in a way I can understand.. I am very new to Python, only a few days in, and it seems really promising... Now it does everything I needed.. but I have a "didactic" question. Starting from here:

    <tr>
      <td class="widget-results__team-details ovf updated_m130">
        <span class="widget-results__team-name match-name" data-original-title="Verona" data-placement="bottom" data-toggle="tooltip">Verona</span>
      </td>
      <td class="widget-results__score text-center limitstats">
        <a class="btn btn-primary btn-xs" href='/soccer/stats?country=italy&league=serie-a"e=1.50&direction=away&id=NDAxMTg3OA=='>STATS</a>
      </td>
      <td class="widget-results__team-details ovf updated_m130 text-right">
        <div class="row">
          <div class="col-sm-3"> </div>
          <div class="col-sm-9">
            <span class="widget-results__team-name match-name" data-original-title="Napoli" data-placement="bottom" data-toggle="tooltip"> Napoli </span>
          </div>
        </div>
      </td>
      <td class="widget-results__quote"> <span class="" style="">6.50</span> </td>
      <td class="widget-results__quote"> <span class="">4.00</span> </td>
      <td class="widget-results__quote"> <span class="match_fav" style="">1.50</span> </td>
    </tr>

my first idea was to take only the tags with the classes widget-results__team-name match-name and btn btn-primary btn-xs. Is there a way to achieve that?

Another question: if there is more than one table on the page, for example here, https://www.sbostats.com/soccer/league/italy/serie-c-group-c, is it possible to scrape only the second one?

[Image: img.png]

I think the trick can be done here:

    table = table[0]

but I always want the table after the words "PARTITE CONCLUSE", and that is not always table[0].
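Both questions can be approached with CSS selectors and `find_next`. The following is a rough sketch on a cut-down, partly invented sample rather than the live page. The `id` attributes, the "PROSSIME PARTITE" heading, the shortened href and the function names are assumptions made for illustration. It also assumes the "PARTITE CONCLUSE" text sits in an element that comes before its table in the document.

```python
import re
from bs4 import BeautifulSoup

# Simplified stand-in for the page; ids and the first heading are invented.
SAMPLE_HTML = """
<h3>PROSSIME PARTITE</h3>
<table id="next" class="updated_next_results_table">
  <tr>
    <td><span class="widget-results__team-name match-name">Verona</span></td>
    <td><a class="btn btn-primary btn-xs" href="/soccer/stats?id=NDAxMTg3OA==">STATS</a></td>
    <td><span class="widget-results__team-name match-name"> Napoli </span></td>
    <td class="widget-results__quote"><span>6.50</span></td>
  </tr>
</table>
<h3>PARTITE CONCLUSE</h3>
<table id="done"><tr><td>Monza</td><td>Salernitana</td></tr></table>
"""


def extract_matches(html):
    """Return (home, away, href) tuples using only the name spans and the link."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for row in soup.select("table.updated_next_results_table tr"):
        names = [span.get_text(strip=True)
                 for span in row.select("span.widget-results__team-name.match-name")]
        link = row.select_one("a.btn.btn-primary.btn-xs")
        if len(names) == 2 and link is not None:
            matches.append((names[0], names[1], link["href"]))
    return matches


def table_after_heading(html, heading_text):
    """Return the first table that follows the given heading text, or None."""
    soup = BeautifulSoup(html, "html.parser")
    marker = soup.find(string=re.compile(re.escape(heading_text)))
    return marker.find_next("table") if marker is not None else None


print(extract_matches(SAMPLE_HTML))
print(table_after_heading(SAMPLE_HTML, "PARTITE CONCLUSE"))
```

Selecting by heading text instead of `table[0]` keeps the code working when the number or order of tables on the page changes.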