Need help with Beautiful Soup - table
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development, Thread: /thread-14760.html
Need help with Beautiful Soup - table - jlkmb - Dec-15-2018

I am very much a newbie and I'm just trying to learn. Here is my code:

    import requests
    from bs4 import BeautifulSoup
    import csv

    url = 'http://www.cfbstats.com/2018/team/234/index.html'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.findAll("table",{"class":"team-schedule"})
    for row in table:
        tds = row.findAll('td')
        for td in tds:
            print(td.text)

The results come back, but with one value per line:

    09/03/18
    Virginia Tech
    L 3-24
    3:12
    75,237
    09/08/18
    Samford
    W 36-26
    3:51
    72,239
    09/15/18
    @ 17 Syracuse
    L 7-30
    3:37
    37,457
    09/22/18
    Northern Ill.
    W 37-19
    3:34
    65,633
    09/29/18
    @ Louisville
    W 28-24
    3:27
    52,798
    10/06/18
    @ Miami (Fla.)
    L 27-28
    4:01
    65,490
    10/20/18
    Wake Forest
    W 38-17
    3:34
    67,274
    10/27/18
    2 Clemson
    L 10-59
    3:47
    68,403
    11/03/18
    @ North Carolina St.
    L 28-47
    3:33
    57,600
    11/10/18
    @ 3 Notre Dame
    L 13-42
    3:22
    77,622
    11/17/18
    Boston College
    W 22-21
    3:31
    57,274
    11/24/18
    10 Florida
    L 14-41
    3:27
    71,953
    @ : Away, + : Neutral Site

My goal is to return the columns with date, opponent, and attendance (at least). The last row is immaterial and needs to be removed. It would also be good to learn how to create an additional column that says A if there is an @ in the opponent, N if there is a +, and H if there is neither. The date and opponent cells have classes in the table but attendance does not.

I appreciate any guidance. It's just a learning exercise.
RE: Need help with Beautiful Soup - table - Axel_Erfurt - Dec-15-2018

This works here:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup as bsoup

    url = 'http://www.cfbstats.com/2018/team/234/index.html'
    ofile = urlopen(url)
    soup = bsoup(ofile, "html.parser", from_encoding='utf-8')
    soup.prettify()

    table = soup.find("table", attrs={"class":"team-schedule"})

    # one list of text fields per table row (header row included)
    datasets = []
    mytable = table.find_all("tr")#[1:]
    for row in mytable:
        text = str(row.get_text()).split('\n')
        datasets.append(text)

    # _len - 1 leaves out the last row of the table
    _len = len(datasets)
    for x in range(_len -1):
        t = datasets[x]
        # fields 1, 2 and 5 are date, opponent and attendance
        print((t[1] + '\t' + t[2] + '\t' + t[5]).expandtabs(30))
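To see why the indexes 1, 2 and 5 pick out date, opponent and attendance, here is a minimal sketch that needs no network access. The string `row_text` is a hypothetical stand-in for what `row.get_text()` returns for one row; the exact whitespace on the real page is an assumption, chosen so that the indexes match the code above.

```python
# Hypothetical text of one schedule row, as row.get_text() might return it.
# The exact whitespace on the real page is an assumption.
row_text = "\n09/03/18\nVirginia Tech\nL 3-24\n3:12\n75,237\n"

fields = row_text.split('\n')
# The leading newline produces an empty string at index 0, so the date
# lands at index 1, the opponent at 2 and the attendance at 5.
date, opponent, attendance = fields[1], fields[2], fields[5]

# expandtabs(30) replaces each tab with spaces up to the next multiple of 30,
# which lines the three values up in columns.
line = (date + '\t' + opponent + '\t' + attendance).expandtabs(30)
print(fields)
print(line)
```

If the real row text has a different number of line breaks, the indexes would have to change accordingly, so printing `fields` for one row is a quick way to check.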
RE: Need help with Beautiful Soup - table - Larz60+ - Dec-15-2018

I did it a bit differently, same results:

    import requests
    from bs4 import BeautifulSoup
    import csv
    import os

    url = 'http://www.cfbstats.com/2018/team/234/index.html'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.findAll("table",{"class": "team-schedule"})[0]
    trs = table.find_all('tr')
    header = []

    for n, tr in enumerate(trs):
        if n == 0:
            # Get Header
            ths = tr.find_all('th')
            for th in ths:
                header.append(th.text.strip())
            for item in header:
                print('{:22}'.format(item), end='')
            print()
            continue
        else:
            game_item = []
            tds = tr.find_all('td')
            for td in tds:
                game_item.append(td.text.strip())
            for item in game_item:
                print('{:22}'.format(item), end='')
            print()
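The column layout in this version comes from the `'{:22}'` format spec combined with `end=''`. A small sketch of that idea, using hypothetical cell values copied from the first row of output in the question:

```python
# Hypothetical cell values, taken from the first row of output in the question.
cells = ['09/03/18', 'Virginia Tech', 'L 3-24', '3:12', '75,237']

# '{:22}' pads each string with spaces to a width of 22 characters
# (strings are left-aligned by default).
padded = ['{:22}'.format(cell) for cell in cells]

# Joining the padded cells gives the same line that repeated
# print(..., end='') calls followed by a bare print() would produce.
line = ''.join(padded)
print(line)
```

A cell longer than 22 characters would not be truncated, it would simply push the following columns to the right.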
RE: Need help with Beautiful Soup - table - jlkmb - Dec-16-2018

Axel - Very interesting. Thank you! Do you mind stepping through some questions/assumptions?

This creates a dataset from the table: it takes all rows in the table, splits the string after a space and creates a new line. The rows are then appended.

    datasets = []
    mytable = table.find_all("tr")#[1:]
    for row in mytable:
        text = str(row.get_text()).split('\n')
        datasets.append(text)

I'm having a real hard time following this one - and how did the headers get there?

    _len = len(datasets)
    for x in range(_len -1):
        t = datasets[x]
        print((t[1] + '\t' + t[2] + '\t' + t[5]).expandtabs(30))

I have learned some code for csv writer. Below is a sample.

    with open('test_cfbstats.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Date', 'Opponent'])
        writer.writerows(data)

How would you suggest modifying it for use in your code? I'm not sure if the writerow would be necessary, and would the writerows argument change to datasets?

    File "cfbstats_larz.py", line 9, in <module>
        soup = BeautifulSoup(page, 'html.parser')
    NameError: name 'page' is not defined

Is it something I did?

(Dec-15-2018, 11:10 PM)Larz60+ Wrote: I did it a bit differently, same results:

My apologies for combining the replies, I don't know what happened.

RE: Need help with Beautiful Soup - table - Larz60+ - Dec-16-2018

That's me. I had renamed it page for my testing and thought I had renamed everything, but I missed line 9, which should read:

    soup = BeautifulSoup(r.text, 'html.parser')

I also edited my original post.

RE: Need help with Beautiful Soup - table - Axel_Erfurt - Dec-16-2018

(Dec-16-2018, 05:02 PM)jlkmb Wrote: and how did the headers get there?
The column headings are within <tr> </tr>.

To save it as csv (change the delimiter to what you need):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup as bsoup
    import csv

    url = 'http://www.cfbstats.com/2018/team/234/index.html'
    ofile = urlopen(url)
    soup = bsoup(ofile, "html.parser", from_encoding='utf-8')
    soup.prettify()

    table = soup.find("table", attrs={"class":"team-schedule"})

    datasets = []
    mytable = table.find_all("tr")#[1:]
    for row in mytable:
        text = str(row.get_text()).split('\n')
        datasets.append(text)

    mypath = '/tmp/test_cfbstats.csv'
    # the csv module expects files to be opened with newline=''
    with open(mypath, 'w', newline='') as stream:
        writer = csv.writer(stream, delimiter='\t')
        _len = len(datasets)
        for x in range(_len -1):
            t = datasets[x]
            myrow = [t[1], t[2], t[5]]
            writer.writerow(myrow)

RE: Need help with Beautiful Soup - table - jlkmb - Dec-17-2018

Thanks Axel - Can you explain what the following code does?

    _len = len(datasets)
    for x in range(_len -1):
        t = datasets[x]
        myrow = [t[1], t[2], t[5]]

(Dec-16-2018, 06:50 PM)Axel_Erfurt Wrote: (Dec-16-2018, 05:02 PM)jlkmb Wrote: and how did the headers get there?

Thanks Larz. That makes sense. A few questions just to make sure I understand. I hadn't seen enumerate yet. Interesting. Where is n defined? How about item? Can you explain the print statement?

    trs = table.find_all('tr')
    header = []

    for n, tr in enumerate(trs):
        if n == 0:
            # Get Header
            ths = tr.find_all('th')
            for th in ths:
                header.append(th.text.strip())
            for item in header:
                print('{:22}'.format(item), end='')
            print()

How would I get the last line to not appear?

Larz - forgot to ask - do you implement csv the same way as Axel?

RE: Need help with Beautiful Soup - table - Larz60+ - Dec-17-2018

n is a name that I made up. It could be any name that you wish. The same goes for item, tr, th and trs. In some languages you have to declare variables before using them; in Python you can declare and use a variable in the same operation. enumerate returns the iteration number of the loop along with each item.
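The point about enumerate and loop variables can be tried without any scraping. In this sketch the list `rows` is a hypothetical stand-in for the `<tr>` elements, and `pairs` and `kind` are names made up for the illustration:

```python
# Hypothetical stand-ins for the <tr> rows of the table.
rows = ['header row', 'game 1', 'game 2']

pairs = []
for n, row in enumerate(rows):
    # n and row are created by the for statement itself;
    # enumerate supplies (0, 'header row'), (1, 'game 1'), (2, 'game 2').
    kind = 'header' if n == 0 else 'game'
    pairs.append((n, kind, row))
    print(n, kind, row)
```

Testing `n == 0` is what lets the loop treat the first row as the header and every later row as a game.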
Quote: How would I get the last line to not appear?

Not sure what you're asking here. If it's about the print() at the end, that just sends a newline; otherwise, if you had additional print statements, they would end up on the same line, one after the other (the end='' suppresses the newline).

RE: Need help with Beautiful Soup - table - Axel_Erfurt - Dec-17-2018

He means the last line in the table:

Quote: @ : Away, + : Neutral Site

That's why I used

    _len = len(datasets)
    for x in range(_len -1):

You do not need CSV writer for the csv file.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup as bsoup

    url = 'http://www.cfbstats.com/2018/team/234/index.html'
    ofile = urlopen(url)
    soup = bsoup(ofile, "html.parser", from_encoding='utf-8')
    soup.prettify()

    table = soup.find("table", attrs={"class":"team-schedule"})

    datasets = []
    mytable = table.find_all("tr")
    for row in mytable:
        text = str(row.get_text()).split('\n')
        datasets.append(text)

    mypath = '/tmp/test_cfbstats.csv'
    with open(mypath, 'w') as stream:
        _len = len(datasets)
        # _len - 1 skips the last row ("@ : Away, + : Neutral Site")
        for x in range(_len -1):
            t = datasets[x]
            myrow = [t[1], t[2], t[5]]
            t = "\t".join(myrow)
            print(t.expandtabs(22))
            stream.write(t + "\n")
    # no stream.close() needed: the with block closes the file

RE: Need help with Beautiful Soup - table - jlkmb - Dec-20-2018

Thanks guys. I appreciate your help!
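The original question also asked for an extra column that says A, N or H, which the replies did not cover. A minimal sketch of one way to do it, assuming the marker (@ or +) is the first character of the opponent text, as it is in the output posted in the question. The function name `venue` and the sample `datasets` rows are hypothetical, and an in-memory `io.StringIO` buffer stands in for the csv file so the example runs on its own:

```python
import csv
import io

def venue(opponent):
    """Return 'A' for away (@), 'N' for neutral site (+), otherwise 'H'."""
    opponent = opponent.strip()
    if opponent.startswith('@'):
        return 'A'
    if opponent.startswith('+'):
        return 'N'
    return 'H'

# Hypothetical rows in the shape produced by splitting each row's text on
# newlines; the last entry is the legend line from the bottom of the table.
datasets = [
    ['', '09/03/18', 'Virginia Tech', 'L 3-24', '3:12', '75,237', ''],
    ['', '09/15/18', '@ 17 Syracuse', 'L 7-30', '3:37', '37,457', ''],
    ['', '@ : Away, + : Neutral Site', ''],
]

buffer = io.StringIO()  # stand-in for open('test_cfbstats.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['Date', 'Opponent', 'Attendance', 'Venue'])
for t in datasets[:-1]:  # the slice drops the legend row
    writer.writerow([t[1], t[2], t[5], venue(t[2])])

print(buffer.getvalue())
```

Because the attendance values contain commas, csv.writer puts them in quotes automatically, which is one reason to prefer it over joining the fields by hand when the delimiter is a comma.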