Need help with Beautiful Soup - table

Need help with Beautiful Soup - table - jlkmb - Dec-15-2018

I am very much a newbie and I'm just trying to learn. Here is my code:

import requests
from bs4 import BeautifulSoup
import csv

url = 'http://www.cfbstats.com/2018/team/234/index.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

table = soup.findAll("table",{"class":"team-schedule"})

for row in table:
    tds = row.findAll('td')
for td in tds:
    print(td.text)

The results come back, but one cell per line rather than as rows.

09/03/18 Virginia Tech L 3-24 3:12 75,237 09/08/18 Samford W 36-26 3:51 72,239 09/15/18 @ 17 Syracuse L 7-30 3:37 37,457 09/22/18 Northern Ill. W 37-19 3:34 65,633 09/29/18 @ Louisville W 28-24 3:27 52,798 10/06/18 @ Miami (Fla.) L 27-28 4:01 65,490 10/20/18 Wake Forest W 38-17 3:34 67,274 10/27/18 2 Clemson L 10-59 3:47 68,403 11/03/18 @ North Carolina St. L 28-47 3:33 57,600 11/10/18 @ 3 Notre Dame L 13-42 3:22 77,622 11/17/18 Boston College W 22-21 3:31 57,274 11/24/18 10 Florida L 14-41 3:27 71,953 @ : Away, + : Neutral Site

My goal is to return the date, opponent, and attendance columns (at least). The last row is immaterial and needs to be removed. It would also be good to learn how to create an additional column that says A if there is an @ in the opponent, N if there is a +, and H if there is neither.

The date and opponent names have classes in the table but attendance does not.

Appreciate any guidance. It's just a learning exercise.

RE: Need help with Beautiful Soup - table - Axel_Erfurt - Dec-15-2018

This works here:

from urllib.request import urlopen
from bs4 import BeautifulSoup as bsoup

url = 'http://www.cfbstats.com/2018/team/234/index.html'

ofile = urlopen(url)
soup = bsoup(ofile, "html.parser", from_encoding='utf-8')

# the schedule is the table with class "team-schedule"
table = soup.find("table", attrs={"class":"team-schedule"})

datasets = []
mytable = table.find_all("tr")#[1:]
for row in mytable:
    # one list of cell texts per row, split where the row text has line breaks
    text = str(row.get_text()).split('\n')
    datasets.append(text)

_len = len(datasets)
# _len - 1 leaves out the last row (the "@ : Away, + : Neutral Site" legend)
for x in range(_len -1):
    t = datasets[x]
    # t[1] = date, t[2] = opponent, t[5] = attendance
    print((t[1] + '\t' + t[2] + '\t' + t[5]).expandtabs(30))

Output:
Date                          Opponent                      Attendance
09/03/18                      Virginia Tech                 75,237
09/08/18                      Samford                       72,239
09/15/18                      @ 17 Syracuse                 37,457
09/22/18                      Northern Ill.                 65,633
09/29/18                      @ Louisville                  52,798
10/06/18                      @ Miami (Fla.)                65,490
10/20/18                      Wake Forest                   67,274
10/27/18                      2 Clemson                     68,403
11/03/18                      @ North Carolina St.          57,600
11/10/18                      @ 3 Notre Dame                77,622
11/17/18                      Boston College                57,274
11/24/18                      10 Florida                    71,953

RE: Need help with Beautiful Soup - table - Larz60+ - Dec-15-2018

I did it a bit differently, same results:

import requests
from bs4 import BeautifulSoup
import csv
import os


url = 'http://www.cfbstats.com/2018/team/234/index.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
 
table = soup.findAll("table",{"class": "team-schedule"})[0]
trs = table.find_all('tr')
header = []
for n, tr in enumerate(trs):
    if n == 0:
        # Get Header
        ths = tr.find_all('th')
        for th in ths:
            header.append(th.text.strip())
        for item in header:
            print('{:22}'.format(item), end='')
        print()

        continue
    else:
        game_item = []
        tds = tr.find_all('td')
        for td in tds:
            game_item.append(td.text.strip())
    for item in game_item:
        print('{:22}'.format(item), end='')
    print()

Output:
Date                  Opponent              Result                Game Time             Attendance
09/03/18              Virginia Tech         L 3-24                3:12                  75,237
09/08/18              Samford               W 36-26               3:51                  72,239
09/15/18              @ 17 Syracuse         L 7-30                3:37                  37,457
09/22/18              Northern Ill.         W 37-19               3:34                  65,633
09/29/18              @ Louisville          W 28-24               3:27                  52,798
10/06/18              @ Miami (Fla.)        L 27-28               4:01                  65,490
10/20/18              Wake Forest           W 38-17               3:34                  67,274
10/27/18              2 Clemson             L 10-59               3:47                  68,403
11/03/18              @ North Carolina St.  L 28-47               3:33                  57,600
11/10/18              @ 3 Notre Dame        L 13-42               3:22                  77,622
11/17/18              Boston College        W 22-21               3:31                  57,274
11/24/18              10 Florida            L 14-41               3:27                  71,953
@ : Away, + : Neutral Site
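
The original post also asked for an extra column marking each game as home, away, or neutral, and the replies so far do not cover it. Here is a rough sketch of one way it might be done. It assumes the opponent text starts with @ for away games and + for neutral-site games, as the legend row says. The helper name venue_code and the '+ Example State' entry are made up for illustration.

```python
def venue_code(opponent):
    """Return 'A' (away), 'N' (neutral) or 'H' (home) for an opponent cell.

    Assumes the marker, if any, is the first non-blank character,
    as in '@ 17 Syracuse' from the output above.
    """
    text = opponent.strip()
    if text.startswith('@'):
        return 'A'
    if text.startswith('+'):
        return 'N'
    return 'H'


# Opponent cells copied from the output above; the '+' entry is hypothetical
opponents = ['Virginia Tech', '@ 17 Syracuse', '2 Clemson', '+ Example State']
for name in opponents:
    print('{:22}{}'.format(name, venue_code(name)))
```

Checking only the start of the text means a ranking such as the 17 in '@ 17 Syracuse' does not get in the way.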

RE: Need help with Beautiful Soup - table - jlkmb - Dec-16-2018

Axel - Very interesting. Thank you!

Do you mind stepping through some questions/assumptions?

This creates a dataset from a table that takes all rows in the table, splits the string after a space and creates a new line. The rows are then appended.

datasets = []
mytable = table.find_all("tr")#[1:]
for row in mytable:
    text = str(row.get_text()).split('\n')
    datasets.append(text)

I'm having a real hard time following this one - and how did the headers get there?

_len = len(datasets)
for x in range(_len -1):
    t = datasets[x]
    print((t[1] + '\t' + t[2] + '\t' + t[5]).expandtabs(30))

I have learned some code for csv writer. Below is a sample.

with open('test_cfbstats.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Date', 'Opponent'])
    writer.writerows(data)

How would you suggest modifying for use in your code? I'm not sure if the writerow would be necessary, and the writerows would change to datasets?

Larz - when I run your code I get this error:

File "cfbstats_larz.py", line 9, in <module>
soup = BeautifulSoup(page, 'html.parser')
NameError: name 'page' is not defined

Is it something I did?

My apologies for combining the replies, I don't know what happened.

RE: Need help with Beautiful Soup - table - Larz60+ - Dec-16-2018

That's my fault. I had renamed it page for my testing and thought I had renamed everything back, but I missed line 9, which should read:

soup = BeautifulSoup(r.text, 'html.parser')

I also edited my original post.

RE: Need help with Beautiful Soup - table - Axel_Erfurt - Dec-16-2018

(Dec-16-2018, 05:02 PM)jlkmb Wrote: and how did the headers get there?

The column headings are within <tr> </tr>
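
To see why the headings come along with the games, and why indices 1, 2 and 5 are the ones used, it may help to look at what split('\n') produces for a single row. The strings below are typed in by hand as a guess at what row.get_text() returns when each cell sits on its own line in the page source. They are an assumption, not fetched from the site.

```python
# Hand-written stand-ins for row.get_text() on two <tr> rows; the real text
# depends on how the page's HTML is laid out (assumed: one cell per line).
header_text = "\nDate\nOpponent\nResult\nGame Time\nAttendance\n"
game_text = "\n09/03/18\nVirginia Tech\nL 3-24\n3:12\n75,237\n"

header_cells = header_text.split('\n')
game_cells = game_text.split('\n')
print(header_cells)
print(game_cells)

# The leading and trailing newlines leave an empty string at each end,
# so the date is at index 1, the opponent at 2 and the attendance at 5.
date, opponent, attendance = game_cells[1], game_cells[2], game_cells[5]
print(date, opponent, attendance)
```

The heading row is split exactly like a game row, so it ends up as the first entry in datasets without any special handling.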

To save it as CSV (change the delimiter to whatever you need):

from urllib.request import urlopen
from bs4 import BeautifulSoup as bsoup
import csv

url = 'http://www.cfbstats.com/2018/team/234/index.html'

ofile = urlopen(url)
soup = bsoup(ofile, "html.parser", from_encoding='utf-8')

table = soup.find("table", attrs={"class":"team-schedule"})

datasets = []
mytable = table.find_all("tr")#[1:]
for row in mytable:
    text = str(row.get_text()).split('\n')
    datasets.append(text)

mypath = '/tmp/test_cfbstats.csv'
# newline='' lets the csv module control line endings itself
with open(mypath, 'w', newline='') as stream:
    writer = csv.writer(stream, delimiter='\t')
    _len = len(datasets)
    for x in range(_len -1):
        t = datasets[x]
        myrow = [t[1], t[2], t[5]]
        writer.writerow(myrow)
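
On the writerow/writerows question, here is a sketch of how the two calls could fit together: writerow writes a single row (handy for a header), writerows writes a whole list of rows at once. The rows list is sample data typed in from the output above, and the file name and temporary directory are stand-ins chosen only so the example runs anywhere.

```python
import csv
import os
import tempfile

# Sample rows in the shape [date, opponent, attendance]; in the script
# above these would be built from datasets (t[1], t[2], t[5]) instead.
rows = [
    ['09/03/18', 'Virginia Tech', '75,237'],
    ['09/08/18', 'Samford', '72,239'],
]

# Hypothetical output location, used only for this sketch
path = os.path.join(tempfile.gettempdir(), 'test_cfbstats_sketch.csv')

# newline='' is what the csv module documentation recommends for file objects
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Date', 'Opponent', 'Attendance'])  # one header row
    writer.writerows(rows)                               # all data rows

# Read the file back to check what was written
with open(path, newline='') as f:
    back = list(csv.reader(f))
print(back)
```

With datasets from the script above, the first row already holds the headings, so the separate writerow call is only needed if that row is skipped.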

RE: Need help with Beautiful Soup - table - jlkmb - Dec-17-2018

Thanks Axel - Can you explain what the following code does?

    _len = len(datasets)
    for x in range(_len -1):
        t = datasets[x]
        myrow = [t[1], t[2], t[5]]

Thanks Larz. That makes sense. A few questions just to make sure I understand.

I hadn't seen enumerate yet. Interesting. Where is n defined? How about item? Can you explain the print statement?

  
trs = table.find_all('tr')
header = []
for n, tr in enumerate(trs):
    if n == 0:
        # Get Header
        ths = tr.find_all('th')
        for th in ths:
            header.append(th.text.strip())
        for item in header:
            print('{:22}'.format(item), end='')
        print()

How would I get the last line to not appear?

Larz - forgot to ask - do you implement csv the same way as Axel?

RE: Need help with Beautiful Soup - table - Larz60+ - Dec-17-2018

n is a name that I made up. It could be any name you wish. The same goes for item, tr, th and trs.
In some languages you have to declare variables before using them; in Python you can declare and use them in the same operation.
enumerate returns the iteration number of the loop along with each item.
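
A small sketch of what enumerate hands back on each pass. The list of strings is a made-up stand-in for trs, just to keep the example short.

```python
# Stand-in for the list of <tr> rows; plain strings keep the example short
rows = ['header row', 'game 1', 'game 2']

seen = []
for n, row in enumerate(rows):
    # n counts from 0 and row is the matching item; both names are
    # created right here by the for statement
    seen.append((n, row))
    if n == 0:
        print('first pass:', row)
    else:
        print('pass', n, ':', row)
```

That is why the script can test n == 0 to treat the first row as the header.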

Quote:How would I get the last line to not appear?

Not sure what you're asking here. If it's about the print() at the end, that just sends a newline. Without it, any additional print statements would end up on the same line, one after the other (the end='' suppresses the newline).
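
To illustrate the print statement: '{:22}' pads each item to 22 characters, end='' keeps the next print on the same line, and the bare print() finishes the line. The helper format_row is a made-up name that builds the same line as a single string, and the row is sample data from the output above.

```python
def format_row(items, width=22):
    """Pad each item to `width` characters and join them into one line."""
    return ''.join('{:{w}}'.format(item, w=width) for item in items)


row = ['09/03/18', 'Virginia Tech', '75,237']

# Same pattern as the loop in the script: each print stays on the line ...
for item in row:
    print('{:22}'.format(item), end='')
print()  # ... until this bare print() ends it

# Building the string first prints the same line
print(format_row(row))
```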

RE: Need help with Beautiful Soup - table - Axel_Erfurt - Dec-17-2018

He means the last line in the table:

Quote:@ : Away, + : Neutral Site

That's why I used

_len = len(datasets)
for x in range(_len -1):
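
The same effect can be had with a slice, which may be easier to read. In this sketch datasets is a short hand-made list, and the shape of the legend row (a single cell) is an assumption about what split('\n') gives for it.

```python
# Hand-made stand-in for datasets: header, two games, then the legend row
datasets = [
    ['', 'Date', 'Opponent', 'Result', 'Game Time', 'Attendance', ''],
    ['', '09/03/18', 'Virginia Tech', 'L 3-24', '3:12', '75,237', ''],
    ['', '09/08/18', 'Samford', 'W 36-26', '3:51', '72,239', ''],
    ['', '@ : Away, + : Neutral Site', ''],
]

# range(len(datasets) - 1) stops one short of the end, skipping the legend
by_index = [datasets[x] for x in range(len(datasets) - 1)]

# A slice that leaves off the last item does the same thing
by_slice = datasets[:-1]

for t in by_slice:
    print((t[1] + '\t' + t[2] + '\t' + t[5]).expandtabs(30))
```

If the legend row really has only one cell, t[5] would raise an IndexError for it, which is another reason to leave it out.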

You do not need csv.writer to write the csv file.

from urllib.request import urlopen
from bs4 import BeautifulSoup as bsoup

url = 'http://www.cfbstats.com/2018/team/234/index.html'
ofile = urlopen(url)
soup = bsoup(ofile, "html.parser", from_encoding='utf-8')

table = soup.find("table", attrs={"class":"team-schedule"})

datasets = []
mytable = table.find_all("tr")
for row in mytable:
    text = str(row.get_text()).split('\n')
    datasets.append(text)

mypath = '/tmp/test_cfbstats.csv'
# the with block closes the file on exit, so no stream.close() is needed
with open(mypath, 'w') as stream:
    _len = len(datasets)
    for x in range(_len -1):
        t = datasets[x]
        myrow = [t[1], t[2], t[5]]
        t = "\t".join(myrow)
        print(t.expandtabs(22))
        stream.write(t + "\n")

RE: Need help with Beautiful Soup - table - jlkmb - Dec-20-2018

Thanks guys. I appreciate your help!