scraping multiple pages of a website. - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: scraping multiple pages of a website. (/thread-10820.html)
scraping multiple pages of a website. - Blue Dog - Jun-07-2018

Hello All,

I have a website that has 26 pages, starting with 'a' and ending with 'z'. This is the URL of the site: https://www.usa.gov/federal-agencies/a

I have a scraper that does what I want. I know to all of you Python kings it will look crude. What I need help with is how to scrape all 26 pages. I have been all over the net looking for how to do it; there is just not much out there. I have found a few ways of doing it, but none work. So here I am, hoping someone can help. Here is my code:

#Python 3.7
from html.parser import HTMLParser
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.usa.gov/federal-agencies/a')
first_page = r.text
soup = BeautifulSoup(first_page, 'html.parser')
page_soup = soup
#page_soup.h1
#page_soup.p
boxes = page_soup.find_all('ul', {'class' : 'one_column_bullet'})
boxes[0].text.strip()
print(boxes)

I tried all I could think of, mostly many for loops. Here is one that works a bit; it prints out the same page 26 times.

#Python 3.7
from html.parser import HTMLParser
import requests
from bs4 import BeautifulSoup
from string import ascii_lowercase

for letter in ascii_lowercase:
    r = requests.get('https://www.usa.gov/federal-agencies/' + letter + ' ')
    first_page = r.text
    soup = BeautifulSoup(first_page, 'html.parser')
    page_soup = soup.find('h1')
    print(page_soup)

So if someone knows how to use my code to scrape all 26 pages, let me know.

Thank you
renny


RE: scraping multiple pages of a website. - Larz60+ - Jun-08-2018

The pages have the same URL base, with the letter added to the end:

https://www.usa.gov/federal-agencies/a
https://www.usa.gov/federal-agencies/b
etc.

>>> baseurl = 'https://www.usa.gov/federal-agencies/'
>>> valid_pages = 'abcdefghijlmnoprstuvw'
>>> for n in range(len(valid_pages)):
...     url = f'{baseurl}{valid_pages[n]}'
...     print(url)
...
https://www.usa.gov/federal-agencies/a
https://www.usa.gov/federal-agencies/b
https://www.usa.gov/federal-agencies/c
https://www.usa.gov/federal-agencies/d
https://www.usa.gov/federal-agencies/e
https://www.usa.gov/federal-agencies/f
https://www.usa.gov/federal-agencies/g
https://www.usa.gov/federal-agencies/h
https://www.usa.gov/federal-agencies/i
https://www.usa.gov/federal-agencies/j
https://www.usa.gov/federal-agencies/l
https://www.usa.gov/federal-agencies/m
https://www.usa.gov/federal-agencies/n
https://www.usa.gov/federal-agencies/o
https://www.usa.gov/federal-agencies/p
https://www.usa.gov/federal-agencies/r
https://www.usa.gov/federal-agencies/s
https://www.usa.gov/federal-agencies/t
https://www.usa.gov/federal-agencies/u
https://www.usa.gov/federal-agencies/v
https://www.usa.gov/federal-agencies/w
>>>

So you can iterate over this. Pseudo code:

for char in valid_pages

Within each page, the following can be used as an anchor:

<ul class="az-list group">

After that, all links (regular <a> tags) up until the closing </ul> are what you need. So it seems pretty simple.
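Spelling that pseudo code out as a runnable sketch, not Larz60+'s actual code: the URL base, the letter list, and the az-list class name are taken from the post above, and whether the live page still uses that markup is an assumption.

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.usa.gov/federal-agencies/'
valid_pages = 'abcdefghijlmnoprstuvw'

for char in valid_pages:
    # Fetch this letter's page and parse the returned HTML
    response = requests.get(f'{baseurl}{char}')
    soup = BeautifulSoup(response.text, 'html.parser')
    # Anchor on the <ul class="az-list group"> element; a CSS selector
    # matches on the single class even if other classes are present
    ulist = soup.select_one('ul.az-list')
    if ulist is None:
        continue  # layout assumption did not hold for this page
    # Every regular <a> tag inside the <ul> is an agency link
    for link in ulist.find_all('a'):
        print(link.text.strip(), link.get('href'))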
RE: scraping multiple pages of a website. - Blue Dog - Jun-08-2018

You lost me; I will try to use it. Thank you.
renny

Well, I have been at this for about 14 hours today. I am going to hit the sack. This is what I got so far:

import requests
from bs4 import BeautifulSoup
from html.parser import HTMLParser

baseurl = requests.get('https://www.usa.gov/federal-agencies/')
valid_pages = 'abcdefghijlmnoprstuvw'

for n in range(len(valid_pages)):
    url = f'{baseurl}{valid_pages[n]}'
    print(url)
    page = soup = BeautifulSoup(url, 'html.parser')
    for page in soup.find_all('ul', {'class' : 'one_column_bullet'}):
        print(page)

This is what I get:

<Response [200]>a
<Response [200]>b
<Response [200]>c
<Response [200]>d
<Response [200]>e
<Response [200]>f
<Response [200]>g
<Response [200]>h
<Response [200]>i
<Response [200]>j
<Response [200]>l
<Response [200]>m
<Response [200]>n
<Response [200]>o
<Response [200]>p
<Response [200]>r
<Response [200]>s
<Response [200]>t
<Response [200]>u
<Response [200]>v
<Response [200]>w

I do get to all the pages, but soup does not work. I want to thank you, Larz60+, for your help. I will start back on it tomorrow.
renny


RE: scraping multiple pages of a website. - Larz60+ - Jun-08-2018

All the code that I showed does is create a url for each page; you still have to fetch each one with requests and extract the links. So instead of print, add your page scraping code.
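Why the output above looks like <Response [200]>a: baseurl is bound to a requests.Response object rather than a URL string, so the f-string just renders the object's repr next to the letter. A minimal sketch of the distinction, using only the names from the code above:

import requests

baseurl = requests.get('https://www.usa.gov/federal-agencies/')
print(f'{baseurl}a')   # -> <Response [200]>a, the Response's repr plus the letter
print(type(baseurl))   # -> <class 'requests.models.Response'>

# The base needs to stay a plain string; the HTML for each page then
# comes from a fresh requests.get() on the built URL:
baseurl = 'https://www.usa.gov/federal-agencies/'
html = requests.get(f'{baseurl}a').text   # this is what BeautifulSoup should parse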
RE: scraping multiple pages of a website. - Blue Dog - Jun-08-2018

Still up; tried some new stuff:

import requests
from bs4 import BeautifulSoup
from html.parser import HTMLParser

baseurl = requests.get('https://www.usa.gov/federal-agencies/')
valid_pages = 'abcdefghijlmnoprstuvw'

for n in range(len(valid_pages)):
    url = f'{baseurl}{valid_pages[n]}'
    print(url)
    pages = soup = BeautifulSoup(url, 'html.parser')
    print(pages.title)
    for page in pages.find_all('ul', {'class' : 'z-list group'}):
        print(page.a)
        print(page)

Here is the output; bs4 is not kicking in. I am just too beat to mess with it tonight.

<Response [200]>a
None
<Response [200]>b
None
<Response [200]>c
None
<Response [200]>d
None
<Response [200]>e
None
<Response [200]>f
None
<Response [200]>g
None
<Response [200]>h
None
<Response [200]>i
None
<Response [200]>j
None
<Response [200]>l
None
<Response [200]>m
None
<Response [200]>n
None
<Response [200]>o
None
<Response [200]>p
None

I just had a brain fart; maybe my for loop is not working. Tomorrow is another day.


RE: scraping multiple pages of a website. - Larz60+ - Jun-08-2018

Ok, couldn't resist writing this one.

This code can be run by itself, or imported into another module. Once run, all that's needed in a class that wants to use the index is to load the json file into a dictionary (see testit).

Create a project directory and src directory:

mkdir FederalAgencies
cd FederalAgencies
mkdir src

Add a module __init__.py to the FederalAgencies directory, giving this layout:

FederalAgencies/
    __init__.py
    src/
        __init__.py
        BuildFederalAgencyIndex.py
        FederalPaths.py

Add to the src directory:
1. Create an empty __init__.py file.

Save this in the src directory as FederalPaths.py:

from pathlib import Path
import os


class FederalPaths:
    def __init__(self):
        # Make sure start path is properly set
        self.set_starting_dir()
        self.homepath = Path('.')
        self.rootpath = self.homepath / '..'
        self.datapath = self.rootpath / 'data'
        self.datapath.mkdir(exist_ok=True)
        self.outpath = self.datapath / 'json'
        self.outpath.mkdir(exist_ok=True)
        self.gov_urlbase = 'https://www.usa.gov/'
        self.baseurl = 'https://www.usa.gov/federal-agencies/'
        self.valid_pages = 'abcdefghijlmnoprstuvw'
        self.fed_index_file = self.outpath / 'FedIndex.json'

    def set_starting_dir(self):
        path = Path(__file__).resolve()
        path, file = os.path.split(path)
        path = os.path.abspath(path)
        os.chdir(path)


def testit():
    FederalPaths()


if __name__ == '__main__':
    testit()

Save this one in the src directory as BuildFederalAgencyIndex.py:

import FederalPaths
import requests
from bs4 import BeautifulSoup
import sys
import json


class BuildFederalAgencyIndex:
    def __init__(self):
        self.fpath = FederalPaths.FederalPaths()
        self.fed_index = {}
        self.valid_pages = 'abcdefghijlmnoprstuvw'
        self.build_index()

    def build_index(self):
        for n in range(len(self.valid_pages)):
            alpha = self.valid_pages[n]
            URL = f'{self.fpath.baseurl}{alpha}'
            self.fed_index[alpha] = {}
            try:
                response = requests.get(URL)
                soup = BeautifulSoup(response.content, 'lxml')
                ulist = soup.find('ul', {"class": "one_column_bullet"})
                links = ulist.find_all('a')
                for link in links:
                    suffix = link.get('href')
                    href = f'{self.fpath.gov_urlbase}{suffix}'
                    self.fed_index[alpha][link.text] = href
            except:
                print(f'error: {sys.exc_info()[0]}')
        with self.fpath.fed_index_file.open('w') as jout:
            json.dump(self.fed_index, jout)


def testit():
    # Create json file
    fa = BuildFederalAgencyIndex()
    # test json file
    with fa.fpath.fed_index_file.open() as fp:
        fed_index = json.load(fp)
    # Show all entries for 'c'
    for name, url in fed_index['c'].items():
        print(f'name: {name}, url: {url}')
    # Individual entry:
    print(f"\nIndividual entry url for Court of Appeals for Veterans Claims: {fed_index['c']['Court of Appeals for Veterans Claims']}")


if __name__ == '__main__':
    testit()

Test run:

cd FederalAgencies/src
python BuildFederalAgencyIndex.py

This will create the json file (in the data/json directory) and print out all the 'c' indexes; the directories will be created the first time it is run.
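Larz60+'s point that another module only needs to load the json file into a dictionary might look like the minimal sketch below; it assumes the directory layout above and that BuildFederalAgencyIndex.py has already been run from src.

import json
from pathlib import Path

# Path follows the layout above, relative to the src directory
fed_index_file = Path('..') / 'data' / 'json' / 'FedIndex.json'

with fed_index_file.open() as fp:
    fed_index = json.load(fp)   # dict: letter -> {agency name: url}

# Look up every agency filed under 'a'
for name, url in fed_index['a'].items():
    print(name, url)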
RE: scraping multiple pages of a website. - buran - Jun-08-2018

Larz60+ has done a wonderful job writing this for you, but I think it's too complicated for something that can be done with a couple of lines (i.e. the OOP, etc. is overkill).

First of all - your code. The problem is on line #11. Here it is with some small changes:

import requests
from bs4 import BeautifulSoup
from string import ascii_lowercase

base_url = 'https://www.usa.gov/federal-agencies/'
for letter in ascii_lowercase:
    url = '{}{}'.format(base_url, letter)
    print(url)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    for ul in soup.find_all('ul', {'class' : 'one_column_bullet'}):
        print(ul)

Now the nice part. If you inspect the page and what it loads, you will notice that it gets all the information as json, so:

import requests

url = "https://www.usa.gov/ajax/federal-agencies/autocomplete"
resp = requests.get(url)
print(resp.json())

If you want, you can save the json response as a file. Anyway, you get all 548 agencies and their respective urls in one get request as a json file.


RE: scraping multiple pages of a website. - Larz60+ - Jun-08-2018

Buran,

With all respect for your concise response: thanks, great way to start my day!


RE: scraping multiple pages of a website. - buran - Jun-08-2018

(Jun-08-2018, 01:55 PM)Larz60+ Wrote: Thanks, great way to start my day!

Sorry for any misstep :-)


RE: scraping multiple pages of a website. - Larz60+ - Jun-08-2018

Not a problem. It was like I was building a bridge over a body of water that already had stepping stones in place.
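Picking up buran's note about saving the json response as a file, a minimal sketch; it assumes the autocomplete endpoint above still returns json, and agencies.json is just a hypothetical file name.

import json

import requests

url = "https://www.usa.gov/ajax/federal-agencies/autocomplete"
resp = requests.get(url)

# Write the parsed json back out with readable indentation
with open('agencies.json', 'w') as fp:
    json.dump(resp.json(), fp, indent=2)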