scraping multiple pages of a website. - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: scraping multiple pages of a website. (/thread-10820.html)
scraping multiple pages of a website. - Blue Dog - Jun-07-2018

Hello All,

I have a website that has 26 pages, starting with 'a' and ending with 'z'. This is the URL of the site: https://www.usa.gov/federal-agencies/a

I have a scraper that does what I want. I know to all of you Python kings it will look crude. What I need help with is how to scrape all 26 pages. I have been all over the net looking for how to do it; there is just not much out there. I have found a few ways of doing it, but none work. So here I am, hoping someone can help. Here is my code:

#Python 3.7
from html.parser import HTMLParser
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.usa.gov/federal-agencies/a')
first_page = r.text
soup = BeautifulSoup(first_page, 'html.parser')
page_soup = soup
#page_soup.h1
#page_soup.p
boxes = page_soup.find_all('ul', {'class' : 'one_column_bullet'})
boxes[0].text.strip()
print(boxes)

I tried all I could think of, mostly many for loops. Here is one that works a bit; it prints out the same page 26 times.

#Python 3.7
from html.parser import HTMLParser
import requests
from bs4 import BeautifulSoup
from string import ascii_lowercase

for letter in ascii_lowercase:
    r = requests.get('https://www.usa.gov/federal-agencies/' + letter + ' ')
    first_page = r.text
    soup = BeautifulSoup(first_page, 'html.parser')
    page_soup = soup.find('h1')
    print(page_soup)

So if someone knows how to use my code to scrape all 26 pages, let me know.

Thank you
renny


RE: scraping multiple pages of a website. - Larz60+ - Jun-08-2018

The pages have the same URL base, with the letter added to the end:

https://www.usa.gov/federal-agencies/a
https://www.usa.gov/federal-agencies/b
etc.

>>> baseurl = 'https://www.usa.gov/federal-agencies/'
>>> valid_pages = 'abcdefghijlmnoprstuvw'
>>> for n in range(len(valid_pages)):
...     url = f'{baseurl}{valid_pages[n]}'
...     print(url)
...
https://www.usa.gov/federal-agencies/a
https://www.usa.gov/federal-agencies/b
https://www.usa.gov/federal-agencies/c
https://www.usa.gov/federal-agencies/d
https://www.usa.gov/federal-agencies/e
https://www.usa.gov/federal-agencies/f
https://www.usa.gov/federal-agencies/g
https://www.usa.gov/federal-agencies/h
https://www.usa.gov/federal-agencies/i
https://www.usa.gov/federal-agencies/j
https://www.usa.gov/federal-agencies/l
https://www.usa.gov/federal-agencies/m
https://www.usa.gov/federal-agencies/n
https://www.usa.gov/federal-agencies/o
https://www.usa.gov/federal-agencies/p
https://www.usa.gov/federal-agencies/r
https://www.usa.gov/federal-agencies/s
https://www.usa.gov/federal-agencies/t
https://www.usa.gov/federal-agencies/u
https://www.usa.gov/federal-agencies/v
https://www.usa.gov/federal-agencies/w
>>>

So you can iterate over this. Pseudo code:

for char in valid_pages

Within each page, the following can be used as an anchor:

<ul class="az-list group">

After that, all links (regular <a> tags) up until the closing </ul> are what you need. So it seems pretty simple.
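Spelling that pseudo code out as a runnable sketch, not Larz60+'s actual code: the URL base, the letter list, and the az-list class name are taken from the post above, and whether the live page still uses that markup is an assumption.

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.usa.gov/federal-agencies/'
valid_pages = 'abcdefghijlmnoprstuvw'

for char in valid_pages:
    # Fetch this letter's page and parse the returned HTML
    response = requests.get(f'{baseurl}{char}')
    soup = BeautifulSoup(response.text, 'html.parser')
    # Anchor on the <ul class="az-list group"> element; a CSS selector
    # matches on the single class even if other classes are present
    ulist = soup.select_one('ul.az-list')
    if ulist is None:
        continue  # layout assumption did not hold for this page
    # Every regular <a> tag inside the <ul> is an agency link
    for link in ulist.find_all('a'):
        print(link.text.strip(), link.get('href'))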
RE: scraping multiple pages of a website. - Blue Dog - Jun-08-2018

You lost me; I will try to use it. Thank you.
renny

Well, I have been at this for about 14 hours today. I am going to hit the sack. This is what I got so far:

import requests
from bs4 import BeautifulSoup
from html.parser import HTMLParser

baseurl = requests.get('https://www.usa.gov/federal-agencies/')
valid_pages = 'abcdefghijlmnoprstuvw'

for n in range(len(valid_pages)):
    url = f'{baseurl}{valid_pages[n]}'
    print(url)
    page = soup = BeautifulSoup(url, 'html.parser')
    for page in soup.find_all('ul', {'class' : 'one_column_bullet'}):
        print(page)

This is what I get:

<Response [200]>a
<Response [200]>b
<Response [200]>c
<Response [200]>d
<Response [200]>e
<Response [200]>f
<Response [200]>g
<Response [200]>h
<Response [200]>i
<Response [200]>j
<Response [200]>l
<Response [200]>m
<Response [200]>n
<Response [200]>o
<Response [200]>p
<Response [200]>r
<Response [200]>s
<Response [200]>t
<Response [200]>u
<Response [200]>v
<Response [200]>w

I do get to all the pages, but soup does not work. I want to thank you, Larz60+, for your help. I will start back on it tomorrow.
renny


RE: scraping multiple pages of a website. - Larz60+ - Jun-08-2018

All the code that I showed does is create a url for each page; you still have to fetch each one with requests and extract the links. So instead of print, add your page scraping code.
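Why the output above looks like <Response [200]>a: baseurl is bound to a requests.Response object rather than a URL string, so the f-string just renders the object's repr next to the letter. A minimal sketch of the distinction, using only the names from the code above:

import requests

baseurl = requests.get('https://www.usa.gov/federal-agencies/')
print(f'{baseurl}a')   # -> <Response [200]>a, the Response's repr plus the letter
print(type(baseurl))   # -> <class 'requests.models.Response'>

# The base needs to stay a plain string; the HTML for each page then
# comes from a fresh requests.get() on the built URL:
baseurl = 'https://www.usa.gov/federal-agencies/'
html = requests.get(f'{baseurl}a').text   # this is what BeautifulSoup should parse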
RE: scraping multiple pages of a website. - Blue Dog - Jun-08-2018

Still up; tried some new stuff:

import requests
from bs4 import BeautifulSoup
from html.parser import HTMLParser

baseurl = requests.get('https://www.usa.gov/federal-agencies/')
valid_pages = 'abcdefghijlmnoprstuvw'

for n in range(len(valid_pages)):
    url = f'{baseurl}{valid_pages[n]}'
    print(url)
    pages = soup = BeautifulSoup(url, 'html.parser')
    print(pages.title)
    for page in pages.find_all('ul', {'class' : 'z-list group'}):
        print(page.a)
        print(page)

Here is the output; bs4 is not kicking in. I am just too beat to mess with it tonight.

<Response [200]>a
None
<Response [200]>b
None
<Response [200]>c
None
<Response [200]>d
None
<Response [200]>e
None
<Response [200]>f
None
<Response [200]>g
None
<Response [200]>h
None
<Response [200]>i
None
<Response [200]>j
None
<Response [200]>l
None
<Response [200]>m
None
<Response [200]>n
None
<Response [200]>o
None
<Response [200]>p
None

I just had a brain fart; maybe my for loop is not working. Tomorrow is another day.


RE: scraping multiple pages of a website. - Larz60+ - Jun-08-2018

Ok, couldn't resist writing this one.

This code can be run by itself, or imported into another module. Once run, all that's needed in a class that wants to use the index is to load the json file into a dictionary (see testit).

Create a project directory and src directory:

mkdir FederalAgencies
cd FederalAgencies
mkdir src

Add a module __init__.py to the FederalAgencies directory, giving this layout:

FederalAgencies/
    __init__.py
    src/
        __init__.py
        BuildFederalAgencyIndex.py
        FederalPaths.py

Add to the src directory:
1. Create an empty __init__.py file.

Save this in the src directory as FederalPaths.py:

from pathlib import Path
import os


class FederalPaths:
    def __init__(self):
        # Make sure start path is properly set
        self.set_starting_dir()
        self.homepath = Path('.')
        self.rootpath = self.homepath / '..'
        self.datapath = self.rootpath / 'data'
        self.datapath.mkdir(exist_ok=True)
        self.outpath = self.datapath / 'json'
        self.outpath.mkdir(exist_ok=True)
        self.gov_urlbase = 'https://www.usa.gov/'
        self.baseurl = 'https://www.usa.gov/federal-agencies/'
        self.valid_pages = 'abcdefghijlmnoprstuvw'
        self.fed_index_file = self.outpath / 'FedIndex.json'

    def set_starting_dir(self):
        path = Path(__file__).resolve()
        path, file = os.path.split(path)
        path = os.path.abspath(path)
        os.chdir(path)


def testit():
    FederalPaths()


if __name__ == '__main__':
    testit()

Save this one in the src directory as BuildFederalAgencyIndex.py:

import FederalPaths
import requests
from bs4 import BeautifulSoup
import sys
import json


class BuildFederalAgencyIndex:
    def __init__(self):
        self.fpath = FederalPaths.FederalPaths()
        self.fed_index = {}
        self.valid_pages = 'abcdefghijlmnoprstuvw'
        self.build_index()

    def build_index(self):
        for n in range(len(self.valid_pages)):
            alpha = self.valid_pages[n]
            URL = f'{self.fpath.baseurl}{alpha}'
            self.fed_index[alpha] = {}
            try:
                response = requests.get(URL)
                soup = BeautifulSoup(response.content, 'lxml')
                ulist = soup.find('ul', {"class": "one_column_bullet"})
                links = ulist.find_all('a')
                for link in links:
                    suffix = link.get('href')
                    href = f'{self.fpath.gov_urlbase}{suffix}'
                    self.fed_index[alpha][link.text] = href
            except:
                print(f'error: {sys.exc_info()[0]}')
        with self.fpath.fed_index_file.open('w') as jout:
            json.dump(self.fed_index, jout)


def testit():
    # Create json file
    fa = BuildFederalAgencyIndex()
    # test json file
    with fa.fpath.fed_index_file.open() as fp:
        fed_index = json.load(fp)
    # Show all entries for 'c'
    for name, url in fed_index['c'].items():
        print(f'name: {name}, url: {url}')
    # Individual entry:
    print(f"\nIndividual entry url for Court of Appeals for Veterans Claims: {fed_index['c']['Court of Appeals for Veterans Claims']}")


if __name__ == '__main__':
    testit()

Test run:

cd FederalAgencies/src
python BuildFederalAgencyIndex.py

This will create the json file (in the data/json directory) and print out all the 'c' indexes; the directories will be created the first time it is run.
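Larz60+'s point that another module only needs to load the json file into a dictionary might look like the minimal sketch below; it assumes the directory layout above and that BuildFederalAgencyIndex.py has already been run from src.

import json
from pathlib import Path

# Path follows the layout above, relative to the src directory
fed_index_file = Path('..') / 'data' / 'json' / 'FedIndex.json'

with fed_index_file.open() as fp:
    fed_index = json.load(fp)   # dict: letter -> {agency name: url}

# Look up every agency filed under 'a'
for name, url in fed_index['a'].items():
    print(name, url)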
RE: scraping multiple pages of a website. - buran - Jun-08-2018

Larz60+ has done a wonderful job writing this for you, but I think it's too complicated for something that can be done with a couple of lines (i.e. the OOP, etc. is overkill).

First of all - your code. The problem is on line #11. Here it is with some small changes:

import requests
from bs4 import BeautifulSoup
from string import ascii_lowercase

base_url = 'https://www.usa.gov/federal-agencies/'
for letter in ascii_lowercase:
    url = '{}{}'.format(base_url, letter)
    print(url)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    for ul in soup.find_all('ul', {'class' : 'one_column_bullet'}):
        print(ul)

Now the nice part. If you inspect the page and what it loads, you will notice that it gets all the information as json, so:

import requests

url = "https://www.usa.gov/ajax/federal-agencies/autocomplete"
resp = requests.get(url)
print(resp.json())

If you want, you can save the json response as a file. Anyway, you get all 548 agencies and their respective urls in one get request as a json file.


RE: scraping multiple pages of a website. - Larz60+ - Jun-08-2018

Buran,

With all respect for your concise response: thanks, great way to start my day!


RE: scraping multiple pages of a website. - buran - Jun-08-2018

(Jun-08-2018, 01:55 PM)Larz60+ Wrote: Thanks, great way to start my day!

Sorry for any misstep :-)


RE: scraping multiple pages of a website. - Larz60+ - Jun-08-2018

Not a problem. It was like I was building a bridge over a body of water that already had stepping stones in place.
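Picking up buran's note about saving the json response as a file, a minimal sketch; it assumes the autocomplete endpoint above still returns json, and agencies.json is just a hypothetical file name.

import json

import requests

url = "https://www.usa.gov/ajax/federal-agencies/autocomplete"
resp = requests.get(url)

# Write the parsed json back out with readable indentation
with open('agencies.json', 'w') as fp:
    json.dump(resp.json(), fp, indent=2)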